CN117874235B

CN117874235B - Data processing system for acquiring disease name identification of electronic medical record text

Info

Publication number: CN117874235B
Application number: CN202410108326.XA
Authority: CN
Inventors: 王志鹏; 王军江
Original assignee: Qidian Zhibao Beijing Technology Co ltd
Current assignee: Qidian Zhibao Beijing Technology Co ltd
Priority date: 2024-01-25
Filing date: 2024-01-25
Publication date: 2024-06-21
Anticipated expiration: 2044-01-25
Also published as: CN117874235A

Abstract

The invention relates to the field of data processing, and in particular to a data processing system for acquiring a disease name identifier of an electronic medical record text. When the computer program is executed by the processor, the following steps are implemented: a first candidate disease name list is screened out based on the similarity between the word vectors of the initial keywords and the word vectors of the primary keywords corresponding to the primary disease names, and is de-duplicated to obtain a first intermediate disease name list; a first disease name identifier list is then obtained by combining the similarity between the initial keywords and the primary keywords with the occurrence frequency of the initial keywords in the medical record text, which improves the accuracy of the first disease name identifiers; finally, the word vectors of the initial keywords, the first disease name identifier list, the first secondary disease name information list and the second secondary disease name information list are combined to obtain the target medical record name identifier corresponding to the electronic medical record text, which improves the accuracy of the target medical record name identifier.

Description

Data processing system for acquiring disease name identification of electronic medical record text

Technical Field

The invention relates to the field of data processing, in particular to a data processing system for acquiring disease name identification of an electronic medical record text.

Background

DRG (Diagnosis Related Groups) is a classification and coding standard designed for prospective-payment medical insurance systems. Patients are classified into a number of diagnosis related groups according to factors such as age, sex, length of hospital stay, clinical diagnosis, symptoms, operations, disease severity, complications and outcome; costs are then measured and calculated for each group and a fixed prepayment quota is assigned. DRG can therefore help hospitals improve lean operation management and performance management, and has broad application prospects in the medical field.

In the medical field, the DRG catalog can be divided into three levels: main orders, sub-orders and details. In existing approaches, the main order, sub-order and detail corresponding to a patient's medical record text are each obtained from the similarity between the medical record text and the names of the main orders, sub-orders and details, and the detailed disease name identifier is then obtained by combining them. Because the DRG catalog contains many main orders, each main order contains many sub-orders, and each sub-order contains many details, disease name identifiers are highly complex, and existing methods that rely only on the similarity between the medical record text and the main order, sub-order and detail names obtain the disease name identifier with low accuracy.

Therefore, how to improve the accuracy of acquiring the disease name identifier of an electronic medical record text has become an urgent problem to be solved.

Disclosure of Invention

In view of the above technical problem, the technical solution adopted by the invention is a data processing system for acquiring a disease name identifier of an electronic medical record text. The system comprises a processor and a memory storing a computer program. The memory also stores an initial keyword vector set $A^0 = \{A^0_1, A^0_2, \ldots, A^0_i, \ldots, A^0_m\}$ of the electronic medical record text, a primary disease name information list $L^0$, a first secondary disease name information list $L^1$ and a second secondary disease name information list $L^2$, where $A^0_i$ is the word vector of the $i$-th initial keyword of the electronic medical record text, $i = 1, 2, \ldots, m$, and $m$ is the total number of initial keywords of the electronic medical record text. $L^0$ comprises $n$ preset primary disease names, the primary disease name identifier corresponding to each primary disease name, and a primary disease name vector set $B^0 = \{B^0_1, B^0_2, \ldots, B^0_j, \ldots, B^0_n\}$, where $B^0_j$ is the word vector of the primary keyword corresponding to the $j$-th primary disease name, $j = 1, 2, \ldots, n$. When the computer program is executed by the processor, the following steps are implemented:

S100, according to $A^0$ and $B^0$, obtain the primary disease name similarity set $C^0 = \{C^0_1, C^0_2, \ldots, C^0_i, \ldots, C^0_m\}$ corresponding to $A^0$, where $C^0_i = \{C^0_{i1}, C^0_{i2}, \ldots, C^0_{ij}, \ldots, C^0_{in}\}$ and $C^0_{ij}$ is the similarity between $A^0_i$ and $B^0_j$.

S200, take the primary disease name corresponding to each $C^0_{ij}$ that satisfies $C^0_{ij} > \Delta C^0$ as a first candidate disease name, and obtain a first candidate disease name list $D^0$, where $\Delta C^0$ is a first preset threshold.

S300, de-duplicate $D^0$ to obtain a first intermediate disease name list $D^1 = \{D^1_1, D^1_2, \ldots, D^1_k, \ldots, D^1_t\}$, where $D^1_k$ is the $k$-th first intermediate disease name, $k = 1, 2, \ldots, t$, and $t$ is the total number of first intermediate disease names.

S400, according to $C^0$ and $D^1$, obtain the primary disease name intermediate similarity set $C^1 = \{C^1_1, C^1_2, \ldots, C^1_k, \ldots, C^1_t\}$ corresponding to $D^1$, where $C^1_k = \{C^1_{k1}, C^1_{k2}, \ldots, C^1_{kx}, \ldots, C^1_{kr(k)}\}$, $C^1_{kx}$ is the $x$-th primary disease name similarity in $C^0$ that corresponds to the primary disease name of $D^1_k$ and is greater than $\Delta C^0$, $x = 1, 2, \ldots, r(k)$, and $r(k)$ is the total number of primary disease name similarities in $C^0$ that correspond to the primary disease name of $D^1_k$ and are greater than $\Delta C^0$.

S500, according to $C^1$ and $A^0$, obtain the first occurrence frequency set $E^1 = \{E^1_1, E^1_2, \ldots, E^1_k, \ldots, E^1_t\}$ corresponding to $D^1$, where $E^1_k = \{E^1_{k1}, E^1_{k2}, \ldots, E^1_{kx}, \ldots, E^1_{kr(k)}\}$, $E^1_{kx} = Q^1_{kx} / m$, and $Q^1_{kx}$ is the number of times the word vector of the initial keyword corresponding to $C^1_{kx}$ occurs in $A^0$.

S600, according to $C^1$ and $E^1$, obtain the first selection probability set $S^1 = \{S^1_1, S^1_2, \ldots, S^1_k, \ldots, S^1_t\}$ corresponding to $D^1$, where the first selection probability corresponding to $D^1_k$ is $S^1_k = \frac{1}{r(k)} \sum_{x=1}^{r(k)} E^1_{kx} \cdot C^1_{kx}$.

S700, take the primary disease name identifier of the first intermediate disease name corresponding to each $S^1_k$ that satisfies $S^1_k > \Delta S^1$ as a first disease name identifier, and obtain a first disease name identifier list $W^1$, where $\Delta S^1$ is a second preset threshold.

S800, according to $A^0$, $W^1$, $L^1$ and $L^2$, obtain the target medical record name identifier corresponding to the electronic medical record text.

Compared with the prior art, the data processing system for acquiring a disease name identifier of an electronic medical record text provided by the invention offers considerable technical progress and practicality, has broad industrial application value, and has at least the following beneficial effects. A first candidate disease name list is screened out based on the similarity between the word vectors of the initial keywords and the word vectors of the primary keywords corresponding to the primary disease names, and is de-duplicated to obtain a first intermediate disease name list; the primary disease name intermediate similarity set and the first occurrence frequency set corresponding to the first intermediate disease name list are then obtained. The probability that each first intermediate disease name is selected as a first disease name identifier is thus characterized by combining the similarity between the initial keywords and the primary keywords with the occurrence frequency of the initial keywords in the medical record text, and the first disease name identifier list is finally obtained, which improves the accuracy of the first disease name identifiers. The word vectors of the initial keywords, the first disease name identifier list, the first secondary disease name information list and the second secondary disease name information list are then combined to obtain the target medical record name identifier corresponding to the electronic medical record text, which improves the accuracy of the target medical record name identifier.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of the steps implemented when the computer program of a data processing system for acquiring a disease name identifier of an electronic medical record text is executed, according to an embodiment of the present invention;

FIG. 2 is another flowchart of the steps implemented when the computer program of the data processing system is executed, according to an embodiment of the present invention;

FIG. 3 is another flowchart of the steps implemented when the computer program of the data processing system is executed, according to an embodiment of the present invention;

FIG. 4 is another flowchart of the steps implemented when the computer program of the data processing system is executed, according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The first embodiment provides a data processing system for acquiring a disease name identifier of an electronic medical record text. The system includes a processor and a memory storing a computer program. The memory further stores an initial keyword vector set $A^0 = \{A^0_1, A^0_2, \ldots, A^0_i, \ldots, A^0_m\}$ of the electronic medical record text, a primary disease name information list $L^0$, a first secondary disease name information list $L^1$ and a second secondary disease name information list $L^2$, where $A^0_i$ is the word vector of the $i$-th initial keyword of the electronic medical record text, $i = 1, 2, \ldots, m$, and $m$ is the total number of initial keywords of the electronic medical record text. $L^0$ includes $n$ preset primary disease names, the primary disease name identifier corresponding to each primary disease name, and a primary disease name vector set $B^0 = \{B^0_1, B^0_2, \ldots, B^0_j, \ldots, B^0_n\}$, where $B^0_j$ is the word vector of the primary keyword corresponding to the $j$-th primary disease name, $j = 1, 2, \ldots, n$. When the computer program is executed by the processor, the following steps are implemented, as shown in FIG. 1:

The initial keywords may be keywords obtained by extracting keywords from the electronic medical record text according to a keyword extraction algorithm; the primary disease name may refer to a disease name corresponding to a primary order, the first secondary disease name may refer to a disease name corresponding to a sub-order, the second secondary disease name may refer to a disease name corresponding to a detail, the primary disease name identifier is a unique identifier corresponding to the primary disease name, and the primary keyword corresponding to the primary disease name may be a keyword obtained by extracting a keyword from the primary disease name according to a keyword extraction algorithm. One skilled in the art knows that any keyword extraction algorithm in the prior art falls within the protection scope of the present invention, and is not described herein.

The higher the similarity between an initial keyword and a primary keyword, the higher the probability that the primary disease name identifier corresponding to that primary keyword is a first disease name identifier of the electronic medical record text. Therefore, the similarity between each initial keyword and each primary keyword is obtained first, and the primary disease name similarity set $C^0$ is then obtained as a basis for determining the first disease name identifiers. Those skilled in the art will understand that any similarity calculation method in the prior art falls within the protection scope of the present invention, which is not described in detail here.

In the above, the similarity between the initial keywords and the primary keywords characterizes the likelihood that the corresponding primary disease name identifier is a first disease name identifier of the electronic medical record text, which provides a reliable data basis for determining the first disease name identifiers.

In a specific embodiment, the memory further stores the electronic medical record text and the primary keyword corresponding to each primary disease name, and $A^0$ and $B^0$ are obtained through the following steps:

S10, acquire a target keyword set $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ of the electronic medical record text, where $A_i$ is the $i$-th target keyword;

S20, input $A$ into a word vector model and obtain the target keyword vector set $A^0 = \{A^0_1, A^0_2, \ldots, A^0_i, \ldots, A^0_m\}$ corresponding to $A$, where $A^0_i$ is the target keyword vector corresponding to $A_i$;

S30, input the primary keyword corresponding to each preset primary disease name into the word vector model to obtain the preset primary disease name vector set $B^0 = \{B^0_1, B^0_2, \ldots, B^0_j, \ldots, B^0_n\}$.

Any word vector model in the prior art is known by those skilled in the art to fall within the protection scope of the present invention, and will not be described herein.
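As a minimal sketch of S10 to S30, the following Python uses a small hand-written lookup table in place of a trained word-vector model. The table `TOY_EMBEDDINGS`, the helper `embed`, and the sample keywords are assumptions made for illustration only; the source does not prescribe a particular keyword extractor or word-vector model.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical toy embedding table standing in for a trained word-vector model.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "cough": [1.0, 0.0, 0.0],
    "fever": [0.0, 1.0, 0.0],
    "pneumonia": [0.7, 0.7, 0.0],
    "fracture": [0.0, 0.0, 1.0],
}


def embed(keywords: List[str], table: Dict[str, Vector]) -> List[Vector]:
    """Map each keyword to its vector, keeping order and repeats.

    Raises KeyError for a keyword that is missing from the table.
    """
    return [list(table[k]) for k in keywords]


# S10: target keywords of the record text; repeated keywords are kept,
# because their number of occurrences is used later.
A = ["cough", "fever", "cough"]
# S20: target keyword vector set A0.
A0 = embed(A, TOY_EMBEDDINGS)
# S30: vectors of the primary keywords of two hypothetical primary disease names.
B0 = embed(["pneumonia", "fracture"], TOY_EMBEDDINGS)
```

Keeping repeated keywords in `A0` is a deliberate choice in this sketch, since the total number of keywords and their occurrence counts are both defined over the same set.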

In a specific embodiment, $C^0_{ij} = (A^0_i \cdot B^0_j) / (\|A^0_i\| \times \|B^0_j\|)$, where $\|A^0_i\|$ is the modulus of $A^0_i$ and $\|B^0_j\|$ is the modulus of $B^0_j$.

For two vectors, the degree of similarity can be expressed by the cosine of the angle between them. In this embodiment, the cosine similarity between $A^0_i$ and $B^0_j$ is therefore calculated, and the result $C^0_{ij}$ is taken as the similarity between $A^0_i$ and $B^0_j$.
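The cosine similarity used for S100 can be sketched in plain Python as below. The function names `cosine` and `similarity_matrix` are illustrative, and returning 0.0 for a zero-length vector is an assumption of this sketch, since the source does not say how that case is handled.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


def similarity_matrix(A0: List[Sequence[float]],
                      B0: List[Sequence[float]]) -> List[List[float]]:
    """S100: C0[i][j] is the similarity between A0[i] and B0[j]."""
    return [[cosine(a, b) for b in B0] for a in A0]


C0 = similarity_matrix([[1.0, 0.0], [1.0, 1.0]],
                       [[1.0, 0.0], [0.0, 2.0]])
```

Row `i` of `C0` corresponds to the `i`-th initial keyword and column `j` to the `j`-th primary disease name.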

When $C^0_{ij} > \Delta C^0$, the electronic medical record text is more similar to the primary disease name corresponding to $C^0_{ij}$, and the identifier of that primary disease name is more likely to be a first disease name identifier of the electronic medical record text. In this embodiment, the primary disease name corresponding to every $C^0_{ij}$ with $C^0_{ij} > \Delta C^0$ is therefore screened out as a first candidate disease name, and the first candidate disease name list $D^0$ is obtained as a basis for determining the first disease name identifiers.

The specific value of $\Delta C^0$ may be set by the practitioner according to the actual situation.

In the above, by comparing each primary disease name similarity with the first preset threshold, the primary disease names that are most similar to the electronic medical record text are selected from all primary disease names, which narrows the acquisition range of the first disease name identifiers and improves their accuracy.
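The threshold screening of S200 can be illustrated with a short sketch. The similarity values, the primary disease names, and the threshold 0.6 are made-up sample data, and `first_candidates` is a hypothetical helper name.

```python
from typing import List


def first_candidates(C0: List[List[float]],
                     names: List[str],
                     delta_c0: float) -> List[str]:
    """S200: emit one candidate per (i, j) pair with C0[i][j] > delta_c0.

    names[j] is the primary disease name for column j. A name can appear
    several times when several keywords pass the threshold for it.
    """
    return [names[j]
            for row in C0
            for j, c in enumerate(row)
            if c > delta_c0]


C0 = [[0.9, 0.2],
      [0.8, 0.7],
      [0.1, 0.3]]
names = ["respiratory", "circulatory"]  # hypothetical primary disease names
D0 = first_candidates(C0, names, delta_c0=0.6)
```

Here two keywords pass the threshold for the first name and one for the second, so the first name appears twice in `D0`.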

Each initial keyword may appear one or more times in the electronic medical record text, and different initial keywords may have primary disease name similarities greater than $\Delta C^0$ that correspond to the same primary disease name, so each first candidate disease name may appear one or more times in $D^0$.

To avoid repeated calculation, reduce memory usage and improve the efficiency of obtaining the first disease name identifiers, $D^0$ is de-duplicated to obtain the first intermediate disease name list $D^1$, which serves as a basis for determining the first disease name identifiers.

In the above, de-duplicating $D^0$ to obtain the first intermediate disease name list $D^1$ avoids repeated calculation on duplicate data, further narrows the acquisition range of the first disease name identifiers, and improves the efficiency of obtaining them.
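De-duplication as in S300 can be done in one line in Python. The sketch below keeps the first-seen order of the names, which is an assumption: the source does not say whether the order of the first intermediate disease name list matters.

```python
from typing import List


def deduplicate(D0: List[str]) -> List[str]:
    """S300: drop repeated names while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys acts as an ordered set.
    return list(dict.fromkeys(D0))


D0 = ["respiratory", "respiratory", "circulatory", "respiratory"]
D1 = deduplicate(D0)
```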

From all the primary disease name similarities in $C^0$ that correspond to $D^1_k$, the $r(k)$ similarities greater than $\Delta C^0$ are selected to form $C^1_k$, so that $C^1_k$ characterizes how similar $D^1_k$ is to the electronic medical record text and provides a data basis for further screening out the first disease name identifiers of the electronic medical record text.

It will be appreciated that the higher the primary disease name intermediate similarities corresponding to $D^1_k$, the higher the probability that $D^1_k$ is selected as a first disease name identifier.

In the above, selecting from $C^0$ all the primary disease name intermediate similarities corresponding to each first intermediate disease name makes it possible to characterize the probability that each first intermediate disease name is selected as a first disease name identifier, so that all first intermediate disease names can be further screened based on $C^1$ and the first disease name identifier list is finally obtained, which improves the accuracy of the first disease name identifiers.
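A minimal sketch of S400 follows. For each first intermediate disease name it collects the above-threshold similarities from the matching column of the similarity matrix. Storing each similarity together with its keyword index `i` is an assumption of this sketch, made so that the originating keyword can be looked up when occurrence frequencies are computed; all sample values are hypothetical.

```python
from typing import List, Tuple


def intermediate_similarities(C0: List[List[float]],
                              names: List[str],
                              D1: List[str],
                              delta_c0: float) -> List[List[Tuple[int, float]]]:
    """S400: for each name in D1, list (keyword index, similarity) pairs
    whose similarity in that name's column of C0 exceeds delta_c0."""
    column = {name: j for j, name in enumerate(names)}
    C1 = []
    for name in D1:
        j = column[name]
        C1.append([(i, row[j]) for i, row in enumerate(C0) if row[j] > delta_c0])
    return C1


C0 = [[0.9, 0.2],
      [0.8, 0.7],
      [0.1, 0.3]]
names = ["respiratory", "circulatory"]  # hypothetical primary disease names
D1 = ["respiratory", "circulatory"]
C1 = intermediate_similarities(C0, names, D1, delta_c0=0.6)
```

The length of each inner list plays the role of r(k).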

The more often an initial keyword occurs in the medical record text, the more important that keyword is for obtaining the first disease name identifiers. The first occurrence frequency $E^1_{kx}$ corresponding to $C^1_{kx}$ is therefore obtained as the ratio of $Q^1_{kx}$, the number of times the word vector of the initial keyword corresponding to $C^1_{kx}$ occurs in $A^0$, to the total number $m$ of initial keywords, so that the probability that each first intermediate disease name is selected as a first disease name identifier can be characterized in combination with $Q^1_{kx}$.

In the above, the number of occurrences of the initial keywords in the medical record text is used as a basis for the probability that each first intermediate disease name is selected as a first disease name identifier, so that all first intermediate disease names can be further screened based on $E^1$ and the first disease name identifier list is finally obtained, which improves the accuracy of the first disease name identifiers.
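The occurrence frequency of S500 can be sketched as below. The helper name `occurrence_frequencies`, the index layout of `keyword_indices`, and the sample vectors are assumptions; the sketch counts identical vectors in the keyword vector set, which mirrors the definition of Q as the number of times a word vector occurs in that set.

```python
from collections import Counter
from typing import List, Sequence


def occurrence_frequencies(A0: List[Sequence[float]],
                           keyword_indices: List[List[int]]) -> List[List[float]]:
    """S500: E[k][x] = Q[k][x] / m.

    keyword_indices[k][x] is the index in A0 of the keyword that produced
    the x-th retained similarity of the k-th intermediate disease name.
    Q is how many times that keyword's vector occurs in A0; m = len(A0).
    """
    m = len(A0)
    counts = Counter(tuple(v) for v in A0)
    return [[counts[tuple(A0[i])] / m for i in indices]
            for indices in keyword_indices]


# Four keyword vectors; the first and third are the same keyword.
A0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
E1 = occurrence_frequencies(A0, keyword_indices=[[0, 1], [3]])
```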

The specific value of $\Delta S^1$ may be set by the practitioner according to the actual situation.
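To make the selection probability and the second threshold concrete, the sketch below averages the products of occurrence frequency and similarity for each first intermediate disease name, then keeps the identifiers whose probability exceeds the threshold. The sample numbers, the identifiers "A" and "B", and the threshold 0.2 are hypothetical.

```python
from typing import Dict, List


def selection_probabilities(C1: List[List[float]],
                            E1: List[List[float]]) -> List[float]:
    """S600: S[k] = sum_x(E[k][x] * C[k][x]) / r(k), with r(k) = len(C1[k]).

    Every C1[k] is non-empty, because a name only enters the intermediate
    list after at least one similarity passed the first threshold.
    """
    return [sum(e * c for e, c in zip(Ek, Ck)) / len(Ck)
            for Ck, Ek in zip(C1, E1)]


def first_identifiers(D1: List[str],
                      S1: List[float],
                      name_to_id: Dict[str, str],
                      delta_s1: float) -> List[str]:
    """S700: keep identifiers whose selection probability exceeds delta_s1."""
    return [name_to_id[name] for name, s in zip(D1, S1) if s > delta_s1]


C1 = [[0.9, 0.8], [0.7]]
E1 = [[0.5, 0.25], [0.25]]
D1 = ["respiratory", "circulatory"]
name_to_id = {"respiratory": "A", "circulatory": "B"}  # hypothetical identifiers

S1 = selection_probabilities(C1, E1)   # about [0.325, 0.175]
W1 = first_identifiers(D1, S1, name_to_id, delta_s1=0.2)
```

For the first name, (0.5 × 0.9 + 0.25 × 0.8) / 2 = 0.325, which passes the threshold; for the second, 0.25 × 0.7 / 1 = 0.175, which does not.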

In one embodiment, S800 specifically includes the following steps, as shown in fig. 2:

S810, according to $A^0$, $W^1$ and $L^1$, obtain a second disease name identifier list $W^2$ corresponding to the electronic medical record text;

S820, according to $A^0$, $W^2$ and $L^2$, obtain a third disease name identifier $w^3$ corresponding to the electronic medical record text;

S830, according to $W^1$, $W^2$ and $w^3$, obtain the target medical record name identifier corresponding to the electronic medical record text.

In an embodiment, $L^1$ includes $P$ preset first secondary disease names, the first secondary disease name identifier corresponding to each first secondary disease name, the primary disease name corresponding to each first secondary disease name, and a first secondary disease name vector set $B^1 = \{B^1_1, B^1_2, \ldots, B^1_p, \ldots, B^1_P\}$, where $B^1_p$ is the word vector of the first secondary keyword corresponding to the $p$-th first secondary disease name, $p = 1, 2, \ldots, P$. S810 specifically includes the following steps, as shown in FIG. 3:

S811, according to $W^1$ and $L^1$, determine the first secondary disease names corresponding to the primary disease name of each first disease name identifier as first to-be-selected secondary disease names;

S812, according to the first secondary disease name vectors corresponding to all the first to-be-selected secondary disease names, obtain from $B^1$ a first to-be-selected secondary disease name vector set $B^{1d} = \{B^{1d}_1, B^{1d}_2, \ldots, B^{1d}_v, \ldots, B^{1d}_y\}$, where $B^{1d}_v$ is the word vector of the first secondary keyword corresponding to the $v$-th first to-be-selected secondary disease name, $v = 1, 2, \ldots, y$, and $y$ is the total number of first to-be-selected secondary disease names;

S813, according to $A^0$ and $B^{1d}$, obtain the first secondary disease name similarity set $C^2 = \{C^2_1, C^2_2, \ldots, C^2_i, \ldots, C^2_m\}$ corresponding to $A^0$, where $C^2_i = \{C^2_{i1}, C^2_{i2}, \ldots, C^2_{iv}, \ldots, C^2_{iy}\}$ and $C^2_{iv}$ is the similarity between $A^0_i$ and $B^{1d}_v$;

S814, take the first secondary disease name corresponding to each $C^2_{iv}$ that satisfies $C^2_{iv} > \Delta C^2$ as a second candidate disease name, and obtain a second candidate disease name list $D^2$, where $\Delta C^2$ is a third preset threshold;

S815, de-duplicate $D^2$ to obtain a second intermediate disease name list $D^3 = \{D^3_1, D^3_2, \ldots, D^3_h, \ldots, D^3_H\}$, where $D^3_h$ is the $h$-th second intermediate disease name, $h = 1, 2, \ldots, H$, and $H$ is the total number of second intermediate disease names;

S816, according to $C^2$ and $D^3$, obtain the first secondary disease name intermediate similarity set $C^3 = \{C^3_1, C^3_2, \ldots, C^3_h, \ldots, C^3_H\}$ corresponding to $D^3$, where $C^3_h = \{C^3_{h1}, C^3_{h2}, \ldots, C^3_{hf}, \ldots, C^3_{hr(h)}\}$, $C^3_{hf}$ is the $f$-th first secondary disease name similarity in $C^2$ that corresponds to $D^3_h$ and is greater than $\Delta C^2$, $f = 1, 2, \ldots, r(h)$, and $r(h)$ is the total number of first secondary disease name similarities in $C^2$ that correspond to $D^3_h$ and are greater than $\Delta C^2$;

S817, according to $C^3$ and $A^0$, obtain the second occurrence frequency set $E^2 = \{E^2_1, E^2_2, \ldots, E^2_h, \ldots, E^2_H\}$ corresponding to $D^3$, where $E^2_h = \{E^2_{h1}, E^2_{h2}, \ldots, E^2_{hf}, \ldots, E^2_{hr(h)}\}$, $E^2_{hf} = Q^2_{hf} / m$, and $Q^2_{hf}$ is the number of times the word vector of the initial keyword corresponding to $C^3_{hf}$ occurs in $A^0$;

S818, according to $C^3$ and $E^2$, obtain the second selection probability set $S^2 = \{S^2_1, S^2_2, \ldots, S^2_h, \ldots, S^2_H\}$ corresponding to $D^3$, where the second selection probability corresponding to $D^3_h$ is $S^2_h = \frac{1}{r(h)} \sum_{f=1}^{r(h)} E^2_{hf} \cdot C^3_{hf}$;

S819, take the first secondary disease name identifier of the second intermediate disease name corresponding to each $S^2_h$ that satisfies $S^2_h > \Delta S^2$ as a second disease name identifier, and obtain the second disease name identifier list $W^2$, where $\Delta S^2$ is a fourth preset threshold.

The first secondary disease name identifier is a unique identifier corresponding to the first secondary disease name, and the first secondary keyword corresponding to the first secondary disease name may be a keyword obtained by extracting a keyword from the first secondary disease name according to a keyword extraction algorithm. Each primary disease name corresponds to a plurality of first secondary disease names.

The specific values of $\Delta C^2$ and $\Delta S^2$ may be set by the practitioner according to the actual situation.

In the above, all the first secondary disease names are screened according to $W^1$: the first secondary disease names corresponding to the primary disease name of each first disease name identifier in $W^1$ are determined as first to-be-selected secondary disease names, which narrows the acquisition range of the second disease name identifiers. Then, based on the similarity between the word vectors of the initial keywords and the word vectors of the first secondary keywords corresponding to the first to-be-selected secondary disease names, the second candidate disease name list $D^2$ is screened out, $D^2$ is de-duplicated to obtain the second intermediate disease name list $D^3$, and the first secondary disease name intermediate similarity set $C^3$ and the second occurrence frequency set $E^2$ corresponding to $D^3$ are obtained. The probability that each second intermediate disease name is selected as a second disease name identifier can thus be characterized by combining the similarity between the initial keywords and the first secondary keywords with the occurrence frequency of the initial keywords in the medical record text, so that all second intermediate disease names are screened and the second disease name identifier list is finally obtained, which improves the accuracy of the second disease name identifiers.
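The narrowing performed in S811 and S812 can be sketched as a filter over a parent lookup. The dictionaries `parent_of` and `vectors` are a hypothetical layout for the first secondary disease name information list, keyed by first secondary disease name and pointing at the parent primary identifier; the source only states which fields the list contains, not how they are stored.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def narrow_candidates(W1: List[str],
                      parent_of: Dict[str, str],
                      vectors: Dict[str, Vector]) -> Tuple[List[str], List[Vector]]:
    """S811/S812: keep only first secondary disease names whose parent
    primary identifier is in W1, and collect their keyword vectors."""
    allowed = set(W1)
    selected = [name for name, parent in parent_of.items() if parent in allowed]
    return selected, [vectors[name] for name in selected]


W1 = ["A"]  # hypothetical first disease name identifier list
parent_of = {"pneumonia": "A", "asthma": "A", "angina": "B"}
vectors = {"pneumonia": [0.7, 0.7], "asthma": [1.0, 0.0], "angina": [0.0, 1.0]}

to_be_selected, B1d = narrow_candidates(W1, parent_of, vectors)
```

Only the children of the retained primary identifiers are compared against the keyword vectors afterwards, which is what reduces the search range at the second level.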

In one embodiment, $\Delta S^1 > \Delta S^2$.

$\Delta S^1$ is the threshold for screening the first disease name identifiers and $\Delta S^2$ is the threshold for screening the second disease name identifiers. The second disease name identifiers are obtained on the basis of the first disease name identifiers, by further combining the similarity between the initial keywords and the first secondary keywords with the number of occurrences of the initial keywords in the medical record text. If the first disease name identifiers are unreliable, the second disease name identifiers will be unreliable as well, so a larger threshold is set for the first disease name identifiers so that both the first and the second disease name identifiers are obtained with high accuracy.

In a specific embodiment, $L^2$ includes $G$ preset second secondary disease names, the second secondary disease name identifier corresponding to each second secondary disease name, the first secondary disease name corresponding to each second secondary disease name, and a second secondary disease name vector set $B^2 = \{B^2_1, B^2_2, \ldots, B^2_g, \ldots, B^2_G\}$, where $B^2_g$ is the word vector of the second secondary keyword corresponding to the $g$-th second secondary disease name, $g = 1, 2, \ldots, G$. S820 specifically includes the following steps, as shown in FIG. 4:

S821, according to $W^2$ and $L^2$, determine the second secondary disease names corresponding to the first secondary disease name of each second disease name identifier as second to-be-selected secondary disease names;

S822, according to the second secondary disease name vectors corresponding to all the second to-be-selected secondary disease names, obtain from $B^2$ a second to-be-selected secondary disease name vector set $B^{2f} = \{B^{2f}_1, B^{2f}_2, \ldots, B^{2f}_u, \ldots, B^{2f}_Y\}$, where $B^{2f}_u$ is the word vector of the second secondary keyword corresponding to the $u$-th second to-be-selected secondary disease name, $u = 1, 2, \ldots, Y$, and $Y$ is the total number of second to-be-selected secondary disease names;

S823, according to $A^0$ and $B^{2f}$, obtain the second secondary disease name similarity set $C^4 = \{C^4_1, C^4_2, \ldots, C^4_i, \ldots, C^4_m\}$ corresponding to $A^0$, where $C^4_i = \{C^4_{i1}, C^4_{i2}, \ldots, C^4_{iu}, \ldots, C^4_{iY}\}$ and $C^4_{iu}$ is the similarity between $A^0_i$ and $B^{2f}_u$;

S824, take the second secondary disease name corresponding to each $C^4_{iu}$ that satisfies $C^4_{iu} > \Delta C^4$ as a third candidate disease name, and obtain a third candidate disease name list $D^3$, where $\Delta C^4$ is a fifth preset threshold;

S825, de-duplicate $D^3$ to obtain a third intermediate disease name list $D^4 = \{D^4_1, D^4_2, \ldots, D^4_\alpha, \ldots, D^4_\beta\}$, where $D^4_\alpha$ is the $\alpha$-th third intermediate disease name, $\alpha = 1, 2, \ldots, \beta$, and $\beta$ is the total number of third intermediate disease names;

S826, according to $C^4$ and $D^4$, obtain the second secondary disease name intermediate similarity set $C^5 = \{C^5_1, C^5_2, \ldots, C^5_\alpha, \ldots, C^5_\beta\}$ corresponding to $D^4$, where $C^5_\alpha = \{C^5_{\alpha 1}, C^5_{\alpha 2}, \ldots, C^5_{\alpha\gamma}, \ldots, C^5_{\alpha r(\alpha)}\}$, $C^5_{\alpha\gamma}$ is the $\gamma$-th second secondary disease name similarity in $C^4$ that corresponds to $D^4_\alpha$ and is greater than $\Delta C^4$, $\gamma = 1, 2, \ldots, r(\alpha)$, and $r(\alpha)$ is the total number of second secondary disease name similarities in $C^4$ that correspond to $D^4_\alpha$ and are greater than $\Delta C^4$;

S827, according to $C^5$ and $A^0$, obtain the third occurrence frequency set $E^3 = \{E^3_1, E^3_2, \ldots, E^3_\alpha, \ldots, E^3_\beta\}$ corresponding to $D^4$, where $E^3_\alpha = \{E^3_{\alpha 1}, E^3_{\alpha 2}, \ldots, E^3_{\alpha\gamma}, \ldots, E^3_{\alpha r(\alpha)}\}$, $E^3_{\alpha\gamma} = Q^3_{\alpha\gamma} / m$, and $Q^3_{\alpha\gamma}$ is the number of times the word vector of the initial keyword corresponding to $C^5_{\alpha\gamma}$ occurs in $A^0$;

S828, according to $C^5$ and $E^3$, obtain the third selection probability set $S^3 = \{S^3_1, S^3_2, \ldots, S^3_\alpha, \ldots, S^3_\beta\}$ corresponding to $D^4$, where the third selection probability corresponding to $D^4_\alpha$ is $S^3_\alpha = \frac{1}{r(\alpha)} \sum_{\gamma=1}^{r(\alpha)} E^3_{\alpha\gamma} \cdot C^5_{\alpha\gamma}$;

S829, take the second secondary disease name identifier of the third intermediate disease name corresponding to $\max(S^3)$ as the third disease name identifier $w^3$.

The second secondary disease name identifier is a unique identifier corresponding to a second secondary disease name, and the second secondary keyword corresponding to the second secondary disease name may be a keyword obtained by extracting a keyword from the second secondary disease name according to a keyword extraction algorithm. Each first minor disease name corresponds to a plurality of second minor disease names.

The specific value of $\Delta C^4$ may be set by the practitioner according to the actual situation.

Since the second secondary disease name is the name at the last level, after the third selection probability set $S^3$ corresponding to $D^4$ is obtained, the second secondary disease name identifier of the third intermediate disease name corresponding to $\max(S^3)$ is taken as the third disease name identifier $w^3$.
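Unlike the first two levels, which keep every candidate above a threshold, the last level keeps only the single best candidate. A minimal sketch, with hypothetical names and identifiers; breaking ties by taking the first maximum is an assumption, since the source does not address ties:

```python
from typing import Dict, List


def third_identifier(D4: List[str],
                     S3: List[float],
                     name_to_id: Dict[str, str]) -> str:
    """S829: return the identifier of the name with the largest
    third selection probability (first one wins on ties)."""
    if not S3:
        raise ValueError("no third intermediate disease names to choose from")
    best = max(range(len(S3)), key=lambda a: S3[a])
    return name_to_id[D4[best]]


D4 = ["detail-x", "detail-y", "detail-z"]   # hypothetical names
S3 = [0.1, 0.4, 0.3]
name_to_id = {"detail-x": "1", "detail-y": "2", "detail-z": "3"}

w3 = third_identifier(D4, S3, name_to_id)
```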

In the above, all the second secondary disease names are screened according to $W^2$: the second secondary disease names corresponding to the first secondary disease name of each second disease name identifier in $W^2$ are determined as second to-be-selected secondary disease names, which narrows the acquisition range of the third disease name identifier. Then, based on the similarity between the word vectors of the initial keywords and the word vectors of the second secondary keywords corresponding to the second to-be-selected secondary disease names, the third candidate disease name list $D^3$ is screened out, $D^3$ is de-duplicated to obtain the third intermediate disease name list $D^4$, and the second secondary disease name intermediate similarity set $C^5$ and the third occurrence frequency set $E^3$ corresponding to $D^4$ are obtained. The probability that each third intermediate disease name is selected as the third disease name identifier can thus be characterized by combining the similarity between the initial keywords and the second secondary keywords with the occurrence frequency of the initial keywords in the medical record text, so that all third intermediate disease names are screened and the third disease name identifier is finally obtained, which improves the accuracy of the third disease name identifier.

In one embodiment, S830 specifically includes the following steps:

S831, taking the first secondary disease name identifier corresponding to w³ as the second disease name identifier w²;

S832, taking the main disease name identifier corresponding to w² as the first disease name identifier w¹;

S833, splicing w¹, w² and w³ in a preset order to obtain the target medical record name identifier w¹w²w³ corresponding to the electronic medical record text.
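Steps S831 to S833 can be sketched as a walk up the three-level hierarchy followed by a concatenation. The two lookup tables, the function name, and the `sep` parameter below are hypothetical; the sketch also assumes that the preset order is w¹, w², w³, as the identifier w¹w²w³ suggests.

```python
def target_identifier(w3, parent_of_second, parent_of_first, sep=""):
    """Build the target medical record name identifier from w3.

    parent_of_second: dict mapping a second secondary disease name
        identifier to its first secondary disease name identifier.
    parent_of_first: dict mapping a first secondary disease name
        identifier to its main disease name identifier.
    Both mappings are assumed layouts, not part of the source text.
    """
    w2 = parent_of_second[w3]       # S831: parent of w3
    w1 = parent_of_first[w2]        # S832: parent of w2
    return sep.join((w1, w2, w3))   # S833: splice in the order w1, w2, w3
```

Because w² and w¹ are derived from w³ rather than chosen independently, the spliced identifier always describes a single consistent path through the hierarchy.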

In this embodiment, a first candidate disease name list D⁰ is screened out based on the similarity between the word vector of each initial keyword and the word vector of the main keyword corresponding to each main disease name, and D⁰ is de-duplicated to obtain a first intermediate disease name list D¹. The main disease name intermediate similarity set C¹ corresponding to D¹ and the corresponding first occurrence frequency set E¹ are further obtained, so that the similarity between the initial keywords and the main keywords is combined with the occurrence frequency of the initial keywords in the medical record text to represent the probability that each first intermediate disease name is selected as a first disease name identifier. All first intermediate disease names are screened on this basis to finally obtain the first disease name identifier list, which improves the accuracy of obtaining the first disease name identifiers. The word vectors of the initial keywords, the first disease name identifier list, the first secondary disease name information list L¹ and the second secondary disease name information list L² are then combined to acquire the target medical record name identifier corresponding to the electronic medical record text, which improves the accuracy of acquiring the target medical record name identifier.
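The first-level screening summarized above can be sketched in Python as follows. The sketch assumes that word vectors are tuples of floats (so that repeated keywords can be counted), that similarity is cosine similarity, that E¹_kx is the number of occurrences of the keyword vector in A⁰ divided by m, and that S¹_k is the mean of E¹_kx × C¹_kx over the r(k) above-threshold similarities. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_disease_identifiers(A0, B0, main_ids, delta_c0, delta_s1):
    """Return (W1, S1) for the first-level screening.

    A0: list of initial keyword vectors (tuples), one per keyword.
    B0: list of main keyword vectors, one per main disease name.
    main_ids: main disease name identifiers, aligned with B0.
    delta_c0, delta_s1: the first and second preset thresholds.
    """
    m = len(A0)
    occurrences = Counter(A0)  # how often each keyword vector occurs in A0

    # Group every above-threshold similarity under its main disease name.
    # Using the index j as a dict key de-duplicates the candidate list.
    hits = {}
    for a in A0:
        for j, b in enumerate(B0):
            c = cosine(a, b)
            if c > delta_c0:
                hits.setdefault(j, []).append((c, occurrences[a] / m))

    # Selection probability: mean of frequency * similarity per candidate.
    S1 = {j: sum(c * e for c, e in pairs) / len(pairs)
          for j, pairs in hits.items()}

    # Keep the identifiers whose selection probability exceeds delta_s1.
    W1 = [main_ids[j] for j, s in S1.items() if s > delta_s1]
    return W1, S1
```

Weighting each similarity by the keyword's relative frequency means that a main disease name matched by keywords that recur in the record scores higher than one matched only by a single incidental keyword.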

While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. A data processing system for acquiring the disease name identification of an electronic medical record text, characterized by comprising a processor and a memory storing a computer program, wherein the memory further stores an initial keyword vector set A⁰ = {A⁰₁, A⁰₂, …, A⁰_i, …, A⁰_m} of the electronic medical record text, a main disease name information list L⁰, a first secondary disease name information list L¹ and a second secondary disease name information list L²; A⁰_i refers to the word vector of the i-th initial keyword of the electronic medical record text, i = 1, 2, …, m, and m refers to the total number of initial keywords of the electronic medical record text; L⁰ comprises n preset main disease names, a main disease name identifier corresponding to each main disease name, and a main disease name vector set B⁰ = {B⁰₁, B⁰₂, …, B⁰_j, …, B⁰_n}, wherein B⁰_j refers to the word vector of the main keyword corresponding to the j-th main disease name, j = 1, 2, …, n; and when the computer program is executed by the processor, the following steps are implemented:

S100, obtaining a main disease name similarity set C⁰ = {C⁰₁, C⁰₂, …, C⁰_i, …, C⁰_m} corresponding to A⁰ according to A⁰ and B⁰, wherein C⁰_i = {C⁰_i1, C⁰_i2, …, C⁰_ij, …, C⁰_in}, and C⁰_ij refers to the similarity between A⁰_i and B⁰_j;

S200, taking the main disease name corresponding to each C⁰_ij that satisfies C⁰_ij > ΔC⁰ as a first candidate disease name, and acquiring a first candidate disease name list D⁰, wherein ΔC⁰ is a first preset threshold;

S300, de-duplicating D⁰ to obtain a first intermediate disease name list D¹ = {D¹₁, D¹₂, …, D¹_k, …, D¹_t}, wherein D¹_k refers to the k-th first intermediate disease name, k = 1, 2, …, t, and t is the total number of first intermediate disease names;

S400, obtaining a main disease name intermediate similarity set C¹ = {C¹₁, C¹₂, …, C¹_k, …, C¹_t} corresponding to D¹ according to C⁰ and D¹, wherein C¹_k = {C¹_k1, C¹_k2, …, C¹_kx, …, C¹_kr(k)}, C¹_kx refers to the x-th main disease name similarity in C⁰ that is greater than ΔC⁰ and corresponds to the main disease name of D¹_k, x = 1, 2, …, r(k), and r(k) refers to the total number of main disease name similarities in C⁰ that are greater than ΔC⁰ and correspond to the main disease name of D¹_k;

S500, obtaining a first occurrence frequency set E¹ = {E¹₁, E¹₂, …, E¹_k, …, E¹_t} corresponding to D¹ according to C¹ and A⁰, wherein E¹_k = {E¹_k1, E¹_k2, …, E¹_kx, …, E¹_kr(k)}, E¹_kx = Q¹_kx / m, and Q¹_kx is the number of occurrences in A⁰ of the word vector of the initial keyword corresponding to C¹_kx;

S600, obtaining a first selection probability set S¹ = {S¹₁, S¹₂, …, S¹_k, …, S¹_t} corresponding to D¹ according to C¹ and E¹, wherein the first selection probability corresponding to D¹_k is S¹_k = (Σ_{x=1}^{r(k)} E¹_kx × C¹_kx) / r(k);

S700, taking the main disease name identifier corresponding to each first intermediate disease name whose S¹_k satisfies S¹_k > ΔS¹ as a first disease name identifier, and acquiring a first disease name identifier list W¹, wherein ΔS¹ is a second preset threshold;

S800, obtaining a target medical record name identifier corresponding to the electronic medical record text according to A⁰, W¹, L¹ and L², wherein S800 specifically comprises the following steps:

S830, acquiring the target medical record name identifier corresponding to the electronic medical record text according to W¹, W² and W³;

S830 specifically includes the following steps:

2. The data processing system of claim 1, wherein the memory further stores the electronic medical record text and the main keyword corresponding to each main disease name, and wherein A⁰ and B⁰ are obtained by:

S10, acquiring a target keyword set A = {A₁, A₂, …, A_i, …, A_m} of the target electronic medical record text, wherein A_i refers to the i-th target keyword;

3. The data processing system of claim 1, wherein C⁰_ij = (A⁰_i · B⁰_j) / (‖A⁰_i‖ × ‖B⁰_j‖), where ‖A⁰_i‖ is the modulus of A⁰_i and ‖B⁰_j‖ is the modulus of B⁰_j.

4. The data processing system according to claim 1, wherein L¹ includes P preset first secondary disease names, a first secondary disease name identifier corresponding to each first secondary disease name, the main disease name corresponding to each first secondary disease name, and a first secondary disease name vector set B¹ = {B¹₁, B¹₂, …, B¹_p, …, B¹_P}, wherein B¹_p refers to the word vector of the first secondary keyword corresponding to the p-th first secondary disease name, p = 1, 2, …, P, and S810 specifically includes the steps of:

S813, obtaining a first secondary disease name similarity set C² = {C²₁, C²₂, …, C²_i, …, C²_m} corresponding to A⁰ according to A⁰ and B^1d, wherein C²_i = {C²_i1, C²_i2, …, C²_iv, …, C²_iy}, and C²_iv refers to the similarity between A⁰_i and B^1d_v;

5. The data processing system of claim 4, wherein ΔS¹ > ΔS².

6. The data processing system according to claim 4, wherein L² includes G preset second secondary disease names, a second secondary disease name identifier corresponding to each second secondary disease name, the first secondary disease name corresponding to each second secondary disease name, and a second secondary disease name vector set B² = {B²₁, B²₂, …, B²_g, …, B²_G}, wherein B²_g refers to the word vector of the second secondary keyword corresponding to the g-th second secondary disease name, g = 1, 2, …, G, and S820 specifically includes the steps of:

S822, obtaining a second to-be-selected secondary disease name vector set B^2f = {B^2f₁, B^2f₂, …, B^2f_u, …, B^2f_Y} from B² according to the second secondary disease name vectors corresponding to all second to-be-selected secondary disease names, wherein B^2f_u refers to the word vector of the second secondary keyword corresponding to the u-th second to-be-selected secondary disease name, u = 1, 2, …, Y, and Y refers to the total number of second to-be-selected secondary disease names;

S823, obtaining a second secondary disease name similarity set C⁴ = {C⁴₁, C⁴₂, …, C⁴_i, …, C⁴_m} corresponding to A⁰ according to A⁰ and B^2f, wherein C⁴_i = {C⁴_i1, C⁴_i2, …, C⁴_iu, …, C⁴_iY}, and C⁴_iu refers to the similarity between A⁰_i and B^2f_u;