Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the medical field, because the volume of medical record data is huge and strict data monitoring was lacking in the early stage, the information in medical records is highly redundant and inconsistent. Taking the common cold as an example, the diagnosis names used in practice may be "cold", "acute upper respiratory infection", "upper infection", "acute upper infection", and so on; different diagnosis names may actually indicate the same disease. To facilitate later querying, management and use of medical record data, the diagnosis names in medical records need to be standardized.
Currently, the standardization of diagnosis names mainly depends either on manual labeling or on comparing the similarity of two diagnosis names with a preset similarity threshold to determine whether the two diagnosis names represent the same disease. The former consumes a large amount of manpower, is inefficient, and yields labeling results that are highly subjective and of low accuracy. The latter depends entirely on the similarity threshold, but current technology cannot ensure that the threshold is set accurately, and a relationship between diagnosis names determined from a similarity threshold alone is not well founded, so the accuracy and reliability of standardization cannot be ensured. Therefore, the embodiment of the invention provides a method for standardizing diagnosis names so as to improve the accuracy and reliability of diagnosis name standardization.
Fig. 1 is a schematic flow chart of a method for standardizing a diagnostic name according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step 110, determining a plurality of diagnostic names.
The diagnosis name here is a diagnosis name that needs to be standardized, and the diagnosis name may be extracted from medical record data.
Step 120, adjusting the candidate synonym relationship between every two of the plurality of diagnosis names based on the similarity and the medical relationship between every two of the plurality of diagnosis names to obtain the final synonym relationship between every two of the plurality of diagnosis names.
Specifically, for any two diagnostic names, the synonym relationship between the two diagnostic names may be "yes" or "no", where "yes" indicates that the two diagnostic names are synonyms and "no" indicates that they are not. The candidate synonym relationship is the synonym relationship before and during the adjustment that is performed based on the similarity and the medical relationship between every two diagnostic names, and the final synonym relationship is the synonym relationship after the adjustment is completed. Before the adjustment, the candidate synonym relationship between every two diagnostic names may be randomly generated.
The similarity between any two diagnosis names represents the similarity of the information corresponding to the two diagnosis names. The information may be the medical record data corresponding to the two diagnosis names, or specific attributes of that medical record data, such as the chief complaint, current medical history, past medical history, allergy history, physical examination, auxiliary examination, and visiting department; this is not specifically limited in the embodiments of the present invention.
The medical relationship between any two diagnosis names represents the relationship between the two diagnosis names in the medical field. It may be the upper-lower relationship, within the disease system of the medical field, of the diseases corresponding to the two diagnosis names; the order in which the corresponding diseases appear in a patient; or the similarities and differences in the usual onset times of the corresponding diseases.
The similarity and the medical relationship between any two diagnostic names are associated with the synonym relationship between them. For example, the higher the similarity, the higher the probability that the synonym relationship between the two diagnosis names is "yes". Conversely, if the medical relationship shows that the corresponding diseases appear in a patient one after the other or concurrently, or that the corresponding diseases have an upper-lower relationship, the probability that the synonym relationship is "yes" is reduced.
Here, the adjustment of the candidate synonym relationship is performed by the mutual restriction of the similarity between every two diagnosis names and the medical relationship, thereby avoiding the disadvantage caused by judging the synonym relationship only by the similarity.
For example, take the two diagnostic names "acute upper respiratory infection" and "viral pharyngitis". Suppose the candidate synonym relationship between them is "yes" and their similarity is 70%, while in their medical relationship "acute upper respiratory infection" is a superordinate (broader) concept of "viral pharyngitis". Although the similarity is high, the candidate synonym relationship is still adjusted because of the constraint of the upper-lower relationship, and the final synonym relationship is "no".
Step 130, determining a standardized diagnosis name corresponding to each diagnosis name based on the final synonym relation between every two diagnosis names.
Specifically, after the final synonym relationship between every two diagnostic names is determined, a plurality of groups of diagnostic names which are synonyms of each other can be obtained, for example, "myelopathy," "spinal cord dysfunction," "myelopathy," and "neurogenic myelopathy" are a group of synonyms, one diagnostic name can be selected from the groups of synonymous diagnostic names as a standardized diagnostic name of the group of synonyms, and the standardized diagnostic name corresponding to each diagnostic name can be obtained based on the same method, so that the standardization of the diagnostic names is realized.
According to the method provided by the embodiment of the invention, the synonym relation between the diagnosis names is adjusted based on the similarity and the medical relation between every two diagnosis names, so that the standardization of the diagnosis names is realized, the medical knowledge is integrated into the determination process of the synonym relation, and the synonym relation and the similarity are mutually restricted, so that the accuracy and the reliability of the standardization of the diagnosis names are improved.
Based on the above embodiment, the medical relationship between any two diagnosis names includes at least one of a time-series relationship, an upper-lower relationship, and a difference in distribution of the onset time between the two diagnosis names.
Specifically, the time sequence relationship of any two diagnosis names is the order in which the diseases corresponding to the two diagnosis names appear in the same patient. The time sequence relationship can be extracted from a large amount of medical record data: if, in the medical records of the same patient, one diagnosis name is always preceded within an earlier period by the other diagnosis name, the two diagnosis names are most likely not synonyms. Here, the length of the period used to determine whether a time sequence relationship exists between two diagnosis names may be set based on business experience.
The upper-lower relationship of any two diagnosis names is the hierarchical relationship, within the disease system of the medical field, of the diseases corresponding to the two diagnosis names. If the symptom word set corresponding to one diagnosis name contains the symptom word set corresponding to the other, the two diagnosis names may have an upper-lower relationship and are most likely not synonyms.
The distribution difference of the onset time of any two diagnosis names is the difference between the onset time distributions of the diseases corresponding to the two diagnosis names. The onset time distribution may be divided by month, by quarter, or by another time unit. If the distributions of the two corresponding diseases over the preset time unit differ significantly, the two diagnosis names are most likely not synonyms.
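The symptom-set containment check described above for the upper-lower relationship can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name has_upper_lower_relation and the symptom words are hypothetical, and treating only strict containment (not identical sets) as an upper-lower relationship is an assumption that the text does not settle.

```python
def has_upper_lower_relation(symptoms_p: set, symptoms_q: set) -> bool:
    """Return True when one symptom word set strictly contains the other.

    Assumption: identical sets are not treated as an upper-lower relationship.
    """
    return symptoms_p < symptoms_q or symptoms_q < symptoms_p


# Hypothetical symptom word sets, for illustration only.
broad = {"fever", "cough", "sore throat", "runny nose"}
narrow = {"fever", "sore throat"}
unrelated = {"headache", "dizziness"}

relation_found = has_upper_lower_relation(broad, narrow)
relation_absent = has_upper_lower_relation(broad, unrelated)
```

With these hypothetical inputs, the first pair is flagged as having an upper-lower relationship and the second is not.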
Based on any of the above embodiments, fig. 2 is a schematic flow chart of a final synonym relationship determining method provided by the embodiment of the present invention, as shown in fig. 2, step 120 specifically includes:
Step 121, adjusting the candidate synonym relation between every two diagnostic names in the plurality of diagnostic names until the corresponding global function value reaches its maximum.
Step 122, taking the candidate synonym relation between every two diagnosis names corresponding to the maximum value as the final synonym relation between every two diagnosis names.
Wherein the global function value is determined based on the reference function value and the medical penalty function value of every two diagnosis names; the reference function value of any two diagnosis names is determined based on the candidate synonym relationship and the similarity between the two diagnosis names, and the medical penalty function value of any two diagnosis names is determined based on the candidate synonym relationship and the medical relationship between the two diagnosis names.
Specifically, the purpose of adjusting the candidate synonym relationship is to maximize the global function value, and the final synonym relationship between every two diagnostic names is the candidate synonym relationship between every two diagnostic names corresponding to the maximum global function value.
For any two diagnostic names, a reference function value for both may be derived based on the candidate synonym relationship and similarity between the two. Since the higher the similarity is, the higher the probability that the candidate synonym relationship between the two diagnostic names is "yes", and the lower the similarity is, the higher the probability that the candidate synonym relationship between the two diagnostic names is "no", the reference function value determined based on the above rule may be regarded as the score of the rule between the candidate synonym relationship and the similarity, and when the candidate synonym relationship is "yes", the higher the similarity is, the higher the reference function value is.
For any two diagnostic names, the medical penalty function value of the two may be derived based on the candidate synonym relationship and the medical relationship between them. Here, the medical penalty function value may be regarded as a penalty score reflecting the constraint that the medical relationship places on the candidate synonym relationship. When there is a medical relationship between the two that constrains their candidate synonym relationship to be "no", the medical penalty function value can be increased accordingly. For example, if the medical relationship shows that the diseases corresponding to the two diagnosis names appear in a patient one after the other or concurrently, the probability that the candidate synonym relationship between the two diagnosis names is "yes" is reduced, and if the candidate synonym relationship is nevertheless "yes", the medical penalty function value is increased.
Based on the reference function value and the medical penalty function value for every two diagnostic names, a global function value can be determined. Wherein, the higher the reference function value of every two diagnosis names is, the higher the global function value is; the higher the medical penalty function value for every two diagnostic names, the lower the global function value. Under the condition that the reference function value and the medical punishment value are restricted with each other, the candidate synonym relation between every two diagnosis names is adjusted to realize the maximization of the global function value, and the candidate synonym relation between every two diagnosis names corresponding to the maximum global function value is used as the final synonym relation between every two diagnosis names.
According to the method provided by the embodiment of the invention, the medical penalty function value is determined based on the candidate synonym relation and the medical relation among the diagnosis names, so that the medical penalty function value is applied to the determination of the global function value, and the value of the candidate synonym relation is restricted through the medical relation, so that the accurate determination of the synonym relation among the diagnosis names is realized.
Based on any of the above embodiments, the medical penalty function value includes at least one of a disease timing penalty function value, a superior-inferior relationship penalty function value, and a time distribution difference penalty function value; the disease time sequence penalty function value of any two diagnosis names is determined based on the time sequence relation between the two diagnosis names and the candidate synonym relation; the penalty function value of the upper and lower relations of any two diagnosis names is determined based on the upper and lower relations between the two diagnosis names and the candidate synonym relation; the time distribution difference penalty function value of any two diagnosis names is determined based on the time distribution difference of onset between the two diagnosis names and the candidate synonym relationship.
Specifically, if the timing relationship between any two diagnosis names is "present" and the candidate synonym relationship is "yes", it is obvious that the timing relationship is contrary to the candidate synonym relationship, and the disease timing penalty function value is set to a preset value correspondingly, so that the disease timing penalty comes into effect to constrain the global function value; in other cases, for example, when the timing relationship is "present" and the candidate synonym relationship is "no", or when the timing relationship is "absent", the disease timing penalty function value is set to zero, and the disease timing penalty does not take effect.
If the superior-inferior relation between any two diagnosis names is 'existing' and the candidate synonym relation is 'yes', obviously, the superior-inferior relation and the candidate synonym relation are contrary, and the superior-inferior relation penalty function value is correspondingly set as a preset value, so that the superior-inferior relation penalty comes into effect to restrict the global function value; in other cases, for example, when the superior-inferior relation is "present" and the candidate synonym relation is "no", or when the superior-inferior relation is "absent", the superior-inferior relation penalty function value is set to zero, and the superior-inferior relation penalty is not valid.
If the difference of the disease attack time distribution between any two diagnosis names is 'present' and the candidate synonym relationship is 'yes', obviously, the difference of the disease attack time distribution is contrary to the candidate synonym relationship, and the time distribution difference penalty function value is correspondingly set to be a preset value, so that the time distribution difference penalty comes into effect to restrict the global function value; in other cases, for example, when the difference in the occurrence time distribution is "present" and the candidate synonym relationship is "no", or when the difference in the occurrence time distribution is "absent", the time distribution difference penalty function value is set to zero, and the time distribution difference penalty does not take effect.
Based on any of the above embodiments, for any two diagnostic names p and q, the candidate synonym relationship between p and q is denoted as Exist(p, q), where Exist(p, q) = 1 corresponds to the candidate synonym relationship being "yes" and Exist(p, q) = 0 corresponds to it being "no". The similarity between p and q can be expressed as Prob(p, q), and the resulting reference function value S of p and q can be expressed as:
S=Prob(p,q)*Exist(p,q)
Based on any of the above embodiments, the timing relationship between p and q is represented as T(p, q), where T(p, q) = 1 if a timing relationship exists between p and q, and T(p, q) = 0 otherwise.
From this, the disease timing penalty function value S1 for p and q is expressed as:
S1=T(p,q)*(T(p,q)∧Exist(p,q))
Wherein T(p, q) ∧ Exist(p, q) represents the conjunction (logical AND) of the timing relationship and the candidate synonym relationship. If and only if T(p, q) and Exist(p, q) are both 1, S1 = 1 and the disease timing penalty takes effect; otherwise S1 = 0 and the disease timing penalty does not take effect.
Based on any of the above embodiments, the superior-inferior relationship between p and q is expressed as Set(P)⊆Set(Q), which takes the value 1 when the symptom word set of one diagnosis name is contained in that of the other, and 0 otherwise.
Wherein Set(P) and Set(Q) are the symptom word sets of p and q, respectively.
From this, the superior-inferior relationship penalty function value S2 for p and q is expressed as:
S2=(Set(P)⊆Set(Q))*((Set(P)⊆Set(Q))∧Exist(p,q))
In the formula, (Set(P)⊆Set(Q))∧Exist(p,q) represents the conjunction of the superior-inferior relationship and the candidate synonym relationship. If and only if Set(P)⊆Set(Q) and Exist(p, q) are both 1, S2 = 1 and the superior-inferior relationship penalty takes effect; otherwise S2 = 0 and the superior-inferior relationship penalty does not take effect.
Based on any of the above embodiments, the difference in the onset time distribution between p and q is expressed as R(p, q), where R(p, q) = 1 if a significant difference exists and R(p, q) = 0 otherwise.
From this, the time distribution difference penalty function value S3 for p and q is expressed as:
S3=R(p,q)*(R(p,q)∧Exist(p,q))
In the formula, R(p, q) ∧ Exist(p, q) represents the conjunction of the onset time distribution difference and the candidate synonym relationship. If and only if R(p, q) and Exist(p, q) are both 1, S3 = 1 and the time distribution difference penalty takes effect; otherwise S3 = 0 and the time distribution difference penalty does not take effect.
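The disease timing penalty S1 and the time distribution difference penalty S3 share one form: the relationship indicator multiplied by its conjunction with Exist(p, q). The following minimal sketch evaluates that form for all binary inputs. The function name medical_penalty is hypothetical, and applying the same helper to the superior-inferior relationship is an assumption based on the parallel description of the three penalties.

```python
def medical_penalty(relation: int, exist: int) -> int:
    """Compute relation * (relation AND exist) for binary inputs.

    relation: 1 if the medical relationship (e.g. T or R) is present, else 0.
    exist:    1 if the candidate synonym relationship is "yes", else 0.
    """
    return relation * (relation & exist)


# Enumerate all four input combinations as (relation, exist, penalty).
truth_table = [(r, e, medical_penalty(r, e)) for r in (0, 1) for e in (0, 1)]
```

Only the combination in which the relationship is present and the candidate synonym relationship is "yes" produces a non-zero penalty, which matches the conditions stated for S1 and S3.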
Based on any of the above embodiments, the value of R(p, q) can be determined by comparing the standard deviation diff(p, q) of the difference between the onset time distributions of the two diagnosis names with a preset standard deviation threshold: R(p, q) = 1 if diff(p, q) is greater than the preset standard deviation threshold, and R(p, q) = 0 otherwise.
Wherein diff(p, q) = std(MonthRate(p) - MonthRate(q)); MonthRate(p) and MonthRate(q) respectively indicate the incidence of the diagnosis names p and q in each natural month over a period of years, and both are 12-dimensional vectors.
The standard deviation threshold can be obtained by the following formula:
wherein D is a set including each of the diagnosis names, and N is the number of the diagnosis names included in D.
For example, diagnosis name p is acute upper respiratory infection, diagnosis name q is hypertension:
MonthRate(p)
=c(0.08,0.09,0.28,0.21,0.04,0.04,0.02,0.01,0.04,0.04,0.07,0.08)
MonthRate(q)
=c(0.09,0.08,0.06,0.07,0.09,0.09,0.08,0.09,0.10,0.08,0.08,0.09)
This calculation yields diff(p, q) = std(MonthRate(p) - MonthRate(q)) ≈ 0.09.
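The example above can be checked with a short sketch. It assumes that std denotes the sample standard deviation (statistics.stdev), which gives about 0.0897; the population standard deviation gives about 0.086, which also rounds to 0.09. The function names and the threshold value 0.05 are hypothetical and used for illustration only.

```python
from statistics import stdev

# Monthly incidence vectors taken from the example in the text.
month_rate_p = [0.08, 0.09, 0.28, 0.21, 0.04, 0.04, 0.02, 0.01, 0.04, 0.04, 0.07, 0.08]
month_rate_q = [0.09, 0.08, 0.06, 0.07, 0.09, 0.09, 0.08, 0.09, 0.10, 0.08, 0.08, 0.09]


def onset_distribution_diff(rate_p, rate_q):
    """diff(p, q): sample standard deviation of the element-wise difference."""
    return stdev([a - b for a, b in zip(rate_p, rate_q)])


def onset_difference_flag(diff, threshold):
    """R(p, q): 1 when diff exceeds the threshold, else 0."""
    return int(diff > threshold)


diff_pq = onset_distribution_diff(month_rate_p, month_rate_q)
r_pq = onset_difference_flag(diff_pq, threshold=0.05)  # hypothetical threshold
print(round(diff_pq, 2))  # 0.09
```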
Based on any of the above embodiments, the global function value is specifically determined based on the reference function value and the medical penalty function value of every two diagnostic names, together with at least one of a synonym transfer penalty function value, a similarity penalty function value, and a similarity threshold penalty function value of every two diagnostic names.
The synonym transfer penalty function value of any two diagnosis names is determined based on the synonym relation between any two diagnosis names and the rest diagnosis names; the similarity penalty function value of any two diagnosis names is determined based on the synonym relation and the similarity between any two diagnosis names; the similarity threshold penalty function value of any two diagnosis names is determined based on the synonym relation and similarity between any two diagnosis names and a preset similarity threshold.
Specifically, synonymy is transitive: if A and B are synonyms of each other and A and C are synonyms of each other, then B and C are most likely synonyms of each other. Based on this rule, for any two diagnosis names whose candidate synonym relationship is "no", the candidate synonym relationships between each of the two and the remaining diagnosis names are obtained in order to determine whether there is a diagnosis name whose candidate synonym relationship with both of them is "yes". If so, the candidate synonym relationship of the two diagnosis names contradicts the transitivity of synonyms, and a synonym transfer penalty function value is determined so that the synonym transfer penalty takes effect to constrain the global function value; if not, the synonym transfer penalty function value is set to zero and the synonym transfer penalty does not take effect.
The similarity and the candidate synonym relationship have a corresponding relationship, the higher the similarity is, the higher the probability that the candidate synonym relationship between the two diagnosis names is 'yes' is, and the lower the similarity is, the higher the probability that the candidate synonym relationship between the two diagnosis names is 'no' is, and based on the rule, aiming at any two diagnosis names, the global function value is constrained through the similarity punishment. Here, when the candidate synonym relationship is "yes", the higher the similarity is, the smaller the similarity penalty function value is, and the lower the similarity is, the larger the similarity penalty function value is; and when the candidate synonym relationship is 'no', the higher the similarity is, the larger the similarity penalty function value is, and the lower the similarity is, the smaller the similarity penalty function value is.
The corresponding candidate synonym relationship may generally be determined by comparing the similarity with a preset similarity threshold. For any two diagnosis names, when the similarity between the two diagnosis names is greater than a preset similarity threshold, the probability that the candidate synonym relation is 'yes' is higher, and the probability that the candidate synonym relation is 'no' is lower; when the similarity is smaller than a preset similarity threshold, the probability that the candidate synonym relation is 'yes' is small, and the probability that the candidate synonym relation is 'no' is large. Based on the rule, if the candidate synonym relationship is 'no' under the condition that the similarity is greater than the preset similarity threshold, or if the candidate synonym relationship is 'yes' under the condition that the similarity is less than the preset similarity threshold, it is obvious that the candidate synonym is contrary to the comparison rule based on the preset similarity threshold, and a similarity threshold penalty function value is correspondingly set, so that the similarity threshold penalty comes into effect to constrain a global function value; if not, setting the similarity threshold penalty function value to zero, and the similarity threshold penalty is not effective.
The method provided by the embodiment of the invention can apply the medical penalty function value to the determination of the global function value, and can also apply at least one of the synonym transfer penalty function value, the similarity penalty function value and the similarity threshold penalty function value to the determination of the global function value, so that while the value of the candidate synonym relationship is restricted through the medical relationship, the value of the candidate synonym relationship is restricted by applying the similarity judgment and the rule of the synonym relationship, and the accuracy of the final synonym relationship determination is further improved.
Based on any of the above embodiments, when the candidate synonym relationship between any two diagnostic names is "no" and there are a plurality of transfer diagnostic names whose candidate synonym relationship with both of the two diagnostic names is "yes", the synonym transfer penalty function value of the two diagnostic names is the maximum of the penalty scores corresponding to the transfer diagnostic names; the penalty score corresponding to any transfer diagnostic name is determined based on the similarities between the transfer diagnostic name and the two diagnostic names.
Specifically, for any two diagnostic names p and q, any transfer diagnostic name r_i is the i-th of the plurality of transfer diagnostic names of p and q. A transfer diagnostic name is a diagnostic name whose candidate synonym relationship with both p and q is "yes", that is, Exist(p, r_i) = 1 and Exist(q, r_i) = 1.
If the candidate synonym relationship between p and q is Exist(p, q) = 0, then for any transfer diagnostic name r_i, the penalty score corresponding to r_i is determined based on the similarities between p and q, between p and r_i, and between q and r_i, wherein the higher the similarity Prob(p, r_i) between p and r_i and the similarity Prob(q, r_i) between q and r_i, the higher the penalty score corresponding to r_i.
Based on any of the above embodiments, the penalty score corresponding to r_i is expressed as:
(Prob(p,r_i)+Prob(r_i,q)-Prob(p,q))*(1-Exist(p,q))
Wherein, if the candidate synonym relationship between p and q is "yes", the penalty score is 0; if the candidate synonym relationship between p and q is "no", the penalty score is Prob(p, r_i) + Prob(r_i, q) - Prob(p, q). Taking the maximum of the penalty scores over all transfer diagnostic names gives the synonym transfer penalty function value S4:
S4=|max((Prob(p,r_i)+Prob(r_i,q)-Prob(p,q))*(1-Exist(p,q)))|
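The synonym transfer penalty can be illustrated with a small sketch. The names A, B and C, their similarities, and the dictionary-based data layout are all hypothetical; the sketch only follows the S4 expression above and is not presented as the embodiment's implementation.

```python
def pair(a: str, b: str) -> frozenset:
    """Order-independent key for a pair of diagnosis names."""
    return frozenset((a, b))


def synonym_transfer_penalty(p, q, names, prob, exist):
    """S4 for (p, q): largest transfer penalty score, zero if none applies."""
    if exist[pair(p, q)] == 1:
        return 0.0  # the factor (1 - Exist(p, q)) vanishes
    scores = [
        prob[pair(p, r)] + prob[pair(r, q)] - prob[pair(p, q)]
        for r in names
        if r not in (p, q) and exist[pair(p, r)] == 1 and exist[pair(q, r)] == 1
    ]
    return abs(max(scores)) if scores else 0.0


# Hypothetical data: C is a synonym of both A and B, but A and B are marked "no".
names = ["A", "B", "C"]
prob = {pair("A", "B"): 0.6, pair("A", "C"): 0.9, pair("B", "C"): 0.8}
exist = {pair("A", "B"): 0, pair("A", "C"): 1, pair("B", "C"): 1}

s4_ab = synonym_transfer_penalty("A", "B", names, prob, exist)  # 0.9 + 0.8 - 0.6
s4_ac = synonym_transfer_penalty("A", "C", names, prob, exist)  # Exist = 1, no penalty
```

Here C acts as the transfer diagnostic name for the pair (A, B), so that pair is penalized, while the pair (A, C) is not.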
Based on any of the above embodiments, the similarity penalty function value S5 of p and q can be expressed as the absolute difference between the candidate synonym relationship Exist(p, q) and the similarity Prob(p, q) of p and q, specifically:
S5=|Exist(p,q)-Prob(p,q)|
Based on any of the above embodiments, assume that the preset similarity threshold is TH_Global, and let W(p, q) represent the result of comparing the similarity Prob(p, q) of p and q with TH_Global, specifically: W(p, q) = 1 if Prob(p, q) is greater than TH_Global, and W(p, q) = 0 otherwise.
The similarity threshold penalty function value S6 for p and q is thus:
S6=W(p,q)⊕Exist(p,q)
In the formula, ⊕ is the exclusive-or (XOR) symbol. When W(p, q) is 1 and Exist(p, q) is 0, or W(p, q) is 0 and Exist(p, q) is 1, that is, when the similarity is greater than the preset similarity threshold and the candidate synonym relationship is "no", or when the similarity is less than the preset similarity threshold and the candidate synonym relationship is "yes", S6 = 1 and the similarity threshold penalty takes effect; otherwise S6 = 0 and the similarity threshold penalty does not take effect.
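The two similarity-based penalties can be sketched together. S5 follows the absolute-difference formula, and S6 is 1 exactly when the threshold comparison W(p, q) and Exist(p, q) disagree, which is an exclusive-or of two binary values. The function names and the numeric inputs are hypothetical.

```python
def similarity_penalty(exist: int, prob: float) -> float:
    """S5 = |Exist(p, q) - Prob(p, q)|."""
    return abs(exist - prob)


def similarity_threshold_penalty(exist: int, prob: float, th_global: float) -> int:
    """S6 = 1 when the threshold comparison and Exist(p, q) disagree."""
    w = int(prob > th_global)  # W(p, q)
    return w ^ exist           # exclusive-or of two binary values


# Hypothetical inputs: threshold 0.5, similarities 0.8 and 0.3.
s5_yes_high = similarity_penalty(1, 0.8)                 # small penalty
s6_no_high = similarity_threshold_penalty(0, 0.8, 0.5)   # contradiction
s6_yes_high = similarity_threshold_penalty(1, 0.8, 0.5)  # consistent
s6_yes_low = similarity_threshold_penalty(1, 0.3, 0.5)   # contradiction
s6_no_low = similarity_threshold_penalty(0, 0.3, 0.5)    # consistent
```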
Based on any of the above embodiments, in the method, a global function value may be obtained based on the reference function value S of every two diagnosis names, together with the disease timing penalty function value S1, the superior-inferior relationship penalty function value S2, the time distribution difference penalty function value S3, the synonym transfer penalty function value S4, the similarity penalty function value S5, and the similarity threshold penalty function value S6. An objective function for maximizing the global function value may be specifically expressed as the following formula:
maximize(S-α*S1-β*S2-γ*S3-δ*S4-ε*S5-θ*S6)
In the formula, α, β, γ, δ, ε, and θ are preset weights corresponding to S1, S2, S3, S4, S5, and S6, respectively.
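A minimal sketch of the maximization is given below for a very small set of names. Several points are assumptions rather than details from the text: the global value is taken as the sum of the per-pair scores, the three medical penalties are collapsed into a single binary flag per pair, the search is exhaustive (feasible only for a handful of names), and all names, similarities and weights are hypothetical.

```python
from itertools import combinations, product


def pair_score(key, assign, prob, medical, names, w, th_global):
    """Score of one pair: S minus the weighted penalties."""
    e, s = assign[key], prob[key]
    p, q = tuple(key)
    s_med = medical.get(key, 0) & e  # stands in for S1, S2 and S3
    transfer = [
        prob[frozenset((p, r))] + prob[frozenset((r, q))] - s
        for r in names
        if r not in key and assign[frozenset((p, r))] and assign[frozenset((r, q))]
    ]
    s4 = abs(max(transfer)) * (1 - e) if transfer else 0.0
    s5 = abs(e - s)
    s6 = int(s > th_global) ^ e
    return (s * e - w["medical"] * s_med - w["transfer"] * s4
            - w["similarity"] * s5 - w["threshold"] * s6)


def maximize_global(names, prob, medical, w, th_global):
    """Try every 0/1 assignment of Exist and keep the best-scoring one."""
    keys = [frozenset(c) for c in combinations(names, 2)]
    best_value, best_assign = float("-inf"), None
    for bits in product((0, 1), repeat=len(keys)):
        assign = dict(zip(keys, bits))
        value = sum(pair_score(k, assign, prob, medical, names, w, th_global)
                    for k in keys)
        if value > best_value:
            best_value, best_assign = value, assign
    return best_assign, best_value


# Hypothetical example: A and B are similar; C has a medical relationship with both.
names = ["A", "B", "C"]
prob = {frozenset("AB"): 0.9, frozenset("AC"): 0.7, frozenset("BC"): 0.65}
medical = {frozenset("AC"): 1, frozenset("BC"): 1}
weights = {"medical": 2.0, "transfer": 1.0, "similarity": 0.5, "threshold": 0.5}

best_assign, best_value = maximize_global(names, prob, medical, weights, 0.5)
```

In this example the best assignment keeps A and B as synonyms and rejects both pairs involving C, even though their similarities exceed the threshold. For realistic numbers of diagnosis names an exhaustive search is impractical, and an iterative adjustment starting from a random assignment, as described for step 120, would be needed instead.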
Based on any of the above embodiments, fig. 3 is a schematic flowchart of a method for determining a standardized diagnosis name according to an embodiment of the present invention, as shown in fig. 3, step 130 specifically includes:
Step 131, determining a plurality of synonym diagnosis name sets based on the final synonym relationship between every two diagnosis names.
Specifically, after the final synonym relationship between every two diagnostic names is obtained, the synonym of each diagnostic name can be determined, so that a plurality of synonym diagnostic name sets are obtained. Here, any synonym diagnosis name set includes a plurality of diagnosis names, and any diagnosis name in the set is a synonym with at least one diagnosis name in the set.
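One way to obtain the synonym diagnosis name sets is to treat every "yes" relationship as an edge and collect the connected groups. The sketch below assumes this graph reading; the function name synonym_sets and the data layout are hypothetical, and a name without any synonym is returned as a group of its own.

```python
def synonym_sets(names, final_exist):
    """Group names into connected components of the "yes" relationships."""
    adjacency = {n: set() for n in names}
    for key, flag in final_exist.items():
        if flag:
            a, b = tuple(key)
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, groups = set(), []
    for start in names:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over synonym links
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical final synonym relationships; pairs not listed are "no".
names = ["cold", "acute upper respiratory infection", "upper infection", "hypertension"]
final_exist = {
    frozenset(("cold", "acute upper respiratory infection")): 1,
    frozenset(("acute upper respiratory infection", "upper infection")): 1,
}
groups = synonym_sets(names, final_exist)
```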
Step 132, if a plurality of bridge diagnosis names exist in any synonym diagnosis name set, taking the bridge diagnosis name with the highest similarity score as the standardized diagnosis name of the synonym diagnosis name set; otherwise, taking the diagnosis name with the highest similarity score in the synonym diagnosis name set as the standardized diagnosis name of the synonym diagnosis name set; wherein the similarity score of any diagnosis name is determined based on the similarities between that diagnosis name and each diagnosis name whose final synonym relationship with it is "yes".
Specifically, in any synonym diagnosis name set, if deleting one diagnosis name leaves some diagnosis name in the set that is no longer a synonym of any of the remaining diagnosis names, the deleted diagnosis name is a bridge diagnosis name. Bridge diagnosis names connect the diagnosis names within the synonym diagnosis name set.
In general, there are two cases in the synonym diagnosis name set, one is that there are a plurality of bridge diagnosis names in the synonym diagnosis name set, and the other is that there are no bridge diagnosis names in the synonym diagnosis name set.
Fig. 4 is a schematic structural diagram of a synonym diagnostic name set having bridge diagnostic names according to an embodiment of the present invention, and Fig. 5 is a schematic structural diagram of a synonym diagnostic name set having no bridge diagnostic name according to an embodiment of the present invention. In Fig. 4 and Fig. 5, each node corresponds to one diagnostic name in the synonym diagnostic name set, a connecting line between two nodes indicates that the final synonym relationship between the two diagnostic names is "yes", that is, the two nodes are synonyms, and the value on the connecting line is the similarity between the two diagnostic names. In Fig. 4, A and B are both bridge diagnosis names; in Fig. 5, no bridge diagnosis name exists.
When the bridge diagnosis names exist, calculating the similarity score of each bridge diagnosis name, and taking the bridge diagnosis name with the highest similarity score as a standardized diagnosis name; when there is no bridge diagnosis name, a similarity score is calculated for each diagnosis name, and the diagnosis name having the highest similarity score is set as the standardized diagnosis name. The similarity score is determined based on the similarity between the diagnosis name and each diagnosis name having the final synonym relationship of "yes", that is, based on the value on the connecting line of the nodes corresponding to the diagnosis name.
Further, the similarity score may be a combination of the similarities between the diagnosis name and each diagnosis name with which its final synonym relationship is "yes". For example, in Fig. 4, the similarity score of node A is 7.3 and the similarity score of node B is 6.67, so the diagnosis name corresponding to node A is taken as the standardized diagnosis name. In Fig. 5, the node with the highest similarity score is A, and the diagnosis name corresponding to node A is taken as the standardized diagnosis name.
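The selection rule described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the embodiment: a bridge diagnosis name is read as a name whose deletion leaves some other name with no remaining synonym in the set, the "combination" of similarities is taken to be a plain sum, and the function names, the node labels and the edge values are hypothetical (they are not the values of Fig. 4 or Fig. 5).

```python
from typing import Dict, FrozenSet, List, Set

# Edges of one synonym diagnosis name set: each key is an unordered pair of
# names whose final synonym relationship is "yes"; the value is their similarity.
Edges = Dict[FrozenSet[str], float]


def neighbours(name: str, edges: Edges) -> Set[str]:
    """Names that are direct synonyms of `name`."""
    return {other for pair in edges if name in pair for other in pair if other != name}


def bridge_names(names: List[str], edges: Edges) -> List[str]:
    """Assumed reading: a name is a bridge if deleting it leaves some other
    name in the set without any synonym among the remaining names."""
    bridges = []
    for candidate in names:
        rest = [n for n in names if n != candidate]
        if any(not (neighbours(n, edges) - {candidate}) for n in rest):
            bridges.append(candidate)
    return bridges


def similarity_score(name: str, edges: Edges) -> float:
    """Assumed combination: sum of the similarities on the name's connecting lines."""
    return sum(weight for pair, weight in edges.items() if name in pair)


def standardized_name(names: List[str], edges: Edges) -> str:
    """Pick the best bridge name if any exist, otherwise the best name overall."""
    candidates = bridge_names(names, edges) or list(names)
    return max(candidates, key=lambda n: similarity_score(n, edges))


# Hypothetical set with five names; the similarities are illustrative only.
names = ["A", "B", "C", "D", "E"]
edges: Edges = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"A", "C"}): 0.8,
    frozenset({"A", "D"}): 0.7,
    frozenset({"B", "E"}): 0.6,
}
bridges = bridge_names(names, edges)
chosen = standardized_name(names, edges)
```

In this illustrative set, A and B are the only names whose deletion isolates another name, and A has the larger summed similarity, so A would be chosen under these assumptions.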
Based on any of the above embodiments, the similarity between any two diagnosis names is determined based on the diagnosis attributes corresponding to the two diagnosis names, and the diagnosis attribute corresponding to any diagnosis name is extracted from the medical record data corresponding to the diagnosis name.
Specifically, a large amount of medical record data corresponding to any diagnosis name may be collected in advance, and a plurality of diagnosis attributes corresponding to the diagnosis name may be extracted from the medical record data. Here, the diagnostic attribute may include at least one of a chief complaint, a current medical history, an allergy history, a past medical history, an auxiliary examination, and a visiting department.
The similarity Prob(p, q) between diagnosis names p and q can be obtained by weighting the similarity of each diagnosis attribute corresponding to p and q. For example, suppose the diagnosis attributes corresponding to p and q include the chief complaint (Main), current medical history (Current), allergy history (Allergy), past medical history (Previous), auxiliary examination (Auxiliary), and visiting department (Dep). Taking the chief complaint as an example, the similarity Prob_Main(p, q) of the chief complaint is calculated as follows:
Prob_Main(p, q) = card(p_Main ∩ q_Main) / card(p_Main ∪ q_Main)
In the formula, p_Main and q_Main represent the chief complaints of p and q, respectively, p_Main ∩ q_Main and p_Main ∪ q_Main are the intersection and union of the two, respectively, and card represents the number of elements in a set.
Based on similar formulas, the similarity of the current medical history Prob_Current(p, q), the similarity of the allergy history Prob_Allergy(p, q), the similarity of the past medical history Prob_Previous(p, q), the similarity of the auxiliary examination Prob_Auxiliary(p, q), and the similarity of the visiting department Prob_Dep(p, q) can be obtained respectively.
Then, the similarity Prob(p, q) between p and q can be obtained by weighting the similarities of the individual diagnosis attributes.
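As a minimal sketch of this calculation, the following Python code treats each diagnosis attribute as a set of terms, computes the per-attribute similarity as the number of elements in the intersection divided by the number of elements in the union (as the description of the formula suggests), and combines the per-attribute values by a weighted sum. The weights, the treatment of two empty sets as similarity 0, the function names and the sample terms are assumptions made for illustration; the embodiment does not specify them.

```python
from typing import Dict, Set

Attributes = Dict[str, Set[str]]


def attribute_similarity(p_attr: Set[str], q_attr: Set[str]) -> float:
    """card(intersection) / card(union); defined as 0.0 when both sets are empty (assumption)."""
    union = p_attr | q_attr
    if not union:
        return 0.0
    return len(p_attr & q_attr) / len(union)


def similarity(p: Attributes, q: Attributes, weights: Dict[str, float]) -> float:
    """Weighted sum of the per-attribute similarities (weights are hypothetical)."""
    return sum(
        weight * attribute_similarity(p.get(attr, set()), q.get(attr, set()))
        for attr, weight in weights.items()
    )


# Illustrative attribute sets for two diagnosis names p and q.
p = {"Main": {"cough", "fever", "sore throat"}, "Dep": {"respiratory"}}
q = {"Main": {"cough", "fever", "runny nose"}, "Dep": {"respiratory"}}
weights = {"Main": 0.5, "Dep": 0.5}  # hypothetical weights

prob_main = attribute_similarity(p["Main"], q["Main"])  # 2 shared terms out of 4
prob = similarity(p, q, weights)
```

Here the chief complaints share two of four distinct terms, giving Prob_Main(p, q) = 0.5, and the visiting departments are identical, giving Prob_Dep(p, q) = 1.0; with equal weights the overall similarity is 0.75.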
Based on any one of the embodiments, the method for standardizing the diagnosis name comprises the following steps:
First, a large amount of medical record data is acquired, and a plurality of diagnosis names to be standardized, together with the diagnosis attributes corresponding to each diagnosis name, are extracted from the medical record data. Based on the diagnosis attributes of every two diagnosis names, the similarity between every two diagnosis names is calculated. In addition, the time-series relationship, the upper-lower relationship, and the difference in distribution of onset time between every two diagnosis names are determined in combination with medical knowledge.
Second, a global function value is calculated based on the similarity, the time-series relationship, the upper-lower relationship, and the difference in distribution of onset time between every two diagnosis names, and the candidate synonym relationship between every two diagnosis names is adjusted with the goal of maximizing the global function value until the maximum value of the global function value is obtained.
The candidate synonym relationship between every two diagnosis names corresponding to the maximum value is then taken as the final synonym relationship between every two diagnosis names.
After the final synonym relationship between every two diagnosis names is determined, a plurality of synonym diagnosis name sets can be determined based on the final synonym relationship between every two diagnosis names. For any synonym diagnosis name set, if bridge diagnosis names exist, the similarity score of each bridge diagnosis name is calculated, and the bridge diagnosis name with the highest similarity score is taken as the standardized diagnosis name; otherwise, the similarity score of each diagnosis name is calculated, and the diagnosis name with the highest similarity score is taken as the standardized diagnosis name.
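One possible way to derive the synonym diagnosis name sets from the pairwise final synonym relationships is sketched below in Python. The embodiment does not say how the sets are formed; the sketch assumes that each set is a connected group of names linked directly or indirectly by "yes" relationships. The function name and the sample data (including the unrelated name "hypertension") are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def synonym_sets(names: List[str], synonym_pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group names into sets connected by final synonym relationships of "yes"."""
    adjacency: Dict[str, Set[str]] = {name: set() for name in names}
    for a, b in synonym_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in names:
        if start in seen:
            continue
        # Depth-first traversal collects every name reachable from `start`.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


names = ["cold", "acute upper respiratory infection", "upper infection", "hypertension"]
pairs = [
    ("cold", "acute upper respiratory infection"),
    ("acute upper respiratory infection", "upper infection"),
]
groups = synonym_sets(names, pairs)
```

With this input the first three names form one set even though "cold" and "upper infection" are not directly linked, and "hypertension" forms a set of its own.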
Based on any of the above embodiments, Fig. 6 is a schematic structural diagram of a diagnosis name normalizing device provided by an embodiment of the present invention. As shown in Fig. 6, the diagnosis name normalizing device includes a diagnosis name determining unit 610, a synonym relationship determining unit 620, and a normalizing unit 630;
wherein, the diagnosis name determining unit 610 is used for determining a plurality of diagnosis names;
the synonym relationship determining unit 620 is configured to adjust the candidate synonym relationship between every two diagnostic names based on the similarity and the medical relationship between every two diagnostic names in the plurality of diagnostic names, so as to obtain a final synonym relationship between every two diagnostic names;
the normalization unit 630 is used for determining a normalized diagnosis name corresponding to each diagnosis name based on the final synonym relationship between every two diagnosis names.
The device provided by the embodiment of the invention adjusts the synonym relation between the diagnosis names based on the similarity and the medical relation between every two diagnosis names, so that the standardization of the diagnosis names is realized, the medical knowledge is integrated into the determination process of the synonym relation, and the synonym relation and the similarity are mutually restricted, thereby improving the accuracy and the reliability of the standardization of the diagnosis names.
Based on any of the above embodiments, the medical relationship between any two diagnosis names includes at least one of a time-series relationship, an upper-lower relationship, and a difference in distribution of onset time between the two diagnosis names.
Based on any of the embodiments above, the synonym relationship determining unit 620 is specifically configured to:
adjusting the candidate synonym relation between every two diagnostic names in the plurality of diagnostic names until the corresponding obtained global function value obtains the maximum value;
taking the candidate synonym relation between every two diagnosis names corresponding to the maximum value as the final synonym relation between every two diagnosis names;
wherein the global function value is determined based on a reference function value and a medical penalty function value for each two diagnostic names; the reference function value of any two diagnosis names is determined based on the candidate synonym relationship and the similarity between the any two diagnosis names, and the medical penalty function value of any two diagnosis names is determined based on the candidate synonym relationship and the medical relationship between the any two diagnosis names.
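The adjustment of candidate synonym relationships toward a maximum global function value can be sketched as follows in Python. This is only an illustration under stated assumptions: the concrete forms of the reference function and the medical penalty function are not given in this passage, so the versions below are hypothetical stand-ins; the global value is assumed to be the sum of reference values minus penalties over all pairs; and exhaustive enumeration is used in place of whatever adjustment procedure an implementation would actually adopt (it is feasible only for a handful of names).

```python
from itertools import combinations, product
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def reference_value(is_synonym: bool, sim: float) -> float:
    """Hypothetical reference function: rewards agreement between relation and similarity."""
    return sim if is_synonym else 1.0 - sim


def medical_penalty(is_synonym: bool, has_conflict: bool) -> float:
    """Hypothetical penalty: charged when a pair with a conflicting medical
    relationship (e.g. a time-series or upper-lower relationship) is marked synonymous."""
    return 1.0 if (is_synonym and has_conflict) else 0.0


def global_value(relation: Dict[Pair, bool], sims: Dict[Pair, float], conflicts: Set[Pair]) -> float:
    """Assumed combination: sum over pairs of reference value minus medical penalty."""
    return sum(
        reference_value(flag, sims[pair]) - medical_penalty(flag, pair in conflicts)
        for pair, flag in relation.items()
    )


def best_relation(names: List[str], sims: Dict[Pair, float], conflicts: Set[Pair]):
    """Enumerate every candidate assignment and keep the one with the largest global value."""
    pairs = list(combinations(names, 2))
    best, best_val = None, float("-inf")
    for flags in product((False, True), repeat=len(pairs)):
        relation = dict(zip(pairs, flags))
        val = global_value(relation, sims, conflicts)
        if val > best_val:
            best, best_val = relation, val
    return best, best_val


# Illustrative input: three placeholder names with made-up similarities.
names = ["d1", "d2", "d3"]
sims = {("d1", "d2"): 0.9, ("d1", "d3"): 0.6, ("d2", "d3"): 0.3}
conflicts = {("d1", "d3")}  # hypothetical medical relationship that argues against synonymy
final_relation, max_value = best_relation(names, sims, conflicts)
```

In this toy input the pair ("d1", "d3") has a similarity above one half, yet it is not marked as a synonym in the maximizing assignment because the assumed medical penalty outweighs the reference value, which mirrors how medical knowledge is meant to constrain a purely similarity-based decision.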
According to any of the above embodiments, the medical penalty function value includes at least one of a disease timing penalty function value, a superior-inferior relationship penalty function value, and a time distribution difference penalty function value;
the disease time sequence penalty function value of any two diagnosis names is determined based on the time sequence relation and the candidate synonym relation between the any two diagnosis names;
the penalty function value of the upper and lower relations of any two diagnosis names is determined based on the upper and lower relations between any two diagnosis names and the candidate synonym relation;
the time distribution difference penalty function value of any two diagnosis names is determined based on the time distribution difference of onset between the any two diagnosis names and the candidate synonym relationship.
According to any of the above embodiments, the global function value is specifically determined based on the reference function value and the medical penalty function value of every two diagnosis names, and at least one of a synonym transfer penalty function value, a similarity penalty function value, and a similarity threshold penalty function value of every two diagnosis names;
the synonym transfer penalty function value of any two diagnosis names is determined based on the candidate synonym relation between any two diagnosis names and the rest diagnosis names;
the similarity penalty function value of any two diagnosis names is determined based on the candidate synonym relation and the similarity between any two diagnosis names;
the similarity threshold penalty function value of any two diagnosis names is determined based on the candidate synonym relationship and the similarity between any two diagnosis names and a preset similarity threshold.
Based on any of the above embodiments, when the candidate synonym relationship between any two diagnosis names is "no" and there are a plurality of transfer diagnosis names whose candidate synonym relationships with both of the two diagnosis names are "yes", the synonym transfer penalty function value of the two diagnosis names is the maximum of the penalty scores corresponding to the respective transfer diagnosis names;
the penalty score corresponding to any transfer diagnosis name is determined based on the similarities between that transfer diagnosis name and each of the two diagnosis names.
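The rule above can be sketched in Python as follows. The passage states only that the penalty score of a transfer diagnosis name depends on its similarities to the two diagnosis names and that the maximum over all transfer diagnosis names is used; the choice of the smaller of the two similarities as the penalty score, the return value of 0 when the rule does not apply, and all names and values below are assumptions for illustration.

```python
from typing import Dict, FrozenSet, List


def key(a: str, b: str) -> FrozenSet[str]:
    """Unordered pair key."""
    return frozenset({a, b})


def transfer_penalty(
    a: str,
    b: str,
    names: List[str],
    relation: Dict[FrozenSet[str], bool],
    sims: Dict[FrozenSet[str], float],
) -> float:
    """Synonym transfer penalty for the pair (a, b) under the stated assumptions."""
    if relation.get(key(a, b), False):
        return 0.0  # assumption: no transfer penalty when a and b are already synonyms
    transfers = [
        t for t in names
        if t not in (a, b)
        and relation.get(key(a, t), False)
        and relation.get(key(b, t), False)
    ]
    if not transfers:
        return 0.0  # assumption: nothing to penalize without a transfer name
    # Assumed penalty score per transfer name: the smaller of its two similarities.
    return max(min(sims[key(a, t)], sims[key(b, t)]) for t in transfers)


names = ["a", "b", "t1", "t2"]
relation = {
    key("a", "b"): False,
    key("a", "t1"): True, key("b", "t1"): True,
    key("a", "t2"): True, key("b", "t2"): True,
}
sims = {
    key("a", "t1"): 0.9, key("b", "t1"): 0.7,
    key("a", "t2"): 0.8, key("b", "t2"): 0.8,
}
penalty = transfer_penalty("a", "b", names, relation, sims)
```

In this example both t1 and t2 are synonyms of a and of b while a and b are not synonyms of each other; the assumed penalty scores are 0.7 for t1 and 0.8 for t2, so the transfer penalty for the pair is 0.8.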
Based on any of the above embodiments, the normalization unit 630 is specifically configured to:
determining a plurality of synonym diagnostic name sets based on the final synonym relationship between every two diagnostic names;
if a plurality of bridge diagnosis names exist in any synonym diagnosis name set, taking the bridge diagnosis name with the highest similarity score as the standardized diagnosis name of any synonym diagnosis name set;
otherwise, taking the diagnosis name with the highest similarity score in any synonym diagnosis name set as the standardized diagnosis name of any synonym diagnosis name set;
wherein the similarity score of any diagnosis name is determined based on the similarity between that diagnosis name and each diagnosis name with which its final synonym relationship is "yes".
Based on any of the above embodiments, the similarity between any two diagnosis names is determined based on the diagnosis attributes corresponding to the any two diagnosis names, and the diagnosis attribute corresponding to any diagnosis name is extracted from the medical record data corresponding to the any diagnosis name.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 7, the electronic device may include: a processor 710, a communication interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may call logic instructions in the memory 730 to perform the following method: determining a plurality of diagnosis names; adjusting the candidate synonym relationship between every two diagnosis names based on the similarity and the medical relationship between every two diagnosis names in the plurality of diagnosis names, to obtain the final synonym relationship between every two diagnosis names; and determining the standardized diagnosis name corresponding to each diagnosis name based on the final synonym relationship between every two diagnosis names.
In addition, when the logic instructions in the memory 730 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes a plurality of instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the method provided in the foregoing embodiments, the method including: determining a plurality of diagnosis names; adjusting the candidate synonym relationship between every two diagnosis names based on the similarity and the medical relationship between every two diagnosis names in the plurality of diagnosis names, to obtain the final synonym relationship between every two diagnosis names; and determining the standardized diagnosis name corresponding to each diagnosis name based on the final synonym relationship between every two diagnosis names.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.