CN109300550B

CN109300550B - Medical data relation mining method and device

Info

Publication number: CN109300550B
Application number: CN201811330207.XA
Authority: CN
Inventors: 焦增涛
Original assignee: Tianjin Happy Life Technology Co ltd; Tianjin Xinkaixin Life Technology Co ltd
Current assignee: Tianjin Happy Life Technology Co ltd; Tianjin Xinkaixin Life Technology Co ltd
Priority date: 2018-11-09
Filing date: 2018-11-09
Publication date: 2021-11-26
Anticipated expiration: 2038-11-09
Also published as: CN113963804A; CN109300550A

Abstract

The invention relates to a medical data relation mining method and device, electronic equipment and a computer readable medium. The method comprises the following steps: acquiring first medical data and second medical data in a target text; performing feature extraction on the first medical data and the second medical data to obtain feature vectors of the first medical data and the second medical data; inputting the feature vectors into a trained classification model, and determining a target relationship between the first medical data and the second medical data. The method and the system can efficiently identify the relationship between the medical data in the clinical case text, improve the efficiency of relation mining of the medical data, and are favorable for further data statistical analysis.

Description

Medical data relation mining method and device

Technical Field

The invention relates to the field of medical information extraction, in particular to a medical data relation mining method, a medical processing device, electronic equipment and a computer readable medium.

Background

In the clinical case text, much information is recorded in a long text form, so that the subsequent statistical analysis task is inconvenient. The structuring of clinical cases can solve such technical problems. Among them, the relationship mining of medical terms in long texts is a very important step for the clinical data structuring.

In the prior art, medical data relation mining relies either on manually abstracted rules or on syntactic dependency analysis of the text based on natural language processing.

However, manually written rules are a one-size-fits-all approach whose effectiveness depends on how fine-grained the rules are. The approach based on syntactic dependency analysis incurs a very high labeling cost when trained for a specific domain, so it is rarely applied directly to clinical cases.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present invention and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.

Disclosure of Invention

The invention aims to provide a medical data relationship mining method and a medical data relationship mining device, which can efficiently identify the relationship between medical data in clinical case texts and improve the efficiency of medical data relationship mining.

According to an aspect of the present invention, there is provided a medical data relationship mining method, including: acquiring first medical data and second medical data in a target text; performing feature extraction on the first medical data and the second medical data to obtain feature vectors of the first medical data and the second medical data; inputting the feature vectors into a trained classification model, and determining a target relationship between the first medical data and the second medical data.

In an exemplary embodiment of the invention, the target relation comprises any one of a negative word to medical data relation, a time to medical data relation, a numerical value to medical data relation, an anatomical region to medical data relation, an action to medical data relation, and a relative to medical data relation.

In an exemplary embodiment of the invention, performing feature extraction on the first medical data and the second medical data includes: acquiring at least one of an intrinsic feature of the first medical data, an intrinsic feature of the second medical data, and a peripheral text feature, a syntactic dependency analysis feature, and a sentence morphology feature of the first medical data and the second medical data.

In an exemplary embodiment of the invention, the intrinsic feature of the first medical data comprises at least one of the following features: whether the first medical data is a diagnosis; whether the first medical data is an anatomical site; whether the first medical data is a symptom; whether the first medical data is a lesion word; whether the first medical data is a negative word; whether the first medical data contains a verb; whether the first medical data contains a number; whether the length of the first medical data exceeds a preset number of bytes; whether the first medical data contains a time word.

In an exemplary embodiment of the present invention, the peripheral text feature includes at least one of a preceding information text feature of the first medical data, a following information text feature of the second medical data, and a text feature between the first medical data and the second medical data.

In an exemplary embodiment of the invention, the preceding information text feature of the first medical data comprises at least one of the following features, each evaluated within a preset number of words before the first medical data: whether a period exists; whether a comma exists; whether a space or an enumeration comma exists; whether a negative word exists; whether a backward-acting negative word exists; whether the word "companion" exists; whether the word "even" exists; whether an omitted word exists; whether a verb representing a behavior exists; whether a diagnosis exists; whether an anatomical site exists; whether a symptom exists; whether a lesion word exists; whether a pattern of consecutive concepts separated by punctuation exists; whether a time exists; whether a number exists; whether a verb exists.

In an exemplary embodiment of the invention, the text features between the first medical data and the second medical data comprise at least one of the following features: the distance between the first medical data and the second medical data; the order of the first medical data and the second medical data; and, for the text between the first medical data and the second medical data: the number of periods; the number of commas; the number of spaces or enumeration commas; whether the word "tie" exists; whether the word "even" exists; whether a verb representing a behavior exists; whether a backward-acting negative word exists; whether an escape word exists; whether a negative word exists; whether a diagnosis exists; whether an anatomical site exists; whether a symptom exists; whether a lesion word exists; whether a pattern of consecutive concepts separated by punctuation exists; whether a number exists; whether a time exists; whether a verb exists.

In an exemplary embodiment of the invention, the syntactic dependency analysis feature includes at least one of the following features: whether there is a parent-child relationship between the first medical data and the second medical data; the length of the path between the first medical data and the second medical data on the dependency tree; whether the path between the first medical data and the second medical data contains a subject-verb relation edge; whether the path contains a verb-object relation edge; whether the path contains an attributive relation edge or an adverbial structure edge; whether the first edge on the path is a verb-object relation or a subject-verb relation; whether the first edge on the path is an attributive relation or an adverbial structure; whether the last edge on the path is a verb-object relation or a subject-verb relation; and whether the last edge on the path between the first medical data and the second medical data is a verb-object relation or a subject-verb relation.

In an exemplary embodiment of the present invention, the sentence morphology feature includes at least one of the following features: whether the first medical data and the second medical data are in the same paragraph; whether the first medical data and the second medical data are in the same sentence; whether the first medical data and the second medical data are in the same clause; whether the first medical data and the second medical data are in the same paragraph with no other medical data of the same category as the first medical data or the second medical data between them; whether the first medical data and the second medical data are in the same sentence with no other medical data of the same category as the first medical data or the second medical data between them; whether the first medical data and the second medical data are in the same clause with no other medical data of the same category as the first medical data or the second medical data between them.

According to an aspect of the present invention, there is provided a medical data relationship mining apparatus, including: the medical data acquisition module is configured to acquire first medical data and second medical data in the target text; a feature extraction module configured to perform feature extraction on the first medical data and the second medical data to obtain feature vectors of the first medical data and the second medical data; and the target relation judging module is configured to input the feature vectors into a trained classification model and judge the target relation between the first medical data and the second medical data.

According to an aspect of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the medical data relationship mining method according to any of the embodiments described above.

According to an aspect of the present invention, there is provided an electronic apparatus including: one or more processors; a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the medical data relationship mining method of any of the embodiments described above.

In the medical data relationship mining method and device of an exemplary embodiment of the invention, first medical data and second medical data in a target text are acquired; feature extraction is performed on the first medical data and the second medical data to obtain their feature vectors; and the feature vectors are input into a trained classification model to determine the target relationship between the first medical data and the second medical data. The relationship between medical data in clinical case text can thereby be identified efficiently, which improves the efficiency of medical data relationship mining and facilitates further statistical analysis of the data.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.

FIG. 1 shows a flow diagram of a medical data relationship mining method according to an exemplary embodiment of the invention;

FIG. 2 shows a schematic diagram of a classification model feature set according to an exemplary embodiment of the invention;

FIG. 3 shows a flow diagram of a medical data relationship mining method according to another exemplary embodiment of the invention;

FIG. 4 shows a flow diagram of a medical data relationship mining method according to a further exemplary embodiment of the present invention;

FIG. 5 shows a flow diagram of a medical data relationship mining method according to a further exemplary embodiment of the present invention;

FIG. 6 shows a block diagram of a medical data relationship mining apparatus according to an exemplary embodiment of the present invention;

FIG. 7 is a diagram illustrating an exemplary system architecture to which a medical data relationship mining method or medical data relationship mining apparatus of an embodiment of the present invention may be applied;

FIG. 8 shows a schematic diagram of an electronic device suitable for use in implementing embodiments of the present invention.

DETAILED DESCRIPTION OF EMBODIMENT (S) OF INVENTION

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.

In the prior art, the following three methods are adopted to perform medical data relation mining:

The first method: manually abstracted rules. Whether the medical data satisfy a certain pattern is judged from the surface form of the text, and from this it is judged whether the relationship between the medical data holds. For example, it is determined whether two medical data lie within the same comma-separated clause.

The first method has at least the following disadvantages: manual rules are a one-size-fits-all approach whose effectiveness depends on how fine-grained the rules are; the labor cost is high; the rules may fail to cover new data; and the rules may conflict with or exclude one another.

The second method: syntactic dependency analysis of the text in natural language processing. Syntactic dependency analysis is a classic natural language processing task that determines the grammatical relations among the words of a sentence, such as the subject-predicate relation, the verb-object relation, and the modification relation. Whether the medical data satisfy the target relationship is then judged from the structure produced by the dependency analysis.

The second method has at least the following disadvantages: although it is ideal in principle, current Chinese syntactic dependency analysis models in the industry perform only moderately, and the labeling cost of training for a specific domain is very high, so the method is rarely applied directly to clinical cases.

The third method: training a classification model for a specific medical data relationship. According to the task objective, samples of the medical data relationship are labeled in clinical cases, a general-purpose machine learning classification model is used for classification, and whether the target relationship holds is judged.

The third method is relatively feasible: for a specific relationship type and a specific application field, a training corpus is labeled and a classifier determines whether the corresponding relationship holds. However, such methods require dedicated labeling and training for each medical term relationship and each application scenario, so the results do not generalize.

In the embodiment of the present invention, the medical data may also be referred to as clinical data and may be medical terms. A medical term is a word that denotes a definite medical concept in medical research or in a medical event. What counts as clinical data must be defined in combination with the specific clinical task: in a particular task, words such as "father" and "mother" may also be objects of interest and may therefore also be medical terms.

In the embodiment of the present invention, medical data relationship mining addresses the fact that the medical information presented in the long text of a clinical case is usually expressed by several medical terms together, or by a combination of medical terms and other words.

For example, family history: "the father is healthy and the mother is deceased and died from lung cancer", wherein the key medical information is:

Relative: mother
Family disease: lung cancer

Here, medical data relationship mining means discovering from the text that the lung cancer is the mother's disease, not the father's.
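The family-history example above can be written down as a small data structure. The following is only an illustrative sketch: the class names `Entity` and `Relation`, the category labels, and the `holds` flag are assumptions introduced here and do not appear in the original disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str      # surface form in the case text
    category: str  # illustrative label, e.g. "relative" or "disease"
    start: int     # character offset of the first character


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    holds: bool    # True if the text asserts the relation


text = "the father is healthy and the mother is deceased and died from lung cancer"
father = Entity("father", "relative", text.index("father"))
mother = Entity("mother", "relative", text.index("mother"))
cancer = Entity("lung cancer", "disease", text.index("lung cancer"))

# What a correct miner should report for this sentence.
relations = [
    Relation(mother, cancer, True),
    Relation(father, cancer, False),
]
```

Each candidate pair becomes one record, so the mining task reduces to deciding the `holds` flag for every pair.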

The invention provides a medical data relation mining method that abstracts the relationship types among medical data and recognizes them by machine learning.

In the exemplary embodiment, a medical data relationship mining method is first provided. Referring to FIG. 1, the medical data relationship mining method includes the following steps:

In step S110, first medical data A and second medical data B in the target text are acquired.

In the embodiment of the invention, the target text may be a clinical case to be mined. A and B may be extracted from the long text of the clinical case by an entity recognition algorithm; for the specific entity recognition algorithm, reference may be made to the prior art, and it is not detailed herein.
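The disclosure leaves the entity recognition algorithm to the prior art. Purely as a stand-in for step S110, the sketch below uses greedy longest-match lookup against a small lexicon; the function name `find_entities`, the lexicon contents, and the category labels are hypothetical.

```python
def find_entities(text, lexicon):
    """Return (start, end, surface, category) tuples by greedy longest match."""
    terms = sorted(lexicon, key=len, reverse=True)  # try longer terms first
    found, i = [], 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                found.append((i, i + len(term), term, lexicon[term]))
                i += len(term)
                break
        else:
            i += 1  # no term starts here; move one character forward
    return found


# Hypothetical lexicon; a real system would use a trained recognizer.
LEXICON = {
    "father": "relative",
    "mother": "relative",
    "lung cancer": "disease",
    "cancer": "disease",
}
entities = find_entities("the mother died from lung cancer", LEXICON)
```

Any two of the returned entities can then serve as the pair (A, B) passed to feature extraction.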

In step S120, feature extraction is performed on the first medical data and the second medical data, and feature vectors of the first medical data and the second medical data are obtained.

In an exemplary embodiment, performing feature extraction on the first medical data and the second medical data may include: acquiring at least one of the intrinsic feature of the first medical data, the intrinsic feature of the second medical data, and the peripheral text feature, the syntactic dependency analysis feature, the sentence morphology feature, and the like of the first medical data and the second medical data.

In an exemplary embodiment, the intrinsic feature of the first medical data may include at least one of the following features: whether the first medical data is a diagnosis; whether the first medical data is an anatomical site; whether the first medical data is a symptom; whether the first medical data is a lesion word; whether the first medical data is a negative word; whether the first medical data contains a verb; whether the first medical data contains a number; whether the length of the first medical data exceeds a preset number of bytes; whether the first medical data contains a time word.
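The intrinsic features listed above map naturally onto a fixed-length list of 0/1 values. The following is a minimal sketch under stated assumptions: the category labels, the time-word list, the 12-byte threshold, and the `has_verb` argument (which would normally come from a part-of-speech tagger) are all illustrative and not taken from the disclosure.

```python
import re

# Hypothetical category labels and word lists; real ones are task-specific.
CATEGORIES = ["diagnosis", "anatomical_site", "symptom", "lesion", "negation"]
TIME_WORDS = ("year", "month", "week", "day")
MAX_BYTES = 12  # assumed preset length threshold


def intrinsic_features(surface, category, has_verb=False):
    """Return the intrinsic features of one medical data item as 0/1 values."""
    feats = [int(category == c) for c in CATEGORIES]            # type flags
    feats.append(int(has_verb))                                 # contains a verb
    feats.append(int(re.search(r"\d", surface) is not None))    # contains a number
    feats.append(int(len(surface.encode("utf-8")) > MAX_BYTES)) # length check
    feats.append(int(any(w in surface for w in TIME_WORDS)))    # contains a time word
    return feats
```

The same function would be applied to A and to B, and the two lists concatenated.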

In an exemplary embodiment, the peripheral text feature may include at least one of a preceding information text feature of the first medical data, a following information text feature of the second medical data, a text feature between the first medical data and the second medical data, and the like.

In an exemplary embodiment, the preceding information text feature of the first medical data may include at least one of the following features, each evaluated within a preset number of words before the first medical data: whether a period exists; whether a comma exists; whether a space or an enumeration comma exists; whether a negative word exists; whether a backward-acting negative word exists; whether the word "companion" exists; whether the word "even" exists; whether an omitted word exists; whether a verb representing a behavior exists; whether a diagnosis exists; whether an anatomical site exists; whether a symptom exists; whether a lesion word exists; whether a pattern of consecutive concepts separated by punctuation exists; whether a time exists; whether a number exists; whether a verb exists.
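A subset of these window features can be sketched as follows. The sketch assumes the text has already been tokenized and tagged; the tag names (`NEG`, `DIAG`, `SYM`, `VERB`), the window size of 10, and the function name are illustrative assumptions.

```python
WINDOW = 10  # assumed preset number of words


def preceding_features(tokens, tags, a_start):
    """tokens and tags are parallel lists; a_start is the index of A's first token."""
    lo = max(0, a_start - WINDOW)
    win_tokens = tokens[lo:a_start]
    win_tags = tags[lo:a_start]
    return {
        "has_period": int(any(t in {".", "。"} for t in win_tokens)),
        "has_comma": int(any(t in {",", "，"} for t in win_tokens)),
        "has_negation": int("NEG" in win_tags),
        "has_diagnosis": int("DIAG" in win_tags),
        "has_symptom": int("SYM" in win_tags),
        "has_number": int(any(t.isdigit() for t in win_tokens)),
        "has_verb": int("VERB" in win_tags),
    }


tokens = ["no", "fever", ",", "cough", "for", "3", "days"]
tags = ["NEG", "SYM", "PUNCT", "SYM", "O", "NUM", "TIME"]
feats = preceding_features(tokens, tags, a_start=3)  # A is "cough"
```

The following-text features of B would use the mirror-image window `tokens[b_end:b_end + WINDOW]`.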

In an exemplary embodiment, the text features between the first medical data and the second medical data may include at least one of the following features: the distance between the first medical data and the second medical data; the order of the first medical data and the second medical data; and, for the text between the first medical data and the second medical data: the number of periods; the number of commas; the number of spaces or enumeration commas; whether the word "tie" exists; whether the word "even" exists; whether a verb representing a behavior exists; whether a backward-acting negative word exists; whether an escape word exists; whether a negative word exists; whether a diagnosis exists; whether an anatomical site exists; whether a symptom exists; whether a lesion word exists; whether a pattern of consecutive concepts separated by punctuation exists; whether a number exists; whether a time exists; whether a verb exists.
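Several of the between-text features are simple counts over the tokens that separate A and B. The sketch below assumes single-token entities, measures distance as the number of intervening tokens, and uses illustrative tag names; none of these choices is specified by the disclosure.

```python
def between_features(tokens, tags, a_idx, b_idx):
    """a_idx and b_idx are the token indices of A and B."""
    left, right = sorted((a_idx, b_idx))
    mid_tokens = tokens[left + 1:right]
    mid_tags = tags[left + 1:right]
    return {
        "distance": right - left - 1,      # tokens strictly between A and B
        "a_before_b": int(a_idx < b_idx),  # order of A and B
        "n_periods": sum(t in {".", "。"} for t in mid_tokens),
        "n_commas": sum(t in {",", "，"} for t in mid_tokens),
        "n_pauses": sum(t in {" ", "、"} for t in mid_tokens),
        "has_negation": int("NEG" in mid_tags),
        "has_verb": int("VERB" in mid_tags),
    }


tokens = ["father", "has", "diabetes", ",", "mother", "is", "healthy"]
tags = ["REL", "VERB", "DIAG", "PUNCT", "REL", "VERB", "O"]
father_diabetes = between_features(tokens, tags, 0, 2)
mother_diabetes = between_features(tokens, tags, 4, 2)
```

In this toy sentence the comma separating "mother" from "diabetes" shows up as `n_commas == 1`, which is the kind of signal the classifier can learn from.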

In an exemplary embodiment, the syntactic dependency analysis feature may include at least one of the following features: whether there is a parent-child relationship between the first medical data and the second medical data; the length of the path between the first medical data and the second medical data on the dependency tree; whether the path between the first medical data and the second medical data contains a subject-verb relation edge; whether the path contains a verb-object relation edge; whether the path contains an attributive relation edge or an adverbial structure edge; whether the first edge on the path is a verb-object relation or a subject-verb relation; whether the first edge on the path is an attributive relation or an adverbial structure; whether the last edge on the path is a verb-object relation or a subject-verb relation; and whether the last edge on the path between the first medical data and the second medical data is a verb-object relation or a subject-verb relation.
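Given a dependency parse, the path-based features can be computed by walking from each node to the root and joining the two walks at their lowest common ancestor. The sketch below assumes the parse is supplied as a parent-index list plus one relation tag per token; this representation, the `ROOT` placeholder tag, and the function names are assumptions, and the parser itself is out of scope.

```python
def path_to_root(node, heads):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def dependency_features(a, b, heads, labels):
    """heads[i] is the parent of token i (-1 for the root); labels[i] is the
    relation tag on the edge from token i to its parent."""
    up_a, up_b = path_to_root(a, heads), path_to_root(b, heads)
    common = next(n for n in up_a if n in up_b)  # lowest common ancestor
    a_side = up_a[:up_a.index(common)]           # nodes whose parent edge is on the path
    b_side = up_b[:up_b.index(common)]
    edges = [labels[n] for n in a_side] + [labels[n] for n in reversed(b_side)]
    return {
        "parent_child": int(heads[a] == b or heads[b] == a),
        "path_length": len(edges),
        "has_sbv": int("SBV" in edges),
        "has_vob": int("VOB" in edges),
        "has_att_or_adv": int(any(e in {"ATT", "ADV"} for e in edges)),
        "first_vob_or_sbv": int(bool(edges) and edges[0] in {"VOB", "SBV"}),
        "last_vob_or_sbv": int(bool(edges) and edges[-1] in {"VOB", "SBV"}),
    }


# Toy parse of "mother has lung cancer": "has" is the root.
heads = [1, -1, 3, 1]
labels = ["SBV", "ROOT", "ATT", "VOB"]
mother_cancer = dependency_features(0, 3, heads, labels)
lung_cancer = dependency_features(2, 3, heads, labels)
```

For the pair ("mother", "cancer") the path runs through the verb, giving one subject-verb edge followed by one verb-object edge.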

In an exemplary embodiment, the sentence morphology feature may include at least one of the following features: whether the first medical data and the second medical data are in the same paragraph; whether the first medical data and the second medical data are in the same sentence; whether the first medical data and the second medical data are in the same clause; whether the first medical data and the second medical data are in the same paragraph with no other medical data of the same category as the first medical data or the second medical data between them; whether the first medical data and the second medical data are in the same sentence with no other medical data of the same category as the first medical data or the second medical data between them; whether the first medical data and the second medical data are in the same clause with no other medical data of the same category as the first medical data or the second medical data between them.
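One simple way to approximate these features is to check which delimiter characters occur between the two entity positions. This is a sketch only: the delimiter sets chosen for paragraph, sentence, and clause boundaries and the function names are assumptions.

```python
def same_unit(text, a_pos, b_pos, delimiters):
    """True if no delimiter character lies between the two character offsets."""
    left, right = sorted((a_pos, b_pos))
    return not any(ch in delimiters for ch in text[left:right])


def sentence_features(text, a_pos, b_pos):
    return {
        "same_paragraph": int(same_unit(text, a_pos, b_pos, "\n")),
        "same_sentence": int(same_unit(text, a_pos, b_pos, "\n.。!?！？")),
        "same_clause": int(same_unit(text, a_pos, b_pos, "\n.。!?！？,，;；")),
    }


def no_same_class_between(entities, a, b):
    """entities is a list of (position, category); a and b index into it."""
    (pa, ca), (pb, cb) = entities[a], entities[b]
    left, right = sorted((pa, pb))
    return not any(left < p < right and c in (ca, cb) for p, c in entities)


text = "father has diabetes, mother is healthy."
near = sentence_features(text, text.index("father"), text.index("diabetes"))
far = sentence_features(text, text.index("mother"), text.index("diabetes"))
```

The combined features in the list are then the logical AND of a `same_*` flag and `no_same_class_between`.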

In step S130, the feature vectors are input to the trained classification model, and a target relationship between the first medical data and the second medical data is determined.
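Step S130 only requires some trained classifier. As a stand-in, the sketch below trains a plain logistic regression by batch gradient descent in pure Python; the choice of logistic regression, the two toy features (`same_clause`, `negation_between`), the toy labels, and the hyperparameters are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=1000):
    """Batch gradient descent on the mean logistic loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(model, x):
    """Return 1 if the target relationship is predicted to hold, else 0."""
    w, b = model
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)


# Toy feature vectors: [same_clause, negation_between].
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 0, 0, 0]  # relation holds only in the same clause without negation
model = train_logistic(X, y)
```

In practice the feature vectors would be much longer, and any classifier that accepts fixed-length numeric vectors could be substituted.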

In an exemplary embodiment, the target relationship may include any one of a negative word to medical data relationship, a time to medical data relationship, a numerical value to medical data relationship, an anatomical region to medical data relationship, an action to medical data relationship, a relative to medical data relationship, and the like.

In an exemplary embodiment, the category system of medical data relationships is abstracted in advance. Starting from clinical data and medical needs, the relationship between two medical data can be abstracted into the following categories, as shown in Table 1:

TABLE 1 medical data Category System

It should be noted that the types of relationships between medical data are not limited to those listed in Table 1, and the category system may be divided from other perspectives. The basic requirements are that each category has an explicit semantic type and that the categories cover most medical data relationships; for example, one medical data is fixed and the second medical data is further grouped. As a specific example, in the negative relationship the negative word is fixed as A and the type of B is arbitrary.

Here, an explicit semantic type means that the abstracted relationship type, such as a negative relationship, a time relationship, a numerical relationship, or an action relationship, has a clear semantic meaning.

According to the medical data relationship mining method in the present exemplary embodiment, first medical data and second medical data in a target text are acquired; feature extraction is performed on the first medical data and the second medical data to obtain their feature vectors; and the feature vectors are input into a trained classification model to determine the target relationship between the first medical data and the second medical data. The relationship between medical data in clinical case text can thereby be identified efficiently, which improves the efficiency of medical data relationship mining and facilitates further statistical analysis of the data.
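Fixing the type of A while leaving B open, as in the negative relationship, gives a direct way to enumerate candidate pairs for one relationship category. A minimal sketch, with the entity representation, category labels, and function name assumed for illustration:

```python
def candidate_pairs(entities, fixed_category):
    """Pair every entity of `fixed_category` (as A) with every other entity (as B)."""
    return [
        (a, b)
        for a in entities
        for b in entities
        if a is not b and a[1] == fixed_category
    ]


# Each entity is (surface form, category); the labels are illustrative.
entities = [("no", "negation"), ("fever", "symptom"), ("cough", "symptom")]
pairs = candidate_pairs(entities, "negation")
```

Each resulting pair is one classification instance for the negative relationship.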

As shown in FIG. 2, in the embodiment of the present invention, the designed classification model feature set may include the intrinsic features of A and B, peripheral text features, syntactic dependency analysis features, and sentence morphology features.

The intrinsic features of A and B may in turn comprise the intrinsic features of A and the intrinsic features of B.

The peripheral text features may further include the left text features of A (also referred to as the preceding information text features of A), the right text features of B (also referred to as the following information text features of B), and the text features between A and B.

For example, the feature set may contain the following information (here, taking two medical data as an example, the first medical data is denoted by A, and the second medical data is denoted by B):

TABLE 2 feature set

It should be noted that the specific feature values in Table 2 may be calculated in various manners; for example, the second medical data B may be searched for within other distances from the first medical data A, i.e., the window is not limited to the 10 words in the table. In a specific medical task, the form of the data in the table may be adjusted as needed and the specific numbers optimized, which is not limited by the present invention.
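Whatever the individual feature definitions, the groups must be flattened into one vector with a stable ordering before they reach the classifier. The sketch below shows one way to do this; the group prefixes, feature names, and the `build_feature_vector` helper are hypothetical.

```python
def build_feature_vector(groups, order):
    """Flatten named feature groups into one list, following a fixed key order."""
    merged = {}
    for prefix, feats in groups.items():
        for name, value in feats.items():
            merged[f"{prefix}.{name}"] = value
    # Features missing from a sample default to 0 so the length stays fixed.
    return [float(merged.get(key, 0)) for key in order]


ORDER = [
    "A.is_diagnosis", "B.is_diagnosis",
    "between.distance", "between.n_commas",
    "dep.path_length", "sent.same_clause",
]
groups = {
    "A": {"is_diagnosis": 0},
    "B": {"is_diagnosis": 1},
    "between": {"distance": 2, "n_commas": 0},
    "dep": {"path_length": 2},
    "sent": {"same_clause": 1},
}
vector = build_feature_vector(groups, ORDER)
```

Keeping the order in one place makes it straightforward to change window sizes or add features without breaking a trained model's input layout.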

Dependency grammar reveals syntactic structure by analyzing the dependency relations among the components of a language unit. It holds that the core verb of a sentence is the central component that governs the other components and is itself governed by no other component, and that every governed component depends on its governor through some dependency relation. Governing and being governed, and depending and being depended on, occur among components at every level of Chinese language units that can be used independently: words (compound words), phrases, simple sentences, and complex sentences and sentence groups. This property is the universality of dependency relations. Dependency parsing can reflect the semantic modification relations among sentence components and can capture long-distance collocation information, independently of the physical positions of the components in the sentence.

The dependency parsing labels referred to in Table 2 above and their meanings are given in Table 3:

Relationship type	Tag	Description
Subject-predicate relationship	SBV	subject-verb
Verb-object relationship	VOB	direct object, verb-object
Attribute-head relationship	ATT	attribute
Adverbial-head structure	ADV	adverbial

TABLE 3 syntactic dependencies

It should be noted that, in the prior art, the target relationship is extracted by directly matching the grammatical structure obtained from dependency analysis against grammatical structure templates. In the embodiment of the present invention, by contrast, the key grammatical structures are used as features of the classification model, which learns automatically in a data-driven manner.
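The idea of using dependency structure as classifier features rather than as extraction templates can be illustrated with a minimal sketch. The arc format `(head_index, dependent_index, label)`, the function name `dependency_features`, and the choice of one bit per label are assumptions made for illustration only; the specific syntactic features of Table 2 are not reproduced here.

```python
# Dependency labels from Table 3; the ordering fixes each label's bit position.
DEP_LABELS = ["SBV", "VOB", "ATT", "ADV"]


def dependency_features(arcs, a_index, b_index):
    """Return one bit per label: 1 if an arc with that label directly links A and B.

    arcs: iterable of (head_index, dependent_index, label) tuples, assumed to come
    from an external dependency parser (hypothetical input format).
    """
    pair = {a_index, b_index}
    linked = {label for head, dep, label in arcs if {head, dep} == pair}
    return [1 if label in linked else 0 for label in DEP_LABELS]


# Toy parse of a three-token sentence in which token 1 is the core verb.
arcs = [(1, 0, "SBV"), (1, 2, "VOB")]
print(dependency_features(arcs, a_index=0, b_index=1))  # A is the subject of B
```

The resulting bits would simply be appended to the feature vector, leaving it to the trained model to decide how much weight each grammatical structure deserves.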

In the embodiment of the invention, a method that is both effective and general is provided for mining medical data relationships from long text in the clinical case structuring task. If the classification model is a binary classification model, the basic idea is to abstract the medical data relationship into a binary classification problem.

The following description will take the classification model as a binary classification model as an example. As shown in fig. 3, the medical data relationship mining method provided by the embodiment of the invention may include the following steps.

In step S310, a target relationship is determined based on the target medical task.

In the embodiment of the invention, the target relationship is known, and for a specific medical task, task disassembly is carried out according to the task itself to obtain the target relationship.

In step S320, first training medical data and second training medical data having the target relationship in the training corpus are obtained.

In step S330, the first training medical data and the second training medical data in the training corpus are labeled.

For example, "father is diabetic, mother is healthy" can be labeled as:

"father" and "diabetes": 1

"mother" and "diabetes": 0

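The labeling of steps S320 and S330 can be sketched as pairing every candidate first entity with every candidate second entity and assigning 1 only to the pairs annotated as having the target relationship. The function name `label_pairs` and the form of the annotation set are hypothetical.

```python
from itertools import product


def label_pairs(first_entities, second_entities, positive_pairs):
    """Label every (A, B) candidate pair: 1 if annotated as related, else 0."""
    return [
        (a, b, 1 if (a, b) in positive_pairs else 0)
        for a, b in product(first_entities, second_entities)
    ]


# "Father is diabetic, mother is healthy": only (father, diabetes) holds.
samples = label_pairs(
    first_entities=["father", "mother"],
    second_entities=["diabetes"],
    positive_pairs={("father", "diabetes")},
)
for a, b, y in samples:
    print(a, b, y)
```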
In step S340, the features of the first training medical data and the second training medical data are extracted to obtain feature vectors of the first training medical data and the second training medical data.

In the embodiment of the present invention, feature extraction may be performed according to the feature set listed in Table 2. For example, if a condition is satisfied, the corresponding bit is set to 1; otherwise, it is set to 0. Taking the intrinsic features of A and B as an example: if A is a diagnosis, the first bit of the feature vector is 1, and otherwise 0; if A is an anatomical part, the second bit of the feature vector is 1, and otherwise 0; and so on. The feature values of all dimensions are tiled, with each specific feature value placed at a fixed position in the vector, to form the feature vector.

It should be noted that the values of each bit of the feature vector may be set as required, and are not limited to "1" and "0" described above.
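The tiling of feature values into fixed positions can be sketched as follows. The feature names in `FEATURE_ORDER` and the helper `to_feature_vector` are illustrative assumptions; only the first two positions (diagnosis, anatomical part) follow the order given in the text.

```python
# Fixed feature order: each name owns one position in the vector (names are illustrative).
FEATURE_ORDER = ["a_is_diagnosis", "a_is_anatomical_site", "a_is_symptom"]


def to_feature_vector(feature_values, order=FEATURE_ORDER):
    """Tile feature values into a flat list; features not supplied default to 0."""
    return [int(feature_values.get(name, 0)) for name in order]


vector = to_feature_vector({"a_is_diagnosis": True, "a_is_symptom": False})
print(vector)
```

Because every feature keeps the same position across samples, vectors extracted from the training corpus and from new target text remain directly comparable.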

In step S350, a binary classification model is trained using the feature vectors of the first and second training medical data.

In step S360, first medical data and second medical data in the target text are acquired.

In step S370, feature extraction is performed on the first medical data and the second medical data, and feature vectors of the first medical data and the second medical data are obtained.

In step S380, the feature vectors of the first medical data and the second medical data are input to the trained binary classification model, and whether the target relationship between the first medical data and the second medical data is established is determined.

The following description again takes the classification model as a binary classification model as an example. As shown in fig. 4, the medical data relationship mining method provided by the embodiment of the invention may include the following steps.

In step S410, the medical data relational category system is abstracted.

In the embodiment of the invention, a category system is predefined, and the target classification is determined according to the specific medical task. The training corpus can then be labeled according to the target.

In step S420, a classification model feature set is designed.

In an embodiment of the present invention, the feature set may include at least one of a feature of the medical data itself, a feature of the peripheral text, a syntactic dependency feature, a feature of the sentence morphology, and the like.

In step S430, a binary classification model is trained.

In the embodiment of the present invention, based on the medical data relationship type defined in step S410, a training corpus is labeled for the target relationship; then, according to the features defined in step S420, features are extracted from the training corpus so that the long text is vectorized; and then the classification model is trained on the vectorized training corpus.

In the embodiment of the invention, any one of a decision tree model, a naive Bayes model, a support vector machine, a deep learning model, and the like can be used.
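As one illustration of training and applying such a model, the following is a minimal Bernoulli naive Bayes sketch over 0/1 feature vectors, written from scratch so that it needs no external library. The class name, the Laplace smoothing, and the toy training data are assumptions for demonstration; the text does not prescribe any particular implementation.

```python
import math


class BernoulliNaiveBayes:
    """Tiny Bernoulli naive Bayes for 0/1 feature vectors, with Laplace smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        n_features = len(vectors[0])
        counts = {c: 0 for c in self.classes}
        ones = {c: [0] * n_features for c in self.classes}
        for x, y in zip(vectors, labels):
            counts[y] += 1
            for i, bit in enumerate(x):
                ones[y][i] += bit
        total = len(labels)
        self.log_prior = {c: math.log(counts[c] / total) for c in self.classes}
        # P(feature_i = 1 | class), smoothed so no probability is exactly 0 or 1.
        self.p_one = {
            c: [(ones[c][i] + 1) / (counts[c] + 2) for i in range(n_features)]
            for c in self.classes
        }
        return self

    def predict(self, x):
        def score(c):
            return self.log_prior[c] + sum(
                math.log(p if bit else 1 - p) for bit, p in zip(x, self.p_one[c])
            )
        return max(self.classes, key=score)


# Toy vectorized corpus: label 1 means the target relationship holds.
train_x = [[1, 0], [1, 1], [0, 1], [0, 0]]
train_y = [1, 1, 0, 0]
model = BernoulliNaiveBayes().fit(train_x, train_y)
print(model.predict([1, 0]), model.predict([0, 1]))
```

In this toy data only the first feature separates the classes, so the model learns to rely on it; a real corpus would supply the full feature set described above.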

In step S440, the medical data relationships are classified.

In the embodiment of the present invention, feature extraction may be performed on the target text, which is new clinical data, according to the features defined in step S420 to form a vectorized representation, and the vectorized representation is input to the classification model trained in step S430, so as to perform classification to determine whether the target relationship is established.

In other embodiments, the problem itself may also be abstracted into multiple classes, with the classification model directly outputting the concrete relationship given the two medical data.

The following description will take the classification model as a multi-classification model as an example. As shown in fig. 5, the medical data relationship mining method provided by the embodiment of the invention may include the following steps.

In step S510, first training medical data and second training medical data in a training corpus are acquired.

In step S520, labeling the first training medical data and the second training medical data in the training corpus, where the labeled content includes a target relationship between the first training medical data and the second training medical data.

In the embodiment of the invention, because a multi-classification mode is adopted, the feature vector of the long case text is input directly, and the multi-classification model can directly output the target relationship between A and B. The labeling mode of the corpus therefore needs to change: multi-classification labeling must record the specific relationship type, rather than a yes or no for one specific relationship type.
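The difference between the two labeling modes can be sketched by deriving yes/no labels for one relationship from multi-class annotations. The relationship names (`"relative"`, `"negation"`, `"time"`), the sample entities, and the helper `to_binary_labels` are illustrative assumptions.

```python
def to_binary_labels(multiclass_samples, target_relation):
    """Convert (A, B, relation) annotations to yes/no labels for one relation."""
    return [
        (a, b, 1 if relation == target_relation else 0)
        for a, b, relation in multiclass_samples
    ]


# Multi-class annotation records the specific relationship type of each pair.
multiclass_samples = [
    ("father", "diabetes", "relative"),
    ("no", "fever", "negation"),
    ("3 days", "cough", "time"),
]
print(to_binary_labels(multiclass_samples, "time"))
```

A multi-class corpus can thus be projected onto any single binary task, whereas binary labels for one relationship cannot be turned back into multi-class labels without re-annotation.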

In step S530, the features of the first training medical data and the second training medical data are extracted, and feature vectors of the first training medical data and the second training medical data are obtained.

In step S540, a multi-classification model is trained using the feature vectors of the first and second training medical data.

In step S550, first medical data and second medical data in the target text are acquired.

In step S560, feature extraction is performed on the first medical data and the second medical data, and feature vectors of the first medical data and the second medical data are obtained.

In step S570, the feature vectors of the first medical data and the second medical data are input to the trained multi-classification model, and a target relationship between the first medical data and the second medical data is output.

According to the medical data relation mining method provided by the embodiment of the invention, on one hand, by designing a general medical data relationship system, a trained model can be reused for structuring new clinical data, which improves both the effect and the efficiency of medical data relation mining; the value of historical data accumulates, and as the amount of labeled data increases, relationship recognition becomes progressively better. On the other hand, the type abstraction in the embodiment of the invention is general and the model effect is extensible: a separate model does not need to be trained for each relationship, which greatly reduces the labeling work. Therefore, the coverage and rule-conflict problems of traditional rule-based methods on clinical case data can be solved, as can the problem of low structuring accuracy of techniques based on syntactic dependency analysis.

It is noted that although the steps of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Fig. 6 shows a block diagram of a medical data relationship mining apparatus 600 according to another exemplary embodiment of the present invention.

As shown in fig. 6, the medical data relationship mining apparatus 600 includes: a medical data acquisition module 610, a feature extraction module 620, and a target relationship determination module 630. Wherein:

the medical data acquisition module 610 may be configured to acquire the first medical data and the second medical data in the target text.

The feature extraction module 620 may be configured to perform feature extraction on the first medical data and the second medical data to obtain feature vectors of the first medical data and the second medical data.

The target relationship determination module 630 may be configured to input the feature vectors to a trained classification model, and determine a target relationship between the first medical data and the second medical data.
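The division into three modules can be sketched as a simple composition of three stages. The class name `RelationMiningApparatus`, the callable signatures, and the stand-in stages below are all hypothetical; real stages would wrap entity recognition, the feature set, and a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RelationMiningApparatus:
    """Composes three stages mirroring modules 610, 620 and 630."""
    acquire: Callable[[str], Tuple[str, str]]       # module 610
    extract: Callable[[str, str, str], List[int]]   # module 620
    classify: Callable[[List[int]], int]            # module 630

    def run(self, text: str) -> int:
        first, second = self.acquire(text)
        vector = self.extract(text, first, second)
        return self.classify(vector)


# Stand-in stages for demonstration only.
apparatus = RelationMiningApparatus(
    acquire=lambda text: ("father", "diabetes"),
    extract=lambda text, a, b: [int(a in text), int(b in text)],
    classify=lambda vector: int(all(vector)),
)
print(apparatus.run("father has diabetes"))
```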

In an exemplary embodiment, the feature extraction module 620 may further include a feature extraction unit, and the feature extraction unit may be configured to acquire at least one of a native feature of the first medical data, a native feature of the second medical data, a peripheral text feature, a syntactic dependency analysis feature, a sentence morphology feature, and the like of the first medical data and the second medical data.

In an exemplary embodiment, the intrinsic feature of the first medical data comprises at least one of the following features: whether the first medical data is a diagnosis; whether the first medical data is an anatomical site; whether the first medical data is a symptom; whether the first medical data is a pathological word; whether the first medical data is a negative word; whether the first medical data comprises a verb; whether the first medical data contains a number; whether the length of the first medical data is greater than a preset number of bytes; whether the first medical data contains a time word.
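A few of these intrinsic features can be sketched as follows. The lexicon `NEGATION_WORDS`, the regular expression `TIME_PATTERN`, the threshold `MAX_BYTES`, and the entity type strings are illustrative assumptions; a real system would rely on curated medical dictionaries and on the entity types produced by entity recognition.

```python
import re

# Illustrative lexicon, pattern and threshold (assumptions, not from the source).
NEGATION_WORDS = {"no", "not", "denies"}
TIME_PATTERN = re.compile(r"\b\d+\s*(day|week|month|year)s?\b")
MAX_BYTES = 12


def intrinsic_features(term, entity_type):
    """Return a few 0/1 intrinsic features of one entity, in a fixed order."""
    return [
        int(entity_type == "diagnosis"),
        int(entity_type == "anatomical_site"),
        int(entity_type == "symptom"),
        int(term.lower() in NEGATION_WORDS),
        int(any(ch.isdigit() for ch in term)),
        int(len(term.encode("utf-8")) > MAX_BYTES),
        int(TIME_PATTERN.search(term.lower()) is not None),
    ]


print(intrinsic_features("diabetes", "diagnosis"))
print(intrinsic_features("3 days", "time"))
```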

In an exemplary embodiment, the peripheral text feature includes at least one of a preceding information text feature of the first medical data, a following information text feature of the second medical data, and a text feature between the first medical data and the second medical data.

In an exemplary embodiment, the preceding information text feature of the first medical data comprises at least one of the following features, each indicating whether a given item exists within a preset number of words preceding the first medical data: a period; a comma; a space or an enumeration comma (、); a negative word; a backward-acting negative word; a "companion"; an "even"; an omitted word; a verb representing a behavior; a diagnosis; an anatomical part; a symptom; a pathological word; a continuous concept punctuation segmentation pattern; a time expression; a number; a verb.
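A window-based check of this kind can be sketched over a token list. The default window of 10 follows the "10 words" mentioned in connection with Table 2; the token representation, the lexicon `NEGATION_WORDS`, and the feature names are assumptions.

```python
NEGATION_WORDS = {"no", "not", "without"}  # illustrative lexicon


def preceding_features(tokens, a_start, window=10):
    """0/1 features over up to `window` tokens immediately before entity A."""
    context = tokens[max(0, a_start - window):a_start]
    return {
        "period_before": int("." in context),
        "comma_before": int("," in context),
        "negation_before": int(any(t.lower() in NEGATION_WORDS for t in context)),
        "number_before": int(any(t.isdigit() for t in context)),
    }


tokens = ["Patient", "reports", "no", "fever", ",", "cough", "for", "3", "days", "."]
print(preceding_features(tokens, a_start=3))  # entity A is "fever"
```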

In an exemplary embodiment, the text features between the first medical data and the second medical data comprise at least one of the following features: a distance between the first medical data and the second medical data; an order between the first medical data and the second medical data; the number of periods between the first medical data and the second medical data; the number of commas between the first medical data and the second medical data; the number of spaces or enumeration commas (、) between the first medical data and the second medical data; and whether a given item exists between the first medical data and the second medical data, the item being any of: a "tie"; an "even"; a verb representing a behavior; a backward-only negative word; an escape word; a negative word; a diagnosis; an anatomical region; a symptom; a lesion word; a pattern of continuous conceptual punctuation segmentation; a number; a time expression; a verb.
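The distance, order, and punctuation counts can be sketched over a token list as follows. Measuring distance in tokens and the feature names are assumptions made for illustration.

```python
def between_features(tokens, a_index, b_index):
    """Features computed over the tokens strictly between entities A and B."""
    left, right = sorted((a_index, b_index))
    between = tokens[left + 1:right]
    return {
        "distance": len(between),
        "a_before_b": int(a_index < b_index),
        "periods": between.count("."),
        "commas": between.count(","),
    }


tokens = ["father", "has", "diabetes", ",", "mother", "is", "healthy", "."]
print(between_features(tokens, a_index=0, b_index=2))  # father ... diabetes
```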

In an exemplary embodiment, the sentence morphology features include at least one of the following features: whether the first medical data and the second medical data are in one paragraph; whether the first medical data and the second medical data are in one sentence; whether the first medical data and the second medical data are in one clause; whether the first medical data and the second medical data are in one paragraph with no other medical data of the same class as the first medical data or the second medical data between them; whether the first medical data and the second medical data are in one sentence with no other medical data of the same class as the first medical data or the second medical data between them; whether the first medical data and the second medical data are in one clause with no other medical data of the same class as the first medical data or the second medical data between them.
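The first three of these features can be sketched by checking which delimiters occur between the two entities. The delimiter sets and the use of character offsets are simplifying assumptions; for instance, a decimal point would be mistaken for a sentence boundary in this sketch.

```python
def morphology_features(text, a_pos, b_pos):
    """0/1 features: do A and B share a paragraph, a sentence, a clause?

    a_pos and b_pos are character offsets of the two entities; the delimiter
    sets below are simplifying assumptions.
    """
    left, right = sorted((a_pos, b_pos))
    span = text[left:right]
    same_paragraph = "\n" not in span
    same_sentence = same_paragraph and not any(ch in span for ch in ".!?。")
    same_clause = same_sentence and not any(ch in span for ch in ",;，；")
    return [int(same_paragraph), int(same_sentence), int(same_clause)]


text = "Father has diabetes, mother is healthy. No fever."
father = text.index("Father")
print(morphology_features(text, father, text.index("diabetes")))
print(morphology_features(text, father, text.index("mother")))
print(morphology_features(text, father, text.index("fever")))
```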

Since each functional module of the medical data relationship mining apparatus 600 according to the exemplary embodiment of the present invention corresponds to the step of the above-described exemplary embodiment of the medical data relationship mining method, it is not described herein again.

It should be noted that although several modules or units of the medical data relationship mining apparatus are mentioned in the above detailed description, such partitioning is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the invention. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Fig. 7 shows a schematic diagram of an exemplary system architecture 100 of a medical data relationship mining method or medical data relationship mining apparatus to which an embodiment of the invention may be applied.

As shown in fig. 7, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like.

The server 105 may be a server that provides various services. For example, the user sends a request to the server 105 using the terminal device 103 (which may also be the terminal device 101 or 102). The server 105 may retrieve the matched search result from the database based on the relevant information carried in the request, and feed the search result back to the terminal device 103, so that the user may view based on the content displayed on the terminal device 103.

It should be noted that the electronic device 200 shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present invention.

As shown in fig. 8, the electronic apparatus 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.

The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 210 as necessary, so that a computer program read out therefrom is installed into the storage section 208 as necessary.

In particular, according to an embodiment of the present invention, the processes described below with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by a Central Processing Unit (CPU)201, performs various functions defined in the methods and/or apparatus of the present application.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method as described in the embodiments below. For example, the electronic device may implement the steps shown in fig. 1.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiment of the present invention.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A medical data relationship mining method, comprising:

acquiring first medical data and second medical data in a target text, wherein the first medical data and the second medical data are extracted from the target text through an entity recognition algorithm;

acquiring the self feature of the first medical data, the self feature of the second medical data, the peripheral text feature, the syntactic dependency analysis feature and the sentence morphological feature of the first medical data and the second medical data, tiling all dimension feature values, putting specific feature values in fixed positions in feature vectors to form feature vectors of the first medical data and the second medical data, wherein the feature vectors are key grammatical structures; wherein the peripheral text features comprise anterior information text features of the first medical data, posterior information text features of the second medical data, and text features between the first medical data and the second medical data;

inputting the feature vectors into a trained classification model, and judging a target relation between the first medical data and the second medical data; the target relation is required to have a definite semantic type and can cover most of medical data relations, the definite semantic type comprises a negative relation, a time relation, a numerical value relation and an action relation, and the target relation comprises a negative word and medical data relation, a time and medical data relation, a numerical value and medical data relation, an anatomical part and medical data relation, an action and medical data relation and a relative and medical data relation;

if the trained classification model is a trained two-classification model, the trained two-classification model is trained in any one of the following ways:

performing task disassembly on the target medical task, and determining a target relation;

acquiring first training medical data and second training medical data with the target relation in a training corpus, and labeling the first training medical data and the second training medical data, wherein the labeling mode is to label yes or no of a specific relation type;

extracting features of the first training medical data and the second training medical data to obtain feature vectors of the first training medical data and the second training medical data;

training a classification model using the feature vectors of the first and second training medical data; or

Abstracting a medical data relation classification system, and designing a characteristic set of a two-classification model to train the two-classification model;

if the trained classification model is a trained multi-classification model, the trained multi-classification model is trained in the following way:

acquiring first training medical data and second training medical data in a training corpus, and labeling the first training medical data and the second training medical data, wherein the labeled content comprises a target relationship between the first training medical data and the second training medical data;

training a multi-classification model using the feature vectors of the first and second training medical data.

2. The medical data relationship mining method according to claim 1, wherein the intrinsic feature of the first medical data includes at least one of:

whether the first medical data is a diagnosis;

whether the first medical data is an anatomical site;

whether the first medical data is a symptom;

whether the first medical data is a pathological word;

whether the first medical data is a negative word;

whether the first medical data comprises a verb;

whether the first medical data contains a number;

whether the length of the first medical data is greater than a preset number of bytes;

whether the first medical data contains a time word.

3. The medical data relationship mining method according to claim 1, wherein the previous information text feature of the first medical data includes at least one of the following features:

whether a period exists within a preset number of words preceding the first medical data;

whether a comma exists within the preset number of words preceding the first medical data;

whether a space or an enumeration comma exists within the preset number of words preceding the first medical data;

whether a negative word exists within the preset number of words preceding the first medical data;

whether a backward-acting negative word exists within the preset number of words preceding the first medical data;

whether the word "companion" exists within the preset number of words preceding the first medical data;

whether the word "even" exists within the preset number of words preceding the first medical data;

whether an omitted word exists within the preset number of words preceding the first medical data;

whether a verb representing a behavior exists within the preset number of words preceding the first medical data;

whether a diagnosis exists within the preset number of words preceding the first medical data;

whether an anatomical part exists within the preset number of words preceding the first medical data;

whether a symptom exists within the preset number of words preceding the first medical data;

whether a pathological word exists within the preset number of words preceding the first medical data;

whether a pattern of consecutive concepts separated by punctuation exists within the preset number of words preceding the first medical data;

whether a time word exists within the preset number of words preceding the first medical data;

whether a number exists within the preset number of words preceding the first medical data;

whether a verb exists within the preset number of words preceding the first medical data.
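
By way of non-limiting illustration, a few of the preceding-context features above might be computed over a fixed window of tokens as sketched below. The token-based window, the `window` size, the `NEGATIVE_WORDS` lexicon and the feature names are assumptions made for the example. The space-or-pause-mark feature is read here as matching a space or the enumeration comma "、", which is also an assumption.

```python
# Hypothetical lexicon; a real system would use a curated list of negative words.
NEGATIVE_WORDS = {"no", "not", "without", "denies"}


def preceding_features(tokens: list[str], start: int, window: int = 5) -> dict[str, int]:
    """Binary features over the `window` tokens just before index `start`."""
    ctx = tokens[max(0, start - window):start]
    return {
        "has_period": int(any(t in {".", "。"} for t in ctx)),
        "has_comma": int(any(t in {",", "，"} for t in ctx)),
        "has_space_or_enum_comma": int(any(t in {" ", "、"} for t in ctx)),
        "has_negative": int(any(t.lower() in NEGATIVE_WORDS for t in ctx)),
        "has_number": int(any(any(c.isdigit() for c in t) for t in ctx)),
    }
```

The remaining features of the list (diagnosis, anatomical part, symptom and so on) would follow the same pattern, testing the labels of recognized entities in the window rather than raw tokens.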

4. The medical data relationship mining method according to claim 1, wherein the textual features between the first medical data and the second medical data include at least one of:

a distance between the first medical data and the second medical data;

an order between the first medical data and the second medical data;

a number of periods between the first medical data and the second medical data;

a number of commas between the first medical data and the second medical data;

a number of spaces or enumeration commas between the first medical data and the second medical data;

whether there is a "tie" between the first medical data and the second medical data;

whether there is an "even" between the first medical data and the second medical data;

whether there is a verb representing a behavior between the first medical data and the second medical data;

whether there is a backward-acting negative word between the first medical data and the second medical data;

whether there is an escape word between the first medical data and the second medical data;

whether there is a negative word between the first medical data and the second medical data;

whether there is a diagnosis between the first medical data and the second medical data;

whether there is an anatomical region between the first medical data and the second medical data;

whether there is a symptom between the first medical data and the second medical data;

whether there is a pathological word between the first medical data and the second medical data;

whether there is a pattern of consecutive concepts separated by punctuation between the first medical data and the second medical data;

whether there is a number between the first medical data and the second medical data;

whether there is a time word between the first medical data and the second medical data;

whether there is a verb between the first medical data and the second medical data.
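
By way of non-limiting illustration, the distance, order and punctuation-count features between the two pieces of medical data might be computed as sketched below. Representing each piece of medical data as a `(start, end)` token span with an exclusive end, and measuring distance in tokens, are assumptions made for the example.

```python
def between_features(tokens: list[str],
                     span_a: tuple[int, int],
                     span_b: tuple[int, int]) -> dict[str, int]:
    """Features over the tokens lying between two non-overlapping spans.

    span_a is the first medical data and span_b the second; each span is
    (start, end) in token indices with `end` exclusive.
    """
    first, second = sorted([span_a, span_b])
    gap = tokens[first[1]:second[0]]
    return {
        "distance": len(gap),                      # tokens separating the two spans
        "a_before_b": int(span_a[0] < span_b[0]),  # order of appearance
        "n_periods": sum(t in {".", "。"} for t in gap),
        "n_commas": sum(t in {",", "，"} for t in gap),
        "n_space_or_enum_comma": sum(t in {" ", "、"} for t in gap),
        "has_number": int(any(any(c.isdigit() for c in t) for t in gap)),
    }
```

Sorting the spans first lets the same slice serve both orders, while `a_before_b` preserves the order information as its own feature.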

5. The medical data relationship mining method according to claim 1, wherein the sentence morphological feature comprises at least one of the following features:

whether the first medical data and the second medical data are in the same paragraph;

whether the first medical data and the second medical data are in the same sentence;

whether the first medical data and the second medical data are in the same clause;

whether the first medical data and the second medical data are in the same paragraph with no other medical data of the same class as the first medical data or the second medical data between them;

whether the first medical data and the second medical data are in the same sentence with no other medical data of the same class as the first medical data or the second medical data between them;

whether the first medical data and the second medical data are in the same clause with no other medical data of the same class as the first medical data or the second medical data between them.
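
By way of non-limiting illustration, the six features of this claim might be computed from character offsets as sketched below. The delimiter sets that mark paragraph, sentence and clause boundaries, and the `(offset, label)` representation of medical data, are assumptions made for the example. A period inside a decimal number would be miscounted as a sentence boundary by this simple test.

```python
PARAGRAPH_BREAKS = "\n"
SENTENCE_BREAKS = PARAGRAPH_BREAKS + ".。!?！？"
CLAUSE_BREAKS = SENTENCE_BREAKS + ",，;；"


def same_unit(text: str, lo: int, hi: int, delimiters: str) -> int:
    """1 if no delimiter character occurs in text[lo:hi], else 0."""
    return int(not any(ch in delimiters for ch in text[lo:hi]))


def sentence_shape_features(text: str,
                            ent_a: tuple[int, str],
                            ent_b: tuple[int, str],
                            others: list[tuple[int, str]]) -> list[int]:
    """ent_a, ent_b and each item of `others` are (start_offset, label)."""
    lo, hi = sorted((ent_a[0], ent_b[0]))
    labels = {ent_a[1], ent_b[1]}
    # 1 when no other entity of either class lies strictly between the pair.
    clear = int(not any(lo < pos < hi and lab in labels for pos, lab in others))
    in_par = same_unit(text, lo, hi, PARAGRAPH_BREAKS)
    in_sent = same_unit(text, lo, hi, SENTENCE_BREAKS)
    in_clause = same_unit(text, lo, hi, CLAUSE_BREAKS)
    return [in_par, in_sent, in_clause,
            in_par & clear, in_sent & clear, in_clause & clear]
```

The last three values combine each containment test with the "no same-class medical data in between" condition, mirroring the last three items of the list.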

6. A medical data relationship mining apparatus, comprising:

a medical data acquisition module configured to acquire first medical data and second medical data in a target text, wherein the first medical data and the second medical data are extracted from the target text through an entity recognition algorithm;

a feature extraction module configured to obtain a feature of the first medical data, a feature of the second medical data, a peripheral text feature of the first medical data and the second medical data, a syntactic dependency analysis feature, and a sentence morphology feature, to tile the feature values of these dimensions so that each specific feature value is placed at a fixed position in a feature vector, and thereby to form the feature vector of the first medical data and the second medical data, where the feature vector is a key grammatical structure; wherein the peripheral text features comprise anterior information text features of the first medical data, posterior information text features of the second medical data, and text features between the first medical data and the second medical data;

a target relation determination module configured to input the feature vector to a trained classification model and determine a target relation between the first medical data and the second medical data; the target relation is required to have a definite semantic type and to cover most medical data relations, the definite semantic type comprises a negative relation, a time relation, a numerical value relation and an action relation, and the target relation comprises a negative word and medical data relation, a time and medical data relation, a numerical value and medical data relation, an anatomical part and medical data relation, an action and medical data relation and a relative and medical data relation.
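
By way of non-limiting illustration, tiling feature values so that each one occupies a fixed position in the feature vector might be sketched as below. The `FEATURE_ORDER` schema, the group and feature names, and the choice of 0.0 for absent features are assumptions made for the example. The trained classification model itself is not shown.

```python
# Hypothetical schema: the position of each name is the position in the vector.
FEATURE_ORDER = [
    "first.is_symptom",
    "second.is_symptom",
    "between.distance",
    "between.n_commas",
    "shape.same_sentence",
]


def tile(groups: dict[str, dict[str, float]]) -> list[float]:
    """Flatten grouped feature values into one vector with fixed positions."""
    flat = {f"{group}.{name}": value
            for group, feats in groups.items()
            for name, value in feats.items()}
    unknown = set(flat) - set(FEATURE_ORDER)
    if unknown:
        raise KeyError(f"features missing from schema: {sorted(unknown)}")
    # Absent features default to 0.0 so the vector length never changes.
    return [float(flat.get(name, 0.0)) for name in FEATURE_ORDER]
```

Because the order comes from the schema rather than from the input dictionaries, the same feature always lands at the same index, which is what a classifier trained on such vectors would rely on.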

7. A computer-readable medium, on which a computer program is stored which, when executed by a processor, carries out the medical data relationship mining method according to any one of claims 1 to 5.

8. An electronic device, comprising:

one or more processors;

a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the medical data relationship mining method of any one of claims 1 to 5.