CN109215797B

CN109215797B - Method and system for extracting non-classification relation of traditional Chinese medicine medical case based on extended association rule

Info

Publication number: CN109215797B
Application number: CN201811031738.9A
Authority: CN
Inventors: 孙云杰; 陈阳
Original assignee: Shandong Management University
Current assignee: Shandong Management University
Priority date: 2018-09-05
Filing date: 2018-09-05
Publication date: 2022-04-08
Anticipated expiration: 2038-09-05
Also published as: CN109215797A

Abstract

The invention discloses a method for extracting non-classification relations from traditional Chinese medicine medical cases based on extended association rules, which comprises the following steps: step (1): constructing a traditional Chinese medicine medical record database; step (2): extracting triple templates from the traditional Chinese medicine medical record database based on the extended association rule; step (3): extracting triple concept pairs from the triple templates by adopting a Bootstrapping algorithm; step (4): for the extracted triple concept pairs, mapping each word into a word vector by using a word2vec model; step (5): for each query concept word, generating the vector of that query concept word, and calculating the cosine similarity between the vector of the query concept word and all word vectors from step (4); step (6): sorting the cosine similarities from high to low to obtain ontology non-classification relation concept pairs. The method overcomes the defect that the common association rule method cannot obtain the specific name of a non-classification relation.

Description

Method and system for extracting non-classification relation of traditional Chinese medicine medical case based on extended association rule

Technical Field

The invention relates to a method and a system for extracting non-classification relations from traditional Chinese medicine medical cases based on an extended association rule.

Background

An ontology is an important knowledge base, and the rich semantic information it contains can provide important support for research and applications in fields such as question-answering systems, information retrieval, the semantic Web and information extraction. How to construct ontologies quickly and effectively therefore has significant research value. Researchers have proposed a large number of methods for constructing ontologies efficiently from different perspectives. Generally, these methods can be divided into manual construction methods and methods that use automatic or semi-automatic techniques.

The manual ontology approach often requires an ontology specialist to participate in the entire process of construction. The method has the defects of high construction cost, low efficiency, strong subjectivity, inconvenience in transplantation and the like, so that the method is gradually replaced by a large number of ontology construction methods based on automatic and semi-automatic technologies. The automatic and semi-automatic construction method does not need (or only needs a small amount of) manual participation, can conveniently use the latest research results of other research fields (such as machine learning, natural language processing and the like), and can also conveniently use different data sources to carry out ontology construction. The data source of the traditional Chinese medicine medical record has the characteristics of large data volume and semi-structured data.

In ontology learning, extracting ontology non-classification relations is more complex than extracting ontology classification relations. Ontology non-classification relations are all relations other than ontology classification relations, such as whole-part relations and association relations between people and places. Current research on extracting ontology non-classification relations mainly focuses on judging whether a relation exists between concepts and cannot label the specific relation between them. At present there are three main methods for extracting ontology non-classification relations from Chinese text: pattern matching based on lexical features, relation extraction based on dependency grammar analysis, and analysis based on statistical association rules. The pattern matching method mainly extracts domain ontology relations of known relation types; the dependency grammar method is suited to extracting domain relations from simple sentences; and the method most widely used for extracting domain relations in an ontology is association rule mining, which in essence extracts relations by calculating the conditional probabilities of combinations of concepts with verbs and of concepts with concepts.

In the prior art, ontology concepts are extracted by methods based on rules and word frequency statistics; ontology classification relations are mainly extracted through statistical analysis of a language model and analysis of the syntactic features of the corpus; and ontology non-classification relations are extracted by pattern matching based on lexical rules, semantic relation extraction based on dependency syntax, and association rule mining based on word frequency statistics. The prior art has the following defects:

1) the method based on the rules neglects the consideration of the word frequency, and the method based on the word frequency statistics neglects the consideration of the part-of-speech characteristics;

2) some linguistic data have obvious syntactic characteristics, some data are not obvious, and a part of data can be omitted by simply considering the syntactic characteristics, so that the recall rate of the extraction result of the ontology classification relationship is reduced;

3) feature analysis must be carried out on a large amount of data, and the non-classification relation concepts that can be extracted from data whose features are not obvious are limited by the amount of data, so the extraction of ontology non-classification relations is restricted to a certain extent.

Disclosure of Invention

In order to solve the defects of the prior art, the invention provides a method and a system for extracting the non-classification relation of the traditional Chinese medicine medical case based on an extended association rule;

as a first aspect of the invention, a method for extracting non-classified relation of traditional Chinese medicine medical record based on extended association rule is provided;

the method for extracting the non-classification relation of the traditional Chinese medicine medical record based on the extended association rule comprises the following steps:

step (1): constructing a traditional Chinese medicine medical record database;

step (2): extracting a triple template from the traditional Chinese medicine medical record database based on the expanded association rule;

step (3): extracting triple concept pairs from the triple templates by adopting a Bootstrapping algorithm;

step (4): training a word2vec model on the extracted triple concept pairs, so that the word2vec model maps each word into a word vector;

step (5): for each query concept word, generating the vector of that query concept word, and calculating the cosine similarity between the vector of the query concept word and all word vectors from step (4);

step (6): sorting the cosine similarities from high to low to obtain ontology non-classification relation concept pairs.
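Steps (5) and (6) can be illustrated with a minimal sketch. It assumes the word vectors are already available as plain Python lists; the three-dimensional vectors and the names `cosine_similarity`, `rank_by_similarity` and `toy_vectors` are hypothetical and only stand in for vectors that a trained word2vec model would provide.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors):
    """Return (word, similarity) pairs sorted from high to low.

    The query word itself is excluded from the ranking.
    """
    q = vectors[query]
    scored = [(w, cosine_similarity(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors; real vectors would come from the trained model.
toy_vectors = {
    "enema": [1.0, 0.0, 0.0],
    "stool": [0.9, 0.1, 0.0],
    "numbness": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity("enema", toy_vectors)
```

In this toy data "stool" ranks first for the query "enema" because its vector points in nearly the same direction; a similarity threshold applied to such a ranking would then select the candidate concept pairs.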

As a further improvement of the invention, the step (1) comprises the following steps:

step (101): constructing a traditional Chinese medicine medical record dictionary;

step (102): based on the traditional Chinese medicine medical record dictionary, performing word segmentation on the traditional Chinese medicine medical record text by using the ICTCLAS word segmentation tool;

step (103): removing stop words and number words and deleting meaningless words;

step (104): generating candidate concepts of the traditional Chinese medicine medical records from the words obtained in step (103) by means of mutual information;

step (105): deleting incorrect traditional Chinese medicine medical record concepts and adding missing concepts to obtain the final traditional Chinese medicine medical record database.
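Step (104) does not specify how mutual information is computed. The sketch below assumes one common choice, pointwise mutual information (PMI) between adjacent segmented words, and treats high-scoring pairs as candidate multi-word concepts. The function names, the threshold and the English toy tokens are illustrative assumptions, not part of the patent.

```python
import math
from collections import Counter


def adjacent_pmi(sentences):
    """PMI(a, b) = log(P(a, b) / (P(a) * P(b))) for each adjacent word pair."""
    word_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


def candidate_concepts(sentences, threshold):
    """Join adjacent pairs whose PMI exceeds the (assumed) threshold."""
    scores = adjacent_pmi(sentences)
    return sorted(f"{a} {b}" for (a, b), s in scores.items() if s > threshold)


# Hypothetical segmented sentences (stop words already removed).
segmented = [
    ["patient", "abdominal", "pain", "severe"],
    ["abdominal", "pain", "relieved"],
    ["patient", "leg", "numbness"],
    ["patient", "sleep", "poor"],
]
pmi_scores = adjacent_pmi(segmented)
candidates = candidate_concepts(segmented, threshold=2.0)
```

PMI tends to favour rare pairs, so in practice it is usually combined with a minimum frequency filter before the expert review of step (105).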

As a further improvement of the invention, the step (2) comprises the following steps:

association rules are used to find hidden relationships between different sets of data items in the transactional database.

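Let $R = \{I_1, I_2, \dots, I_m\}$ be a set of items, and let $T = \{t_1, t_2, \dots, t_n\}$ be a set of transactions, where each transaction $t_i \subseteq R$. $X$ and $Y$ are 2 item sets with $X \subseteq R$, $Y \subseteq R$ and $X \cap Y = \varnothing$.

The confidence of the association rule $X \rightarrow Y$,

$$\mathrm{conf}(X \rightarrow Y) = P(Y \mid X) = \frac{|\{t \in T : X \cup Y \subseteq t\}|}{|\{t \in T : X \subseteq t\}|} \qquad (1)$$

indicates the probability that $Y$ appears given that the item set $X$ appears, and the support of the association rule $X \rightarrow Y$,

$$\mathrm{support}(X \rightarrow Y) = P(X \cup Y) = \frac{|\{t \in T : X \cup Y \subseteq t\}|}{|T|} \qquad (2)$$

represents the probability that the item sets $X$ and $Y$ occur simultaneously.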

The above association rules can be used to analyze the confidence and support of item sets in a classification relationship. For example, assume that the item set X consists of the most commonly used Chinese medicines (such as Gypsum Fibrosum, flos Lonicerae and radix Paeoniae alba) and the item set Y consists of heat-clearing and blood-cooling medicines (such as cortex moutan, radix rehmanniae and cornu bubali); the association between these items is found by constructing a set R and a set T.
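As a minimal sketch of the two measures, the following code computes support and confidence over a set of transactions, using the standard definitions of association rule mining. The prescriptions are invented toy data and the function names are assumptions made for illustration.

```python
def support(transactions, x, y):
    """Fraction of all transactions that contain every item of X and of Y."""
    both = x | y
    return sum(1 for t in transactions if both <= t) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0
    both = x | y
    return sum(1 for t in with_x if both <= t) / len(with_x)


# Hypothetical prescriptions, each treated as one transaction (a set of items).
prescriptions = [
    {"gypsum", "moutan bark", "rehmannia root"},
    {"gypsum", "moutan bark"},
    {"gypsum", "honeysuckle"},
    {"honeysuckle", "rehmannia root"},
]
x, y = {"gypsum"}, {"moutan bark"}
rule_support = support(prescriptions, x, y)        # 2 of 4 transactions
rule_confidence = confidence(prescriptions, x, y)  # 2 of the 3 containing gypsum
```

Here the rule gypsum → moutan bark holds in 2 of the 4 prescriptions (support 0.5) and in 2 of the 3 prescriptions that contain gypsum (confidence about 0.67).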

When the association rule is applied to non-categorical relationship extraction, the item set is $R = \{c_1, c_2, \dots, c_m\}$, where each $c_i$ denotes a concept between which non-categorical relationships may exist.

The transaction set is $T = \{t_1, t_2, \dots, t_n\}$, where $t_i$ represents a sentence containing at least 1 concept in $R$. $X = \{c_i\}$ and $Y = \{c_j\}$ represent the 2 items to be subjected to association rule mining.

If $\mathrm{conf}(X \rightarrow Y)$ and $\mathrm{support}(X \rightarrow Y)$ are both above a given threshold, this indicates that there is a strong connection, i.e. a non-categorical relationship, between the concepts $c_i$ and $c_j$.

The triple is a concept from data structures; in this example it serves as a storage template, and templates that meet the conditions are extracted from a large amount of text. Specifically, a triple (X, Y, Z) is formed from the row and column in which a non-zero element is located together with its value, and the triples are then stored according to a specific rule. Formulating this rule is the core of the extended association rule. Specifically, X represents a concept word that is a verb, in most cases an action, such as "enema"; Y represents a noun, such as "stool"; and Z represents a result, such as "rarefaction". If a medical case contains the sentence "the stool of the patient becomes rarefied after the enema treatment", the result is collected as "enema-stool-rarefaction"; other examples are "sleep-abdominal-pain" and "treatment-leg-numbness". After new concept pairs are extracted, the template relationships between the concept pairs are extracted again and the relationship templates are refined. Different medical cases express the same group of concept pairs in different ways, such as "abdominal pain" and "belly pain", so discovering new expressions between concept pairs refines the relationships between them; the discovered relationship templates in turn allow more concept pairs to be extracted, which refines the set of triple concept pairs.

The triple template in step (2) refers to: extracting triples (X, Y, Z) from the traditional Chinese medicine medical record database, wherein X is a verb, Y is a noun and Z is an adjective; the verb, noun and adjective of a triple template all appear in the same sentence;

the rule of extraction is: setting an extraction window as 1, taking a certain noun Y as a reference, positioning the extraction window on the noun Y at an initial moment, then moving the window forwards, judging whether a word in the current window is a verb, if so, extracting the verb to be used as X, otherwise, continuing to move the window forwards until a word with the part of speech being the verb is extracted, and ending; if the window moves to the head of the sentence and still has no verb, ending;

similarly, taking a certain noun Y as a reference, initially positioning the extraction window on the noun Y, then moving the window backwards, judging whether the word in the current window is an adjective, if so, extracting the adjective to be used as Z, otherwise, continuing moving the window backwards until the word with the part of speech being the adjective is extracted, and ending; if the window moves to the tail of the sentence and still has no adjective, ending;

the extracted X, Y and Z are combined as a triplet template.
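The window-based extraction rule can be sketched as follows. The sketch assumes the sentence has already been segmented and part-of-speech tagged, and that the tags "v", "n" and "a" mark verbs, nouns and adjectives; the tag set, the function name and the sample sentence are assumptions made for illustration. Nouns for which no verb or no adjective is found are skipped here, which is one possible reading of "ending".

```python
def extract_triples(tagged_sentence):
    """Extract (X, Y, Z) = (verb, noun, adjective) triples from one sentence.

    tagged_sentence is a list of (word, pos) pairs.
    """
    triples = []
    for i, (word, pos) in enumerate(tagged_sentence):
        if pos != "n":
            continue
        # Move a one-word window from the noun toward the head of the sentence
        # until the first verb is found.
        verb = None
        for j in range(i - 1, -1, -1):
            if tagged_sentence[j][1] == "v":
                verb = tagged_sentence[j][0]
                break
        # Move a one-word window from the noun toward the tail of the sentence
        # until the first adjective is found.
        adjective = None
        for j in range(i + 1, len(tagged_sentence)):
            if tagged_sentence[j][1] == "a":
                adjective = tagged_sentence[j][0]
                break
        if verb is not None and adjective is not None:
            triples.append((verb, word, adjective))
    return triples


# Hypothetical tagged sentence in source word order.
sentence = [
    ("patient", "n"),
    ("enema", "v"),
    ("after", "f"),
    ("stool", "n"),
    ("rarefaction", "a"),
]
triples = extract_triples(sentence)
```

For this sentence the noun "patient" has no preceding verb and yields nothing, while "stool" yields the triple (enema, stool, rarefaction).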

And (3) reestablishing the triple concept pairs which sufficiently represent the distribution of the triple templates by adopting a Bootstrapping algorithm and utilizing a limited triple template through repeated sampling for multiple times.

As a second aspect of the invention, a non-classified relation extraction system of the traditional Chinese medicine medical record based on the expanded association rule is provided;

the non-classification relation extraction system for traditional Chinese medicine medical cases based on the extended association rule comprises: a memory, a processor, and computer instructions stored in the memory and executed on the processor, wherein the computer instructions, when executed by the processor, perform the steps of any of the above methods.

As a third aspect of the present invention, there is provided a computer-readable storage medium;

a computer-readable storage medium, having a computer program running thereon, the computer program, when executed by a processor, performing the steps of any of the above methods.

Compared with the prior art, the invention has the beneficial effects that:

The non-classification relations in traditional Chinese medicine medical cases are extracted by different methods, which are compared and studied. After non-categorical relation concept pairs are extracted with statistics-based association rules, the corresponding non-categorical relation names are extracted through linguistic rules. After the basic templates are extracted, they are expanded by the Bootstrapping algorithm, which yields new concept pairs and new relation templates. For the concept pairs obtained after expansion, the vector space model obtained from Word2vec training is used to compute the similarity between the query concept word and the concepts in the triples; the concept pairs with ontology non-classification relations are then obtained by screening on this similarity. Ontology non-classification relation extraction is thus completed by calculating concept similarity, and the method overcomes the defect that the common association rule method cannot obtain the specific name of a non-classification relation.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.

FIG. 1 is a flow chart of the present invention.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

1 Non-categorical relationships of the traditional Chinese medicine medical record

With the rapid development of internet technology, network applications in the traditional Chinese medicine industry are becoming more and more widespread, bringing with them a large amount of collected and processed data. So that traditional Chinese medicine medical record data can be fully utilized for the benefit of the industry, it is necessary to construct a traditional Chinese medicine medical record ontology library. However, because traditional Chinese medicine medical records are numerous and non-standardized, extracting ontology relations (especially non-classification relations) is particularly difficult. The invention mainly provides an automatic learning approach based on an extended association rule to extract the non-classification relations in medical records, as shown in FIG. 1.

The ontology non-classification relation refers to relations other than the hierarchical (superordinate and subordinate) relations of the ontology. At present most research on ontology relations still remains at the level of extracting classification relations, and research on extracting ontology non-classification relations is scarce both in China and abroad. The method first extracts basic templates using an extended association rule-based method, then expands the triple templates from the extracted templates using a Bootstrapping-based method, and finally represents the word segmentation results of the traditional Chinese medicine medical records as word vectors and calculates the similarity between concepts to obtain a set of concepts with similarity relations.

2 Characteristics of non-categorical relationships of the traditional Chinese medicine medical record

The non-categorical relationships in an ontology are the relationships that remain in the domain once the categorical relationships are excluded (such as whole-part relations). Extracting them is more difficult than extracting classification relations. At present non-classification relations are mainly extracted by association rule-based mining, which extracts relations by calculating the co-occurrence frequency of two keywords and is essentially a statistics-based algorithm. A traditional Chinese medicine medical case is a medical record that documents the syndrome differentiation, prescription and medication when a traditional Chinese medicine practitioner treats a disease. Medical cases vary in writing method and style, and different doctors write them differently. Modern traditional Chinese medicine medical records have gradually absorbed the strengths of Western medical case records and are increasingly standardized, but they retain unique characteristics and inherit the traditional Chinese context of TCM.

3 Template extraction based on extended association rules

3.1 Concrete algorithm

The algorithm for extracting the non-classified relation by using the extended association rule method is as follows:

(1) randomly take 2 concepts c1 and c2 that have not yet been analyzed by the association rule from the ontology concept set; if no such concept pair remains, go to step (7);

(2) calculate the confidence $\mathrm{conf}(c_1 \rightarrow c_2)$ and the support $\mathrm{support}(c_1 \rightarrow c_2)$ according to equations (1) and (2);

(3) if both values are larger than the given thresholds, go to step (4); otherwise go to step (1);

(4) count all verbs that appear together with concepts c1 and c2 in a sentence, and their co-occurrence frequencies;

(5) if the co-occurrence frequency of a verb exceeds a given threshold, take the verb as a non-categorical relation name of concepts c1 and c2, with the concept that more often appears before the verb as the domain and the other concept as the range;

(6) after all such verbs have been checked, go to step (1);

(7) end.
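The steps above can be sketched as a single function. This is a simplified illustration under several assumptions: sentences are lists of (word, pos) pairs with "v" marking verbs, each concept pair is visited once in sorted order rather than randomly, and confidence is computed only in the direction c1 → c2. The function name, thresholds and toy sentences are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_relations(sentences, concepts, min_support, min_confidence, min_verb_count):
    """Return (domain, relation_name, range) triples for strongly associated pairs."""
    # Transaction set T: sentences that contain at least one concept.
    transactions = []
    for sent in sentences:
        words = {w for w, _ in sent}
        if words & concepts:
            transactions.append((sent, words))
    relations = []
    for c1, c2 in combinations(sorted(concepts), 2):            # step (1)
        both = [s for s, ws in transactions if c1 in ws and c2 in ws]
        if not both:
            continue
        n_c1 = sum(1 for _, ws in transactions if c1 in ws)
        sup = len(both) / len(transactions)                      # step (2)
        conf = len(both) / n_c1
        if sup <= min_support or conf <= min_confidence:         # step (3)
            continue
        verb_counts, c1_before, c2_before = Counter(), Counter(), Counter()
        for sent in both:                                        # step (4)
            words = [w for w, _ in sent]
            i1, i2 = words.index(c1), words.index(c2)
            for k, (w, pos) in enumerate(sent):
                if pos == "v":
                    verb_counts[w] += 1
                    c1_before[w] += i1 < k
                    c2_before[w] += i2 < k
        for verb, count in verb_counts.items():                  # steps (5)-(6)
            if count > min_verb_count:
                if c1_before[verb] >= c2_before[verb]:
                    relations.append((c1, verb, c2))
                else:
                    relations.append((c2, verb, c1))
    return relations


# Hypothetical tagged sentences and concept set.
corpus = [
    [("gypsum", "n"), ("clears", "v"), ("heat", "n")],
    [("gypsum", "n"), ("clears", "v"), ("heat", "n"), ("quickly", "d")],
    [("honeysuckle", "n"), ("clears", "v"), ("heat", "n")],
    [("gypsum", "n"), ("is", "v"), ("cold", "a")],
]
found = mine_relations(corpus, {"gypsum", "heat", "honeysuckle"},
                       min_support=0.4, min_confidence=0.6, min_verb_count=1)
```

With these toy thresholds only the pair (gypsum, heat) passes, and "clears" becomes the relation name with "gypsum" as the domain because it precedes the verb in both sentences.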

3.2 corpus selection

Because no standard corpus for ontology learning currently exists in the field of traditional Chinese medicine medical cases, the experimental corpus is selected from a text classification corpus of 20000 medical cases. The concept extraction process is divided into the following three steps.

The first step: preprocessing.

① Constructing a domain dictionary. Seed concepts are selected to build a dictionary, which is added to the ICTCLAS word segmentation tool so that domain concepts are not segmented into scattered strings.

② Chinese word segmentation. The traditional Chinese medicine medical record corpus is segmented into words, stop words and number words are removed, and meaningless words are filtered out.

The second step: extracting candidate concepts.

Candidate concepts of the traditional Chinese medicine medical records are generated through strategies such as mutual information and rule filtering.

The third step: expert correction.

Domain experts delete incorrect traditional Chinese medicine medical record concepts and add missing concepts to obtain the final ontology concept set.
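The filtering part of the preprocessing step can be sketched as below. The stop-word list and the numeric pattern are placeholders chosen for illustration; a real pipeline would use a Chinese stop-word list and would also need to handle Chinese numerals, which this sketch does not.

```python
import re

# Hypothetical stop-word list; not taken from the patent.
STOP_WORDS = {"the", "of", "and"}
# Matches tokens made only of digits, optionally with a decimal part.
NUMBER_RE = re.compile(r"^\d+(\.\d+)?$")


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words, purely numeric tokens and empty tokens."""
    return [
        t for t in tokens
        if t.strip() and t not in stop_words and not NUMBER_RE.match(t)
    ]


cleaned = clean_tokens(["the", "patient", "took", "3", "doses", "of", "gypsum"])
```

The surviving tokens are the input from which candidate concepts are generated in the second step.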

4 Experiments and result analysis

For Chinese text, ontology non-classification relations are mostly extracted with association rules and shallow semantic analysis; Bootstrapping-based methods have not previously been applied to ontology relation extraction from traditional Chinese medicine medical records.

The Bootstrapping algorithm repeatedly extracts with a limited set of templates and keeps iterating with the newly obtained templates until convergence. For example, the triple concept "enema-stool-rarefaction" can initially be extracted through the template "X after Y Z", in which "after" is the literal word between X and Y; the extracted triple is then analyzed to obtain more matching templates that fit the characteristics of this concept pair, and iterating with the newly found templates yields more new triple instances.
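The iteration described above can be sketched with a small template matcher. This is an assumed, simplified design: a template is a tuple of tokens in which "X", "Y" and "Z" are placeholders and every other token must match literally, and new templates are induced from sentences that contain a known triple in order. The seed template ("X", "after", "Y", "Z") is a hypothetical English rendering of the example template, and the corpus is invented.

```python
def match(template, tokens):
    """Yield (x, y, z) for each contiguous window of tokens that fits the template."""
    n = len(template)
    for start in range(len(tokens) - n + 1):
        slots = {}
        for pattern, token in zip(template, tokens[start:start + n]):
            if pattern in ("X", "Y", "Z"):
                slots[pattern] = token
            elif pattern != token:
                break
        else:
            yield (slots["X"], slots["Y"], slots["Z"])


def induce(triple, tokens):
    """Build a template from a sentence containing x, y, z in that order, else None."""
    x, y, z = triple
    try:
        i = tokens.index(x)
        j = tokens.index(y, i + 1)
        k = tokens.index(z, j + 1)
    except ValueError:
        return None
    names = {i: "X", j: "Y", k: "Z"}
    return tuple(names.get(p, tokens[p]) for p in range(i, k + 1))


def bootstrap(sentences, seed_templates, max_rounds=10):
    """Alternate between extracting triples and inducing templates until stable."""
    templates, triples = set(seed_templates), set()
    for _ in range(max_rounds):
        new_triples = {t for tpl in templates for s in sentences for t in match(tpl, s)}
        new_triples -= triples
        triples |= new_triples
        induced = {induce(t, s) for t in triples for s in sentences}
        new_templates = {tpl for tpl in induced if tpl is not None} - templates
        templates |= new_templates
        if not new_triples and not new_templates:
            break
    return templates, triples


# Hypothetical segmented sentences.
docs = [
    ["enema", "after", "stool", "rarefaction"],
    ["enema", "then", "stool", "becomes", "rarefaction"],
    ["treatment", "then", "leg", "becomes", "numbness"],
]
learned_templates, learned_triples = bootstrap(docs, [("X", "after", "Y", "Z")])
```

Starting from one seed template, the first round finds "enema-stool-rarefaction", which induces the new template "X then Y becomes Z"; the second round uses it to find "treatment-leg-numbness". A practical system would also score templates and triples to limit the semantic drift that unrestricted bootstrapping is prone to.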

The templates used in the invention are obtained from the data by extraction based on the extended association rule, and the triple concepts are then expanded by the Bootstrapping method.

During triple template extraction, because of the particular part of speech of the concepts (most of them are nouns), the invention sets the extraction window before and after the relevant word in the template to 1 when extracting triple concept templates with Bootstrapping, and extracts and combines the words within that window; the relevant words in the template are nouns. This completes the extraction of new concept pairs in Bootstrapping. After the new concept pairs are extracted, the template relationships between the concept pairs are extracted again and the relationship templates are refined. The same group of concept pairs can be expressed in different ways, so discovering new expressions between concept pairs refines the relationships between them; the discovered relationship templates in turn allow more concept pairs to be extracted, which refines the set of triple concept pairs.

5 Ontology non-categorical relationship extraction

The method uses Word2vec to train on the text. Through training each word is mapped to a K-dimensional real-valued vector, and the semantic similarity between words is determined by the distance between their vectors (for example Euclidean distance or cosine similarity). One core technique is Huffman coding based on word frequency: words with similar frequencies activate essentially the same hidden-layer content, and the more frequent a word is, the fewer hidden-layer nodes it activates, which effectively reduces the computational complexity.
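To illustrate the frequency-based Huffman coding mentioned here: in a Huffman tree built from word frequencies, frequent words sit closer to the root and so receive shorter codes, and in a hierarchical softmax the code length corresponds to the number of binary decisions evaluated for that word. The sketch below computes only the code lengths; the word frequencies and the function name are hypothetical, and this is not the word2vec implementation itself.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {word: code length} for a Huffman tree built from word frequencies."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every word inside them
        # moves one level deeper, so its code grows by one bit.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


# Hypothetical word frequencies.
lengths = huffman_code_lengths(
    {"patient": 50, "stool": 20, "enema": 10, "numbness": 5, "rarefaction": 5}
)
```

In this toy vocabulary the most frequent word "patient" gets a 1-bit code while the rarest words get 4-bit codes, so the average cost per training word is much lower than evaluating an output for every vocabulary entry.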

At present, research on ontology non-classification relations is still scarce both in China and abroad, and most of it remains at the level of determining whether a relation exists between a pair of concepts without determining the specific non-classification relation. The main method adopted in domestic research on ontology non-classification relations is rule-based. Unlike traditional extraction methods, the present method extracts non-classification relations in the word vector space on the basis of the obtained triples. Word2vec is trained on the traditional Chinese medicine medical record data and the newspaper data, and the similarity between concept pairs that have a certain similarity relation is then calculated from the vectors produced by training. The similarities are compared and the pairs are filtered according to rules to obtain the final concept pairs with non-classification relations, which completes the extraction of ontology non-classification relations.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. The method for extracting the non-classification relation of the traditional Chinese medicine medical record based on the extended association rule is characterized by comprising the following steps of:

step (1): constructing a traditional Chinese medicine medical record database;

step (2): extracting a triple template from the traditional Chinese medicine medical record database based on the expanded association rule; the rule of extraction is: setting an extraction window as 1, taking a certain noun Y as a reference, positioning the extraction window on the noun Y at an initial moment, then moving the window forwards, judging whether a word in the current window is a verb, if so, extracting the verb to be used as X, otherwise, continuing to move the window forwards until a word with the part of speech being the verb is extracted, and ending; if the window moves to the head of the sentence and still has no verb, ending;

similarly, taking a certain noun Y as a reference, initially positioning the extraction window on the noun Y, then moving the window backwards, judging whether the word in the current window is an adjective, if so, extracting the adjective to be used as Z, otherwise, continuing moving the window backwards until the word with the part of speech being the adjective is extracted, and ending; if the window moves to the tail of the sentence and still has no adjective, ending; combining the extracted X, Y and Z to serve as a triple template;

2. The method of claim 1, wherein the method for extracting non-classified relations between TCM medical records based on extended association rules,

the step (1) comprises the following steps:

a step (101): constructing a Chinese medical record dictionary;

3. The method of claim 1, wherein the method for extracting non-classified relations between TCM medical records based on extended association rules,

the triple template in the step (2) refers to: extracting triples (X, Y, Z) from a traditional Chinese medicine medical record database, wherein X is a verb; Y is a noun; Z is an adjective; verbs, nouns and adjectives in the triple template all appear in one sentence at the same time.

4. The method of claim 1, wherein the method for extracting non-classified relations between TCM medical records based on extended association rules,

5. The non-classification relation extraction system of the traditional Chinese medicine medical record based on the extended association rule is characterized by comprising: a memory, a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the steps of the method of any of claims 1-4.

6. A computer-readable storage medium, on which a computer program is run, which computer program, when being executed by a processor, is adapted to carry out the steps of the method of any one of the preceding claims 1 to 4.