CN117854715A

CN117854715A - Intelligent diagnosis assisting system based on inquiry analysis

Info

Publication number: CN117854715A
Application number: CN202410263266.9A
Authority: CN
Inventors: 张莹宗; 黄镇; 李亚彭; 张彦周
Original assignee: Shenzhen Aidi Pharmaceutical Technology Co ltd; First Affiliated Hospital of Zhengzhou University
Current assignee: Shenzhen Aidi Pharmaceutical Technology Co ltd; First Affiliated Hospital of Zhengzhou University
Priority date: 2024-03-08
Filing date: 2024-03-08
Publication date: 2024-04-09
Anticipated expiration: 2044-03-08
Also published as: CN117854715B

Abstract

The invention relates to the technical field of natural language processing, in particular to an intelligent diagnosis assisting system based on inquiry analysis. The system performs the following operations: acquiring an intelligent dialogue diagnosis and treatment data set and the node entities in a knowledge graph database; decomposing each sentence into a plurality of word sequence segments; obtaining the occurrence probability of each word sequence segment; obtaining the potential long-tail entity degree of each word sequence segment according to the dependency relationships among the words and the occurrence probability of the segment; constructing a long-tail entity matching mark sequence, and acquiring candidate long-tail entities according to the long-tail entity matching mark sequence and the potential long-tail entity degree; calculating the long-tail entity confidence of the candidates to form a long-tail entity set; and calculating the semantic space similarity between the node entities and the long-tail entities, acquiring the link entity of each long-tail entity, and constructing a knowledge graph based on the link entities to complete the intelligent diagnosis assisting system. The invention alleviates the semantic understanding deviation caused by the low occurrence frequency of long-tail entities.

Description

Intelligent diagnosis assisting system based on inquiry analysis

Technical Field

The invention relates to the technical field of natural language processing, in particular to an intelligent diagnosis assisting system based on inquiry analysis.

Background

After a tentative diagnosis is obtained through inquiry, a number of examinations or treatments are screened out to form a medical knowledge base, and an intelligent diagnosis assisting system, that is, a system capable of prompting the required examinations and treatments, is constructed by means of natural language processing technology. Through such a system, professional medical resources can be shared throughout the country, helping patients obtain timely medical consultation and diagnosis and providing valuable data support for medical research. Named entity recognition is an important step in constructing an intelligent diagnosis assisting system. However, entities in the medical field are difficult to recognize because of the long-tail problem, i.e., many rare diseases occur only infrequently in the training data, and this degrades the performance of the resulting intelligent diagnosis assisting system.

Disclosure of Invention

In order to solve the technical problem that an entity cannot be identified due to the long tail problem, the invention provides an intelligent diagnosis assisting system based on inquiry analysis, and the adopted technical scheme is as follows:

the invention provides an intelligent diagnosis assisting system based on inquiry analysis, which comprises the following modules:

the diagnosis data acquisition module is used for acquiring the intelligent dialogue diagnosis and treatment data set and the node entity in the knowledge graph database;

the potential long tail entity degree acquisition module is used for decomposing each sentence in the intelligent dialogue diagnosis and treatment data set into a plurality of word sequence fragments; acquiring the occurrence probability of the word sequence fragments in the intelligent dialogue diagnosis and treatment data set according to the occurrence times of the word sequence fragments; obtaining the potential long tail entity degree of the word sequence fragments according to the dependency relationship among the words and the occurrence probability of the word sequence fragments in the intelligent dialogue diagnosis and treatment data set;

the long-tail entity set acquisition module is used for marking the word sequence segments as potential long-tail entities, constructing long-tail entity matching mark sequences according to the number of words in each sentence, and screening the potential long-tail entities according to the long-tail entity matching mark sequences and the potential long-tail entity degree to acquire candidate long-tail entities; acquiring long-tail entity confidence coefficients of the candidate long-tail entities according to semantic vectors of all words of the sentences, semantic vectors of words in the candidate long-tail entities, the lengths of the sentences and the lengths of the candidate long-tail entities, and forming a long-tail entity set by the candidate long-tail entities with the long-tail entity confidence coefficients larger than a preset threshold value;

the link entity acquisition module is used for calculating the normalized Google distance between the long-tail entity and the node entity according to the distribution condition of the long-tail entity and the node entity in the intelligent dialogue diagnosis and treatment data set; acquiring semantic space similarity of the long-tail entity and the node entity according to the normalized Google distance of the long-tail entity and the node entity and the semantic vectors of the long-tail entity and the node entity; obtaining a link entity of the long-tail entity according to the semantic space similarity;

the intelligent diagnosis assisting system construction module is used for acquiring a knowledge graph for inquiry according to the long tail entity and the link entity thereof to complete the intelligent diagnosis assisting system.

Preferably, the method for decomposing each sentence in the intelligent dialogue diagnosis and treatment data set into a plurality of word sequence segments comprises the following steps:

sentences are converted into word sequences using jieba segmentation, in which any number of consecutive words are noted as a word sequence segment.

Preferably, the method for obtaining the probability of occurrence of the word sequence segment in the intelligent dialogue diagnosis and treatment data set according to the occurrence times of the word sequence segment comprises the following steps:

presetting a first number, marking word sequence fragments consisting of a first number of continuous words as target word sequence fragments, marking the number which is one less than the first number as second number, marking the word sequence fragments consisting of a second number of continuous words as second word sequence fragments, and marking the second word sequence fragments as second target word sequence fragments if all words in the second word sequence fragments are in the target word sequence fragments and the second word sequence fragments do not contain the last word of the target word sequence fragments;

and acquiring the number of occurrences of the target word sequence segment and of the second target word sequence segment in the intelligent dialogue diagnosis and treatment data set, and recording the ratio of the number of occurrences of the target word sequence segment to the number of occurrences of the second target word sequence segment as the probability of occurrence of the target word sequence segment in the intelligent dialogue diagnosis and treatment data set.

Preferably, the method for obtaining the potential long tail entity degree of the word sequence segment according to the dependency relationship among the words and the probability of occurrence of the word sequence segment in the intelligent dialogue diagnosis and treatment data set comprises the following steps:

dependency syntactic analysis is used for obtaining dependency relations among vocabularies in the intelligent dialogue diagnosis and treatment data set, and the expression of the potential long tail entity degree is as follows:

where $P(W_{i-N+1}^{i-1})$ represents the probability of occurrence in the intelligent dialogue diagnosis and treatment data set of the word sequence segment $W_{i-N+1}^{i-1}$, i.e. the segment of length N-1 ending with the (i-1)-th word; $P(W_{i-N+1}^{i})$ represents the probability of occurrence in the intelligent dialogue diagnosis and treatment data set of the word sequence segment $W_{i-N+1}^{i}$, i.e. the segment of length N ending with the i-th word; $D_{i,j}$ represents the dependency relationship between the i-th word and the j-th word, taking the value 1 if a dependency relationship exists and 0 otherwise; N represents the length of the word sequence segment; and $Q_i^N$ represents the potential long tail entity degree of the word sequence segment ending with the i-th word and having a length of N.

Preferably, the method for constructing the long tail entity matching tag sequence according to the number of words in each sentence comprises the following steps:

the length of the long tail entity matching mark sequence is the same as the number of words in the sentence, each word of the sentence corresponds to one element of the long tail entity matching mark sequence, and every element of the long tail entity matching mark sequence is initially marked as 0.

Preferably, the method for screening the potential long tail entity to obtain the candidate long tail entity according to the long tail entity matching mark sequence and the potential long tail entity degree comprises the following steps:

extracting potential long tail entities sequentially from large to small according to the degree of the potential long tail entities, extracting one potential long tail entity each time, finding out corresponding elements in a long tail entity matching mark sequence, if all the elements are 0, storing the potential long tail entities, and marking all the elements as 1; if one of the elements is not '0', the potential long-tail entity is not stored, and the next potential long-tail entity is continuously traversed until all the potential long-tail entities are traversed;

and marking the stored potential long tail entities as candidate long tail entities, wherein all the candidate long tail entities form a long tail entity candidate set.

Preferably, the method for obtaining the confidence coefficient of the long tail entity of the candidate long tail entity according to the semantic vectors of all the words of the sentence, the semantic vectors of the words in the candidate long tail entity, the length of the sentence and the length of the candidate long tail entity comprises the following steps:

the GloVe word vector model is used for obtaining the semantic vector of each word in the sentence, and the expression of the confidence coefficient of the long-tail entity is as follows:

$$Z_{s,r} = \cos\left(\frac{1}{n}\sum_{j=1}^{n} v_j,\ \frac{1}{m_{s,r}}\sum_{k=1}^{m_{s,r}} u_{s,r,k}\right)$$

where $v_j$ represents the semantic vector of the j-th word in the sentence; $n$ represents the word sequence length of the sentence; $m_{s,r}$ represents the length of the r-th candidate long-tail entity in the long-tail entity candidate set s of the sentence; $u_{s,r,k}$ represents the semantic vector of the k-th word in the r-th candidate long-tail entity in the long-tail entity candidate set s of the sentence; $\cos(\cdot,\cdot)$ represents the cosine similarity of its two arguments; and $Z_{s,r}$ represents the long-tail entity confidence of the r-th candidate long-tail entity in the long-tail entity candidate set s of the sentence.

Preferably, the method for calculating the normalized Google distance between the long-tail entity and the node entity according to the distribution condition of the long-tail entity and the node entity in the intelligent dialogue diagnosis and treatment data set comprises the following steps:

$$NGD_{p,q} = \frac{\max(\log f_p, \log f_q) - \log f_{p,q}}{\log F - \min(\log f_p, \log f_q)}$$

where $f_p$ represents the number of sentences in the intelligent dialogue diagnosis and treatment data set in which the p-th long tail entity of the final long tail entity set occurs; $f_q$ represents the number of sentences in the intelligent dialogue diagnosis and treatment data set in which the q-th node entity in the knowledge graph occurs; $f_{p,q}$ represents the number of sentences in the intelligent dialogue diagnosis and treatment data set in which the q-th node entity in the knowledge graph and the p-th long tail entity of the final long tail entity set occur simultaneously; $NGD_{p,q}$ represents the normalized Google distance between the q-th node entity in the knowledge graph and the p-th long tail entity of the final long tail entity set; and F represents the total number of sentences in the intelligent dialogue diagnosis and treatment data set.

Preferably, the method for obtaining the semantic spatial similarity of the long-tail entity and the node entity according to the normalized Google distance of the long-tail entity and the node entity and the respective semantic vectors of the long-tail entity and the node entity comprises the following steps:

semantic vectors of long-tail entities and node entities are acquired by using a mode of acquiring semantic vectors of words;

and calculating cosine similarity of the semantic vector of the long-tail entity and the semantic vector of the node entity, and taking the ratio of the cosine similarity to the normalized Google distance of the long-tail entity and the node entity as the semantic space enhancement similarity between the long-tail entity and the node entity.

Preferably, the method for obtaining the link entity of the long-tail entity according to the semantic spatial similarity comprises the following steps:

for each long-tail entity, calculating semantic space similarity between the long-tail entity and all node entities, and taking the node entity with the largest semantic space similarity as a link entity of the long-tail entity.

The invention has the following beneficial effects: the method analyzes the distribution of words in sentences and calculates the potential long-tail entity degree of different consecutive segments, thereby screening out the long-tail entity candidate set of each sentence. The final long-tail entity set of each sentence is then determined by calculating the confidence of each candidate long-tail entity. Based on the identified long-tail entity set, entity linking is performed against a public knowledge graph database of the medical field to enrich the information of the long-tail entities, and an intelligent diagnosis assisting knowledge graph is constructed on this basis, thereby realizing an intelligent diagnosis assisting system based on inquiry analysis. Named entity recognition is a key step in constructing the knowledge graph, and the accuracy of the knowledge graph directly affects the performance of the intelligent diagnosis assisting system. By identifying long-tail entities in the medical field and semantically enhancing them through entity linking, the method effectively improves the system's understanding of long-tail entities and alleviates the semantic understanding deviation caused by their low occurrence frequency.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of an intelligent diagnosis-assisting system based on inquiry analysis according to one embodiment of the present invention;

fig. 2 is a flowchart of an implementation of an intelligent diagnosis-assisting system based on inquiry analysis according to an embodiment of the present invention.

Detailed Description

In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following detailed description is given below of the specific implementation, structure, characteristics and effects of the intelligent diagnosis assisting system based on the inquiry analysis according to the invention by combining the accompanying drawings and the preferred embodiment. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

An intelligent diagnosis assisting system embodiment based on inquiry analysis:

the following specifically describes a specific scheme of the intelligent diagnosis assisting system based on inquiry analysis provided by the invention with reference to the accompanying drawings.

Referring to fig. 1, a flowchart of an intelligent diagnosis assisting system based on inquiry analysis according to an embodiment of the present invention is shown, where the system includes: the system comprises a diagnosis data acquisition module, a potential long-tail entity degree acquisition module, a long-tail entity set acquisition module, a link entity acquisition module and an intelligent diagnosis assisting system construction module.

Diagnosis data acquisition module. To build an intelligent diagnosis assisting system for inquiry analysis, the subject described in an inquiry sentence, namely entities such as disease names and drug names in the sentence, must first be understood. To improve the performance of the intelligent diagnosis assisting system, this embodiment analyzes a data set from the medical field. The data set can be obtained from the Chinese medical information processing evaluation benchmark CBLUE, which contains data sets for a plurality of tasks, and the intelligent dialogue diagnosis and treatment data set is obtained from it. The node entities are then acquired from the knowledge graph database DrugBank.

So far, the node entities in the intelligent dialogue diagnosis and treatment data set and the knowledge graph database are obtained.

Potential long tail entity degree acquisition module. In the knowledge graph database DrugBank, each disease name or drug name corresponds to exactly one node entity, whereas in practice diseases and drugs also have aliases that do not exist in DrugBank. The entities occurring in the intelligent dialogue diagnosis and treatment data set are therefore extracted and compared with the node entities in order to find the node entities they are similar to.

Since Chinese text has no explicit word boundary markers such as the spaces between words in English text, the first step in named entity recognition is to determine the word boundaries, i.e. word segmentation. In this embodiment, each original text sentence is converted into a word sequence using the segmentation result of the precise mode of the jieba Chinese word segmentation tool.

An N-gram model is constructed from the intelligent dialogue diagnosis and treatment data set of inquiry data. The N-gram model is a statistical natural language processing model that calculates, for each sentence, the probability of occurrence of word segments consisting of N consecutive words. For each sentence, the word sequence after word segmentation is recorded as $W = \{w_1, w_2, \ldots, w_n\}$. For example, for the sentence "I have recently had frequent headaches, cough and fever.", the word sequence after word segmentation is W = {"I", "recently", "frequently", "headache", "cough", "and", "fever", "."}, where $w_1$ corresponds to "I". The N-gram models in common use are the bigram (Bi-Gram) and the trigram (Tri-Gram). Taking the Tri-Gram as an example, i.e. when two words have already appeared, the probability of occurrence of the third word is given by:

$$P(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}$$

where $C(w_{i-2} w_{i-1} w_i)$ represents the number of consecutive occurrences of the word sequence segment $\{w_{i-2}, w_{i-1}, w_i\}$ in the intelligent dialogue diagnosis and treatment data set; $C(w_{i-2} w_{i-1})$ represents the number of consecutive occurrences of the word sequence segment $\{w_{i-2}, w_{i-1}\}$ in the intelligent dialogue diagnosis and treatment data set; and $P(w_i \mid w_{i-2} w_{i-1})$ represents the probability of consecutive occurrence of the word sequence segment $\{w_{i-2}, w_{i-1}, w_i\}$ in the intelligent dialogue diagnosis and treatment data set.
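The count ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentences are taken to be already segmented into word lists, and the function names `ngram_counts` and `segment_probability` as well as the toy corpus are hypothetical, not part of the embodiment.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every run of n consecutive words across all tokenized sentences."""
    counts = Counter()
    for words in sentences:
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts


def segment_probability(segment, sentences):
    """Count of the segment divided by the count of the segment without its last word."""
    segment = tuple(segment)
    if len(segment) < 2:
        raise ValueError("segment needs at least two words")
    full = ngram_counts(sentences, len(segment))[segment]
    prefix = ngram_counts(sentences, len(segment) - 1)[segment[:-1]]
    return full / prefix if prefix else 0.0


# Toy corpus, already segmented into words (hypothetical data).
corpus = [
    ["I", "often", "have", "headache"],
    ["I", "often", "have", "fever"],
    ["I", "often", "cough"],
]
print(segment_probability(["I", "often", "have"], corpus))         # 2 occurrences / 3 prefix occurrences
print(segment_probability(["often", "have", "headache"], corpus))  # 1 occurrence / 2 prefix occurrences
```

Recounting the whole corpus on every call keeps the sketch short; a practical implementation would presumably cache the counts per segment length.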

In this way the probability of consecutive occurrence can be calculated for word sequence segments of different lengths, and the formula is used to calculate, for every word sequence segment of every sentence, its probability of consecutive occurrence in the intelligent dialogue diagnosis and treatment data set. A word sequence segment is a segment formed by several consecutive words. Dependency syntax analysis is then performed on each sentence. Its purpose is to identify the syntactic relationships between the words of a sentence, such as subject-predicate, attributive and verb-object relationships. The input of dependency syntax analysis is usually a sentence, and the output is the dependency relationship between each word and the other words. Dependency syntax analysis is a well-known technique and is not described further here.

When the occurrence probability of a word sequence segment is low and dependency relationships exist among the words of the segment, the segment is considered likely to be a long tail entity, and the potential long tail entity degree is calculated according to the following formula:

where $P(W_{i-N+1}^{i-1})$ represents the probability of consecutive occurrence of the word sequence segment $W_{i-N+1}^{i-1}$ in the intelligent dialogue diagnosis and treatment data set; $P(W_{i-N+1}^{i})$ represents the probability of consecutive occurrence of the word sequence segment $W_{i-N+1}^{i}$ in the intelligent dialogue diagnosis and treatment data set; $D_{i,j}$ represents the dependency relationship between the i-th word and the j-th word, taking the value 1 if a dependency relationship exists and 0 otherwise; N represents the length of the word sequence segment; and $Q_i^N$ represents the potential long tail entity degree of the word sequence segment ending with the i-th word and having a length of N. A long tail entity is an entity whose structural information in the knowledge graph is not abundant.

Here, the word sequence segment of length N-1 ending with the (i-1)-th word of the sentence is recorded as $W_{i-N+1}^{i-1}$, and the word sequence segment of length N ending with the i-th word is recorded as $W_{i-N+1}^{i}$.

The larger the ratio of the occurrence probability of $W_{i-N+1}^{i-1}$ to that of $W_{i-N+1}^{i}$ in the intelligent dialogue diagnosis and treatment data set, the less common the word sequence segment of length N ending with the i-th word. The greater the dependency between the last word $w_i$ of $W_{i-N+1}^{i}$ and the words of $W_{i-N+1}^{i-1}$, the stronger the association between the last word and the rest of the segment. In both cases the word sequence segment is more likely to be a potential long tail entity.

Conversely, the smaller the ratio of the occurrence probability of $W_{i-N+1}^{i-1}$ to that of $W_{i-N+1}^{i}$, the more common the consecutive segment of length N ending with the i-th word; and the smaller the dependency between the last word $w_i$ and the words of $W_{i-N+1}^{i-1}$, the weaker the association between the last word and the rest of the segment. In both cases the word sequence segment is less likely to be a potential long tail entity.
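The qualitative behaviour described above does not by itself fix a unique expression. One form consistent with it, given here purely as an assumption, multiplies the probability ratio by the number of dependency links between the last word of the segment and the words preceding it:

```latex
Q_i^N \;=\; \frac{P\left(W_{i-N+1}^{i-1}\right)}{P\left(W_{i-N+1}^{i}\right)}
\times \sum_{j=i-N+1}^{i-1} D_{i,j}
```

In this sketch $W_{a}^{b}$ denotes the consecutive words from the a-th to the b-th word of the sentence, $P(\cdot)$ is the occurrence probability of a segment in the data set, $D_{i,j}$ is 1 if the i-th and j-th words are linked by a dependency relationship and 0 otherwise, and $Q_i^N$ is the potential long tail entity degree of the segment of length N ending with the i-th word. The value grows when the full segment is rare relative to its prefix and when the last word depends on more of the preceding words, and it is 0 when no dependency exists.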

Thus, the potential long tail entity degree of each word sequence segment is obtained.
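As a rough illustration of how such a degree might be computed, the sketch below assumes a product form: the ratio of the prefix probability to the segment probability, multiplied by the number of dependency links between the last word and the earlier words. Both this form and the name `potential_long_tail_degree` are assumptions made for the example, not the expression of the embodiment.

```python
def potential_long_tail_degree(p_prefix, p_segment, dependency_flags):
    """Assumed form: (P(prefix) / P(segment)) * number of dependency links.

    p_prefix         -- occurrence probability of the first N-1 words
    p_segment        -- occurrence probability of the full N-word segment
    dependency_flags -- one 0/1 flag per earlier word: 1 if the last word
                        has a dependency relationship with that word
    """
    if p_segment <= 0:
        raise ValueError("p_segment must be positive")
    return (p_prefix / p_segment) * sum(dependency_flags)


# Hypothetical probabilities for three segments of length 3.
rare_linked = potential_long_tail_degree(0.5, 0.05, [1, 1])
common_linked = potential_long_tail_degree(0.5, 0.4, [1, 1])
rare_unlinked = potential_long_tail_degree(0.5, 0.05, [0, 0])
print(rare_linked, common_linked, rare_unlinked)
```

Under this assumed form, a rare segment whose last word depends on the preceding words scores highest, a common segment scores lower, and a segment without any dependency link scores zero, which matches the qualitative behaviour described in the text.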

Long-tail entity set acquisition module. The preceding steps yield the potential long-tail entity degree of each word sequence segment. Each word sequence segment is treated as a potential long-tail entity, and its potential long-tail entity degree indicates how likely it is to be a long-tail entity. Because the degree is calculated for every segment, the words of different potential long-tail entities may overlap, whereas a given word can belong to only one entity: when two potential long-tail entities share words, the shared words can be assigned to only one of them. In addition, a sentence may contain several potential long-tail entities of different lengths. A long-tail entity candidate set is therefore constructed according to the potential long-tail entity degree of the potential long-tail entities, as follows:

All potential long-tail entities in a sentence are sorted in descending order of potential long-tail entity degree, and a long-tail entity matching mark sequence is created. The length of this sequence equals the word sequence length of the sentence, that is, each element of the sequence corresponds to one word of the sentence, and every element is initialized to 0.

The potential long-tail entities are then traversed in descending order of potential long-tail entity degree, starting with the one whose degree is largest. If the elements of the long-tail entity matching mark sequence corresponding to all words of the current potential long-tail entity are 0, the potential long-tail entity is stored and those elements are set to 1. If any of the corresponding elements is not 0, the potential long-tail entity is discarded and no element is changed. The traversal continues until all potential long-tail entities have been visited. The stored potential long-tail entities form the long-tail entity candidate set, and each of them is recorded as a candidate long-tail entity.
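The screening procedure is a greedy selection of non-overlapping segments and can be sketched as follows. The representation of a potential long-tail entity as a `(start, end, degree)` tuple with 0-based, end-exclusive word indices, and the name `select_candidates`, are assumptions made for this example.

```python
def select_candidates(sentence_length, potentials):
    """Greedily keep the highest-degree segments that do not overlap.

    potentials -- list of (start, end, degree); start and end are 0-based
                  word indices into the sentence, end exclusive.
    """
    marks = [0] * sentence_length  # long-tail entity matching mark sequence
    kept = []
    for start, end, degree in sorted(potentials, key=lambda p: p[2], reverse=True):
        if all(mark == 0 for mark in marks[start:end]):
            kept.append((start, end, degree))
            marks[start:end] = [1] * (end - start)  # claim these words
        # otherwise at least one word is already taken: discard the segment
    return kept, marks


# Hypothetical segments for a sentence of 8 words.
potentials = [(0, 3, 4.0), (2, 5, 9.0), (5, 7, 2.5), (1, 2, 1.0)]
kept, marks = select_candidates(8, potentials)
print(kept)   # [(2, 5, 9.0), (5, 7, 2.5), (1, 2, 1.0)]
print(marks)  # [0, 1, 1, 1, 1, 1, 1, 0]
```

The segment `(0, 3, 4.0)` is dropped because its third word is already claimed by the higher-degree segment `(2, 5, 9.0)`, while the shorter segment `(1, 2, 1.0)` survives because its only word is still unmarked when it is visited.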

Because long-tail entities are by definition entities that occur infrequently, the long-tail entity candidate set constructed above usually also contains many infrequent consecutive segments that are not entities. In the medical field, moreover, long-tail entities such as drug names and disease names are often complex entities composed of several words, and dependency relationships also exist inside such entities. For example, in the sentence "the patient is diagnosed with hypothyroidism", "hypothyroidism" is a disease entity; within the entity there is a noun modification relationship between "thyroid" and "disorder", and there is also a noun modification relationship between "hypothyroidism" and "disorder". The confidence of the candidates in the long-tail entity candidate set therefore needs to be confirmed further by combining context information:

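To make use of the contextual semantic information of an entity, the text of each sentence is converted into semantic vectors that are convenient to compute with. Each sentence is taken as input, and a pre-trained GloVe word vector model converts every word of the sentence into a semantic vector representing its semantic information. The GloVe word vector model is a known technique and is not described further here. The long-tail entity confidence is expressed as follows:

$$Z_{s,r} = \cos\left(\frac{1}{n}\sum_{j=1}^{n} v_j,\ \frac{1}{m_{s,r}}\sum_{k=1}^{m_{s,r}} u_{s,r,k}\right)$$

where $v_j$ represents the semantic vector of the j-th word in the sentence, $n$ represents the word sequence length of the sentence, $m_{s,r}$ represents the length of the r-th candidate long-tail entity in the long-tail entity candidate set s of the sentence, $u_{s,r,k}$ represents the semantic vector of the k-th word of that candidate, $\cos(\cdot,\cdot)$ represents the cosine similarity of its two arguments, and $Z_{s,r}$ represents the long-tail entity confidence of the r-th candidate long-tail entity.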

When the cosine similarity between the average semantic vector of the candidate long-tail entity and the average semantic vector of the sentence is higher, the semantic of the candidate long-tail entity is more consistent with that of the current sentence, namely, the entity confidence of the candidate long-tail entity is higher, otherwise, the semantic of the candidate long-tail entity is more inconsistent with that of the current sentence, namely, the confidence of the candidate long-tail entity is lower; the candidate long-tail entity with entity confidence higher than the threshold is taken as the long-tail entity set ST of sentences, and in this embodiment, the threshold is set to 0.5.
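The confidence check can be illustrated with plain Python lists standing in for word vectors. In this sketch the two-dimensional vectors are invented for the example (a real system would look them up in a pre-trained GloVe model), and the function names are hypothetical; only the cosine of the two average vectors and the threshold of 0.5 come from the description.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def entity_confidence(sentence_vectors, entity_vectors):
    """Cosine between the average sentence vector and the average entity vector."""
    return cosine(mean_vector(sentence_vectors), mean_vector(entity_vectors))


def filter_entities(sentence_vectors, candidates, threshold=0.5):
    """candidates maps a candidate name to the vectors of its words."""
    return [name for name, vectors in candidates.items()
            if entity_confidence(sentence_vectors, vectors) > threshold]


# Hypothetical 2-dimensional vectors stand in for GloVe embeddings.
sentence = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
candidates = {
    "aligned": [[1.0, 1.0], [1.0, 0.0]],
    "opposed": [[-1.0, 0.0], [0.0, -1.0]],
}
print(filter_entities(sentence, candidates))  # ['aligned']
```

The candidate whose average vector points in roughly the same direction as the sentence average passes the threshold, while the one pointing the opposite way is rejected.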

Thus, a long tail entity set of each sentence is obtained.

Link entity acquisition module. Because long-tail entities occur infrequently in the data set, the model cannot learn enough knowledge about them, which affects the performance of the final intelligent diagnosis assisting system. Additional information about each such entity, for example its introductory explanation, therefore needs to be acquired. To this end, entity linking is performed on the obtained long-tail entities, which semantically enhances them and makes the meaning they express richer. In addition, the drug names and disease names obtained from the intelligent dialogue diagnosis and treatment data set may be aliases that do not appear in the knowledge graph DrugBank, so entity links need to be established between the entities in DrugBank and these aliases to complete the enhancement.

And integrating the long-tail entity sets of all sentences in the intelligent dialogue diagnosis and treatment data set into one set, and recording the set as a final long-tail entity set.

According to the distribution condition of the entities in the intelligent dialogue diagnosis and treatment data set, calculating the normalized Google distance between each long tail entity and the node entity in the knowledge graph, wherein the formula is as follows:

$$NGD_{p,q} = \frac{\max(\log f_p, \log f_q) - \log f_{p,q}}{\log F - \min(\log f_p, \log f_q)}$$

where $f_p$ represents the number of sentences in the intelligent dialogue diagnosis and treatment data set in which the p-th long tail entity of the final long tail entity set occurs; $f_q$ represents the number of sentences in the intelligent dialogue diagnosis and treatment data set in which the q-th node entity in the knowledge graph occurs; $f_{p,q}$ represents the number of sentences in the intelligent dialogue diagnosis and treatment data set in which the q-th node entity in the knowledge graph and the p-th long tail entity of the final long tail entity set occur simultaneously; $NGD_{p,q}$ represents the normalized Google distance between the q-th node entity in the knowledge graph and the p-th long tail entity of the final long tail entity set; and F represents the total number of sentences in the intelligent dialogue diagnosis and treatment data set.
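A minimal sketch of the normalized Google distance computed from sentence counts is shown below. It uses the standard definition of the distance; the handling of entities that never co-occur (returning infinity), the substring-based sentence matching and the example sentences are assumptions added for the illustration.

```python
import math


def normalized_google_distance(f_p, f_q, f_pq, total):
    """Standard NGD from sentence counts; inf when the entities never co-occur."""
    if f_p == 0 or f_q == 0 or f_pq == 0:
        return math.inf
    log_p, log_q, log_pq = math.log(f_p), math.log(f_q), math.log(f_pq)
    denominator = math.log(total) - min(log_p, log_q)
    if denominator == 0:  # both entities occur in every sentence
        return 0.0
    return (max(log_p, log_q) - log_pq) / denominator


def sentence_counts(sentences, entity_p, entity_q):
    """Number of sentences containing entity_p, entity_q, and both."""
    f_p = sum(entity_p in s for s in sentences)
    f_q = sum(entity_q in s for s in sentences)
    f_pq = sum(entity_p in s and entity_q in s for s in sentences)
    return f_p, f_q, f_pq


# Hypothetical sentences; entities are matched as plain substrings.
sentences = [
    "patient takes paracetamol for fever",
    "acetaminophen also called paracetamol",
    "fever and cough",
    "no medication",
]
f_p, f_q, f_pq = sentence_counts(sentences, "acetaminophen", "paracetamol")
distance = normalized_google_distance(f_p, f_q, f_pq, len(sentences))
print(f_p, f_q, f_pq, distance)  # counts 1, 2, 1 give ln 2 / ln 4
```

A smaller distance means the two entities tend to appear in the same sentences; entities that never appear together are treated here as infinitely far apart.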

However, the normalized Google distance measures semantic relatedness only through the distribution of entity words and cannot fully capture the semantic relationship between them. Therefore, on the basis of the normalized Google distance, the semantic space enhanced similarity between long-tail entities and node entities is calculated by additionally using semantic vectors.

The formula combines the following quantities: the semantic vector of the p-th long-tail entity in the final long-tail entity set; the semantic vector of the q-th node entity in the knowledge graph; the cosine similarity of these two vectors; and the normalized Google distance between the q-th node entity in the knowledge graph and the p-th long-tail entity of the final long-tail entity set. Its result is the semantic space enhanced similarity between the q-th node entity in the knowledge graph and the p-th long-tail entity of the final long-tail entity set. The semantic vectors of the entities are obtained using a word vector model.

For a node entity and a long-tail entity, a smaller normalized Google distance indicates a stronger correlation between the two entities and therefore a larger semantic space enhanced similarity; a larger distance indicates a weaker correlation and a smaller semantic space enhanced similarity. Likewise, a larger cosine similarity indicates that the two entities are semantically closer, giving a larger semantic space enhanced similarity; a smaller cosine similarity indicates that they are semantically further apart, giving a smaller semantic space enhanced similarity.
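The behaviour described above can be illustrated with a short Python sketch. The combining function used here, a cosine similarity rescaled to [0, 1] and divided by one plus the normalized Google distance, is a hypothetical choice made only because it satisfies both monotonic properties; it is not the formula of the invention. The function names and example vectors are likewise assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector has no similarity to anything.
        return 0.0
    return dot / (norm_u * norm_v)


def enhanced_similarity(vec_p: Sequence[float], vec_q: Sequence[float], ngd: float) -> float:
    """HYPOTHETICAL combination of cosine similarity and NGD.

    The score grows with the cosine similarity and shrinks as the
    normalized Google distance grows, as the text requires.
    """
    cos01 = (1.0 + cosine_similarity(vec_p, vec_q)) / 2.0  # rescale to [0, 1]
    return cos01 / (1.0 + ngd)


# Illustrative two-dimensional "semantic vectors".
close = enhanced_similarity([1.0, 0.0], [1.0, 1.0], ngd=0.2)
far = enhanced_similarity([1.0, 0.0], [1.0, 1.0], ngd=0.8)
```

With the same pair of vectors, `close` exceeds `far` because only the distance differs. Any other function that is increasing in the cosine term and decreasing in the distance term would show the same qualitative behaviour.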

At this point, the semantic space enhanced similarity between every long-tail entity in the long-tail entity set and every node entity in the knowledge graph has been obtained. For each long-tail entity, the node entity with the greatest semantic space enhanced similarity is selected from the knowledge graph as the link entity of that long-tail entity. A link relation is thus established between each long-tail entity and its selected link entity, so that subsequent tasks can use the information in the knowledge graph to enrich the semantic understanding of the long-tail entity or to provide more context information.

Specifically, the semantic space enhanced similarity between each long-tail entity and the node entities in the knowledge graph database DrugBank is calculated, and the DrugBank entity with the greatest semantic space enhanced similarity to the long-tail entity is taken as its link entity. For the long-tail entities in each sentence, when the knowledge graph based on inquiry information is constructed, the entity description information of the corresponding link entity is used as a supplementary description of the long-tail entity word. This alleviates the lack of description information caused by the low occurrence frequency of long-tail entity words.

Thus, the link entity of each long tail entity is obtained.
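The selection step can be sketched as an arg-max over precomputed similarity scores. This is a minimal sketch; the function name, the nested-dictionary layout, and all entity names and scores are hypothetical placeholders.

```python
from typing import Dict, Mapping


def select_link_entities(
    similarity: Mapping[str, Mapping[str, float]]
) -> Dict[str, str]:
    """Map each long-tail entity to the node entity with the highest score.

    similarity[long_tail][node] holds the semantic space enhanced similarity.
    Long-tail entities with no scored node entity are left unlinked.
    """
    links: Dict[str, str] = {}
    for long_tail, scores in similarity.items():
        if scores:
            links[long_tail] = max(scores, key=scores.get)
    return links


# Placeholder scores for illustration only.
similarity = {
    "heart pill": {"nitroglycerin": 0.71, "aspirin": 0.32},
    "fever syrup": {"ibuprofen": 0.64, "nitroglycerin": 0.05},
    "unknown herb": {},
}
links = select_link_entities(similarity)
```

Ties are resolved in favour of the first node entity encountered, which is a property of Python's `max` rather than a rule stated in the source.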

The intelligent diagnosis assisting system construction module first performs named entity recognition on the intelligent dialogue diagnosis and treatment data set using a BERT-BiLSTM-CRF model to obtain the entities in the text; named entity recognition is a known technique and is not described further here. A CasRel relation extraction model, which learns the relations between entities, is then used to obtain those relations. The result is a graph structure over the entities, in which the nodes are the entities and their related attributes and the edges represent the connection between two entities. This graph structure is the knowledge graph based on inquiry analysis, and it is stored in a Neo4j graph database. At this stage, the knowledge graph based on inquiry analysis contains only the link relations between entities; the attributes of the entities have not yet been obtained.
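As an illustration of the storage step, the sketch below turns already extracted (head, relation, tail) triples into parameterized Cypher `MERGE` statements. It assumes the entity recognition and relation extraction models have already produced the triples and does not run either model or connect to a database. The `Entity` label, the `name` property, the function names, and the example triples are hypothetical choices.

```python
import re
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def relation_type(relation: str) -> str:
    """Turn a free-text relation label into a safe relationship type name."""
    cleaned = re.sub(r"\W+", "_", relation.strip()).strip("_").upper()
    return cleaned or "RELATED_TO"  # fallback for empty labels (assumption)


def triples_to_cypher(triples: Iterable[Triple]) -> List[Tuple[str, Dict[str, str]]]:
    """Build one (query, parameters) pair per triple.

    Entity names are passed as parameters; the relationship type cannot be
    parameterized in Cypher, so it is sanitized and quoted with backticks.
    """
    statements = []
    for head, relation, tail in triples:
        query = (
            "MERGE (h:Entity {name: $head}) "
            "MERGE (t:Entity {name: $tail}) "
            f"MERGE (h)-[:`{relation_type(relation)}`]->(t)"
        )
        statements.append((query, {"head": head, "tail": tail}))
    return statements


# Placeholder triples for illustration only.
statements = triples_to_cypher([
    ("aspirin", "treats", "headache"),
    ("aspirin", "side effect of", "stomach upset"),
])
```

Each pair could then be handed to a Neo4j client for execution. `MERGE` is used instead of `CREATE` so that an entity appearing in several triples yields a single node.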

The attributes of the entities are stored in the knowledge graph database DrugBank, and each long-tail entity has a corresponding link entity in DrugBank. In the knowledge graph based on inquiry analysis, each long-tail entity is located and assigned the attributes of its corresponding link entity. All long-tail entities that correspond to the same link entity are treated as one class, and the nodes within a class share the same attributes and associations.
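The attribute assignment and grouping can be sketched as follows. This is a minimal sketch using plain dictionaries in place of the graph database; the function name, the data layout, and the example attributes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Tuple

Attributes = Dict[str, str]


def propagate_attributes(
    links: Mapping[str, str],
    link_attributes: Mapping[str, Attributes],
) -> Tuple[Dict[str, Attributes], Dict[str, List[str]]]:
    """Copy each link entity's attributes onto its long-tail entities.

    Returns the enriched long-tail entities and the classes formed by
    grouping long-tail entities that share the same link entity.
    """
    enriched: Dict[str, Attributes] = {}
    classes: Dict[str, List[str]] = defaultdict(list)
    for long_tail, link in links.items():
        # Copy so that later edits to one node do not affect the others.
        enriched[long_tail] = dict(link_attributes.get(link, {}))
        classes[link].append(long_tail)
    return enriched, dict(classes)


# Placeholder data for illustration only.
links = {"heart pill": "nitroglycerin", "chest tablet": "nitroglycerin"}
link_attributes = {"nitroglycerin": {"category": "vasodilator"}}
enriched, classes = propagate_attributes(links, link_attributes)
```

Here both long-tail entities fall into the class keyed by their shared link entity and receive identical attributes, which matches the statement that nodes in one class have the same attributes.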

Finally, the intelligent diagnosis assisting system is developed on the basis of the knowledge graph. The system comprises a user interface and a background service. Through the user interface, the user enters a description of an inquiry question, and the system passes the description to the background program for processing. The background program extracts the core entities from the user's question, matches them against the entities in the knowledge graph, and infers the corresponding opinions and suggestions. The inference result is then returned to the user through the user interface. The implementation flow chart of the intelligent diagnosis assisting system is shown in Fig. 2.
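The background flow of extracting entities, matching them in the graph, and returning a result can be sketched with a toy in-memory graph. Everything here is an assumption made for illustration: the dictionary layout, the substring-based entity extraction standing in for the entity recognition model, the function names, and the placeholder graph content.

```python
from typing import Dict, List

Graph = Dict[str, Dict[str, List[str]]]  # entity -> relation -> neighbour entities

# Placeholder graph content for illustration only.
GRAPH: Graph = {
    "headache": {"treated by": ["aspirin", "ibuprofen"]},
    "fever": {"treated by": ["ibuprofen"]},
}


def extract_entities(question: str, graph: Graph) -> List[str]:
    """Stand-in for entity recognition: graph entities mentioned in the text."""
    text = question.lower()
    return [entity for entity in graph if entity in text]


def answer(question: str, graph: Graph) -> List[str]:
    """Match the question's entities in the graph and list their relations."""
    lines = []
    for entity in extract_entities(question, graph):
        for relation, targets in graph[entity].items():
            lines.append(f"{entity} {relation}: {', '.join(targets)}")
    return lines


result = answer("I have a headache, what should I take?", GRAPH)
```

A real deployment would replace the dictionary lookup with queries against the stored knowledge graph and the substring match with the trained entity recognition model.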

It should be noted that the order of the embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The embodiments in this specification are described in a progressive manner. For parts that are identical or similar across embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others.

Claims

1. The intelligent diagnosis assisting system based on the inquiry analysis is characterized by comprising the following modules:

2. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 1, wherein the method for decomposing each sentence in the intelligent dialogue diagnosis and treatment data set into a plurality of word sequence segments is as follows:

3. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 1, wherein the method for obtaining the probability of occurrence of a word sequence segment in the intelligent dialogue diagnosis and treatment data set according to the number of occurrences of the word sequence segment is as follows:

4. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 1, wherein the method for obtaining the potential long-tail entity degree of a word sequence segment according to the dependency relationships among words and the probability of occurrence of the word sequence segment in the intelligent dialogue diagnosis and treatment data set is as follows:

In the formula, two terms each denote the probability that a word sequence segment occurs in the intelligent dialogue diagnosis and treatment data set; a further term denotes the dependency relationship between the i-th word and the j-th word, taking the value 1 if a dependency relationship exists and 0 if it does not; N denotes the length of the word sequence segment; and the result denotes the potential long-tail entity degree of the word sequence segment that ends with the i-th word and has length N.

5. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 1, wherein the method for constructing the long-tail entity matching tag sequence according to the number of words in each sentence is as follows:

6. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 5, wherein the method for screening the potential long tail entity to obtain the candidate long tail entity according to the long tail entity matching flag sequence and the potential long tail entity degree is as follows:

7. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 1, wherein the method for obtaining the confidence level of the long-tail entity of the candidate long-tail entity according to the semantic vectors of all the words of the sentence, the semantic vectors of the words in the candidate long-tail entity, the length of the sentence and the length of the candidate long-tail entity is as follows:

In the formula, the terms denote, respectively: the semantic vector of the j-th word in the sentence; the word sequence length of the sentence; the length of the r-th candidate long-tail entity in the long-tail entity candidate set s of the sentence; the semantic vector of the k-th word in the r-th candidate long-tail entity in the long-tail entity candidate set s of the sentence; and the cosine similarity of two vectors. The result denotes the long-tail entity confidence of the r-th candidate long-tail entity in the long-tail entity candidate set s of the sentence.

8. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 1, wherein the method for calculating the normalized Google distance between a long-tail entity and a node entity according to the distribution of the long-tail entity and the node entity in the intelligent dialogue diagnosis and treatment data set is as follows:

9. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 1, wherein the method for obtaining semantic spatial similarity of the long-tail entity and the node entity according to the normalized Google distance of the long-tail entity and the node entity and the semantic vector of each of the long-tail entity and the node entity comprises:

10. The intelligent diagnosis-assisting system based on inquiry analysis as set forth in claim 1, wherein the method for obtaining the link entity of the long-tail entity according to semantic spatial similarity is as follows: