CN106874643B - Method and system for automatically constructing knowledge base to realize auxiliary diagnosis and treatment based on word vectors

Info

Publication number: CN106874643B
Application number: CN201611222893.XA
Authority: CN
Inventors: 张文生; 牛景昊
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2016-12-27
Filing date: 2016-12-27
Publication date: 2020-02-28
Anticipated expiration: 2036-12-27
Also published as: CN106874643A

Abstract

The invention relates to a method and a system for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment. The method may comprise the following steps: obtaining a patient description; performing keyword matching on the patient description by using an expanded disease-disease related factor dictionary established based on word vectors, and extracting medically related words and expressions from the patient description; detecting whether the extracted words and expressions are in a standard disease-disease related factor dictionary; based on the detection result, calculating a score for each disease in combination with the relevance score of each disease-related factor with respect to the disease, obtained from the expanded disease-disease related factor dictionary; ranking the disease scores; and determining the disease according to the ranking result. The invention thereby solves the technical problem of predicting a disease from a patient's spoken description of their condition.

Description

Method and system for automatically constructing knowledge base to realize auxiliary diagnosis and treatment based on word vectors

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a method and a system for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment.

Background

With the rapid development of online doctor-patient question-and-answer websites and mobile application services in the field of internet healthcare, massive numbers of spoken patient descriptions of their conditions and other comprehensive information, paired with the corresponding doctors' diagnoses, form question-answer pairs that together constitute a valuable consultation knowledge base. Because these records are mostly unstructured and contain many non-standard medical terms arising from spoken descriptions, using the data directly poses many challenges. At the same time, online consultations involve a great deal of repetitive work, which wastes valuable physician resources. If an artificial intelligence algorithm could produce a preliminary diagnosis in place of the doctor, consultation efficiency could be greatly improved. The task can be summarized as follows: given a newly input description of a patient's comprehensive information, such as sex, age, symptoms, and disease history, return a predicted disease diagnosis for the patient by using sentence analysis and related algorithms in combination with a pre-constructed domain knowledge graph.

The existing technical solutions mainly fall into the following two categories:

1. Retrieving the question in the question-answer library that is most similar to the patient's description and returning the corresponding doctor's diagnosis. The main problems with such methods are that the disease information appearing in the patient's description is not actually analyzed, that textual similarity cannot fully reflect the similarity of patients' conditions, and that matching accuracy is poor.

2. Having the patient click on items such as symptoms and affected body parts related to their condition, accumulating the scores that experts have pre-assigned to each item for each disease, and finally returning a probability ranking of possible diseases. The problems with such methods are that manual scoring is highly unstable and subjective, and that a large amount of labor and time is consumed when many diseases need to be labeled; in addition, the diagnostic system cannot analyze or use any information other than the selectable symptoms.

In view of the above, the present invention is particularly proposed.

Disclosure of Invention

In order to solve the above problems in the prior art, that is, to solve the technical problem of how to predict a disease from a patient's spoken description of their condition, the embodiment of the present invention provides a method for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment. In addition, the embodiment of the invention also provides a system for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment.

In order to achieve the above object, according to one aspect of the present invention, the following technical solutions are provided:

A method for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment, comprising the following steps:

obtaining a patient description;

performing keyword matching on the patient description by using an expanded disease-disease related factor dictionary established based on the word vector, and extracting words and expressions related to medicine in the patient description;

detecting whether the extracted words and expressions are in a standard disease-disease related factor dictionary;

calculating a score of the disease in combination with a correlation score of the disease-related factor obtained from the expanded disease-related factor dictionary with respect to the disease based on the detection result;

ranking scores of diseases;

and determining the diseases according to the sequencing result.

Further, the expanded disease-disease related factor dictionary may be built by:

training a word vector embedding distributed representation model about the disease-disease related factor using the medical information;

and embedding a distributed representation model based on the word vector, expanding the standard disease-disease related factor dictionary by using a distance measurement method, and establishing an expanded disease-disease related factor dictionary.

Further, training the word vector embedding distributed representation model about the disease-disease related factor by using the medical information may specifically include:

acquiring a medical information training corpus;

cleaning the medical information training corpus;

counting high-frequency expression modes appearing in the records of the question-answering library, increasing the weight of the high-frequency expression modes in the word segmentation model, and performing Chinese word segmentation to obtain a training text;

training the training text to generate a word vector embedded distributed representation model.

Further, the relevance score of a disease-associated factor to a disease can be determined by:

embedding a distributed expression model based on the word vector, expanding a standard disease-disease related factor dictionary by using a distance measurement method, and establishing a replacement word list;

matching the disease-disease related factors in the medical information using the expanded disease-disease related factor dictionary and the replacement vocabulary, and calculating a relevance score of the disease related factors corresponding to the disease.

Further, matching the disease-disease related factors in the medical information by using the expanded disease-disease related factor dictionary and the replacement word list, and calculating a relevance score of the disease-related factor with respect to the disease, may specifically include:

matching keywords with the doctor-patient question-answer records by using the expanded disease-disease related factor dictionary, and extracting medical related words and expressions in the doctor-patient question-answer records;

detecting whether the words and expressions related to medicine in the extracted doctor-patient question-answer records are in a standard disease-disease related factor dictionary or not;

if not, normalizing the extracted medical related words and expressions in the doctor-patient question-answer record into corresponding standard expressions according to the replacement word list;

counting the frequency of the co-occurrence of the diseases and the related factors thereof based on the standard expression to obtain a co-occurrence frequency recording matrix of the disease related factors and the diseases;

and obtaining a correlation score of the disease-related factors corresponding to the diseases by using a nonlinear transformation method based on the co-occurrence frequency recording matrix of the disease-related factors and the diseases.

Further, in the method:

the step of detecting whether the extracted words and expressions are in the standard disease-disease related factor dictionary may specifically include:

if not, normalizing the extracted words and expressions to the corresponding standard expressions according to the replacement word list to obtain standardized disease-related factors;

the step of calculating the score of the disease, based on the detection result, in combination with the relevance score of the disease-related factor with respect to the disease obtained from the expanded disease-disease related factor dictionary may specifically include:

calculating the score of the disease based on the standardized disease-related factors, in combination with the relevance score of the disease-related factor with respect to the disease obtained from the expanded disease-disease related factor dictionary.

Further, the relevance score of a disease-associated factor to a disease can be determined by the following formula:

where $\mathrm{Score}(i, j)$ denotes the relevance score of the disease-related factor with respect to the disease; $P(D_i \mid F_j)$ denotes the conditional probability of having the disease; $D_i$ denotes a disease; $F_j$ denotes a disease-related factor; $N_i$ denotes the frequency of the disease, with $N_i = \sum_j N_{ij}$; and $N_{ij}$ denotes the recorded frequency.
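The quantities named in this definition can be illustrated with a short Python sketch. It computes the disease frequency as the row sum of a co-occurrence matrix, as the text defines it, and estimates the conditional probability of a disease given a factor as a relative frequency. The relative-frequency estimate and the function names are assumptions made for illustration; the text does not state how the conditional probability is obtained, and the relevance score formula itself is not reproduced here.

```python
def disease_frequencies(N):
    """Return N_i for each disease, where N_i is the sum over j of N[i][j]."""
    return [sum(row) for row in N]


def conditional_probabilities(N):
    """Estimate P(D_i | F_j) as N[i][j] divided by the column total for F_j.

    Assumption: this relative-frequency estimate is illustrative only; the
    source does not say how the conditional probability is computed.
    """
    m = len(N)
    n = len(N[0]) if N else 0
    column_totals = [sum(N[i][j] for i in range(m)) for j in range(n)]
    return [
        [N[i][j] / column_totals[j] if column_totals[j] else 0.0 for j in range(n)]
        for i in range(m)
    ]
```

Rows index diseases and columns index disease-related factors, matching the m by n co-occurrence frequency matrix described in the method.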

Further, the score of the disease can be obtained by the following formula:

where $DS(D_i)$ denotes the score of the disease; $D_i$ denotes a disease; $W(F_j)$ denotes the mapping weight of the disease category; and $\mathrm{Score}(i, j)$ denotes the relevance score of the disease-related factor with respect to the disease.

In order to achieve the above object, according to another aspect of the present invention, the following technical solutions are also provided:

A system for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment, which may comprise:

an acquisition module for acquiring a patient description;

the extraction module is used for performing keyword matching on the patient description by utilizing the expanded disease-disease related factor dictionary established based on the word vector, and extracting words and expressions related to medicine in the patient description;

the detection module is used for detecting whether the extracted words and expressions are in a standard disease-disease related factor dictionary or not;

a calculation module for calculating a score of the disease based on the detection result in combination with a correlation score of the disease-related factor obtained from the expanded disease-related factor dictionary with respect to the disease;

the sorting module is used for sorting scores of diseases;

and the determining module is used for determining the diseases according to the sequencing result.

Further, the extraction module may further specifically include:

a word vector model building unit for training a word vector embedding distributed representation model about the disease-disease related factors using the medical information;

and the extended dictionary establishing unit is used for embedding the distributed representation model based on the word vector, expanding the standard disease-disease related factor dictionary by using a distance measurement method and establishing an extended disease-disease related factor dictionary.

Further, the word vector model establishing unit may specifically include:

the acquisition unit is used for acquiring the medical information training corpus;

the cleaning unit is used for cleaning the medical information training corpus;

the first statistical unit is used for counting high-frequency expression modes appearing in the records of the question-answering library, increasing the weight of the high-frequency expression modes in the word segmentation model, and performing Chinese word segmentation to obtain a training text;

and the generating unit is used for training the training text and generating a word vector embedded distributed representation model.

Further, the calculation module may further specifically include:

the first replacement word list establishing unit is used for embedding a distributed representation model based on the word vectors, expanding a standard disease-disease related factor dictionary by using a distance measurement method and establishing a replacement word list;

and a correlation score calculation unit for matching the disease-disease related factors in the medical information using the expanded disease-disease related factor dictionary and the replacement word list, and calculating a correlation score of the disease related factors corresponding to the disease.

Further, the correlation score calculating unit may specifically include:

the extraction unit is used for matching keywords with the doctor-patient question-answer records by utilizing the expanded disease-disease related factor dictionary and extracting medically related words and expressions in the doctor-patient question-answer records;

the detection unit is used for detecting whether the words and expressions related to the medicine in the extracted doctor-patient question-answer records are in a standard disease-disease related factor dictionary or not;

the first normalization unit is used for normalizing the medically related words and expressions extracted from the doctor-patient question-answer records into the corresponding standard expressions according to the replacement word list when the words and expressions are not in the standard disease-disease related factor dictionary;

the second statistical unit is used for counting the frequency of the co-occurrence of the diseases and the related factors thereof based on the standard expression to obtain a co-occurrence frequency recording matrix of the disease related factors and the diseases;

and the nonlinear transformation unit is used for obtaining the correlation score of the disease-related factor corresponding to the disease by using a nonlinear transformation method based on the co-occurrence frequency recording matrix of the disease-related factor and the disease.

Further, the system comprises:

the second replacement word list establishing unit is used for embedding a distributed representation model based on the word vectors, expanding the standard disease-disease related factor dictionary by using a distance measurement method and establishing a replacement word list;

the detection module may specifically include:

the second normalization unit is used for normalizing the extracted words and expressions to corresponding standard expressions according to the replacement word list to obtain standardized disease related factors when the extracted words and expressions are not in the standard disease-disease related factor dictionary;

the calculating module may specifically include:

and the disease score calculating unit is used for calculating the score of the disease based on the standardized disease related factors and the relevance scores of the disease related factors corresponding to the disease, which are obtained according to the expanded disease-disease related factor dictionary.

The embodiment of the invention provides a method and a system for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment. The method may comprise the following steps: obtaining a patient description; performing keyword matching on the patient description by using an expanded disease-disease related factor dictionary established based on word vectors, and extracting medically related words and expressions from the patient description; detecting whether the extracted words and expressions are in a standard disease-disease related factor dictionary; based on the detection result, calculating a score for each disease in combination with the relevance score of each disease-related factor with respect to the disease, obtained from the expanded disease-disease related factor dictionary; ranking the disease scores; and determining the disease according to the ranking result. The embodiment of the invention uses distributed word vector representations trained for the medical field to establish an expanded disease-disease related factor keyword dictionary. It can learn from multi-source medical information, including general medical data and spoken online doctor-patient question-and-answer records, to construct a disease knowledge graph, and it can analyze and process non-standard, spoken patient descriptions of their conditions, thereby solving the technical problem of predicting a disease from a patient's spoken description.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a method for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a system for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment according to an embodiment of the present invention.

Detailed Description

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.

The basic idea of the embodiment of the invention is to use word vector embedding to generate distributed representations of the diseases and disease-related factors found in general medical information, online spoken doctor-patient question-and-answer records, and patient case databases, to automatically construct a knowledge graph of diseases and disease-related factors, and thereby to realize auxiliary diagnosis from a patient's spoken description of their condition.

The terms or definitions to be explained are as follows:

Disease-related factors: various factors that may cause a disease, help in judging it, or carry information about it, such as disease symptoms, disease history, age, disease signs, and sex.

Word vector embedding: using the "distributed representation" approach, a word (or phrase) is represented by a continuous, low-dimensional real-valued vector (for example, fewer than 1000 dimensions), so that words can be distinguished and represented by their vectors and natural language processing tasks such as text classification and relation extraction can be carried out.

Co-occurrence frequency: the appearance of two or more words or concepts together in the same passage or document is called a co-occurrence; the co-occurrence frequency is the number of such co-occurrences counted over all documents in a representative collection.

The embodiment of the invention provides a method for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment. As shown in fig. 1, the method may include:

S100: A patient description is obtained.

S110: Keyword matching is performed on the patient description by using the expanded disease-disease related factor dictionary established based on word vectors, and medically related words and expressions in the patient description are extracted.

Wherein the expanded disease-related factor dictionary is established through steps S112 to S114.
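A minimal Python sketch of the keyword matching in step S110 follows. It assumes a greedy, longest-match scan over the characters of the description, which suits unsegmented Chinese text; the source does not specify the matching algorithm, and the function name `extract_expressions` is hypothetical.

```python
def extract_expressions(description, expanded_dictionary):
    """Scan the description left to right and collect dictionary entries.

    Assumption: at each position the longest matching entry wins, so that
    "high fever" is preferred over the shorter entry "fever".
    """
    terms = sorted(expanded_dictionary, key=len, reverse=True)
    found = []
    i = 0
    while i < len(description):
        for term in terms:
            if term and description.startswith(term, i):
                found.append(term)
                i += len(term)  # continue after the matched expression
                break
        else:
            i += 1  # no entry starts here; move one character forward
    return found
```

Because the scan works on characters rather than tokens, it can match inside longer words in space-delimited languages; a tokenized variant would be more appropriate for such text.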

S112: the word vector embedding distributed representation model for the disease-disease related factor is trained using medical information.

The medical information includes, but is not limited to, general medical information, doctor-patient question-and-answer records, patient cases, text data related to diseases and disease-related factors, and the like. General medical information includes, but is not limited to, medical literature (e.g., medical papers and medical patent literature) and textbooks (especially medical textbooks).

Preferably, the doctor-patient question-answer records are online doctor-patient spoken question-answer records.

Specifically, step S112 may include:

S1121: A medical information training corpus is acquired.

The medical information corpus may include, but is not limited to, a question and answer library, a medical textbook, a case library, and the like.

S1122: The medical information training corpus is cleaned.

The purpose of this step is to remove meaningless characters.
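The cleaning step can be sketched in Python as follows. The source does not enumerate which characters count as meaningless, so this sketch assumes they are markup tags, control characters, and redundant whitespace; the function name `clean_text` is hypothetical.

```python
import re

# Assumption: "meaningless characters" covers markup tags, control characters,
# and runs of whitespace. The source does not list them.
_TAG = re.compile(r"<[^>]+>")
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw):
    """Replace tags with spaces, drop control characters, collapse whitespace."""
    text = _TAG.sub(" ", raw)
    text = _CONTROL.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()
```

For instance, `clean_text("<p>I have a fever</p>")` returns `"I have a fever"`.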

S1123: High-frequency expression patterns appearing in the records of the question-answer library are counted, the weight of these patterns in the word segmentation model is increased, and Chinese word segmentation is performed to obtain a training text.

S1124: training the training text to generate a word vector embedded distributed representation model.

During training, the corpora that can be used include, but are not limited to, online doctor-patient question-and-answer records, patient medical records, and textbooks. The embodiment of the invention trains a word vector embedding representation model for the medical field by using, for example but without limitation, the open-source word2vec tool (https://github.com/danielfrg/word2vec) proposed by Tomas Mikolov, and stores the model in the knowledge base. A neural network or another training algorithm may be used during training. For methods related to word vector training, see application nos. 201610179115.0 and 201510096570.X, which are hereby incorporated by reference. Experiments reported in the representation learning literature show that larger training corpora yield better word vectors.

For example, in practical applications, 300-dimensional word vectors for hundreds of thousands of expressions may be trained on the segmented and cleaned text data, where high-frequency words are represented as:

<tumor: 0.176907, 0.470268, -0.008468, … (300 dimensions in total)>

<blood sugar: 0.149234, 0.278761, -0.474681, … (300 dimensions in total)>

<fever: 0.184283, 0.046142, -0.107758, … (300 dimensions in total)>

<high fever: 0.204092, 0.089622, 0.0057266, … (300 dimensions in total)>

<high fever: 0.366153, 0.314256, 0.073571, … (300 dimensions in total)>

In some optional implementations, the step of training on the training text may further include: representing the high-frequency words in the training text as low-dimensional real-valued vectors.

The low dimension may be set according to actual conditions, and may be set to be less than 1000 dimensions, for example.

S114: Based on the word vector embedding distributed representation model, the standard disease-disease related factor dictionary is expanded by using a distance measurement method to establish the expanded disease-disease related factor dictionary.

It should be clear to those skilled in the art that in the process of establishing the expanded disease and disease related factor dictionary, a replacement word list can also be established, that is, the standard disease-disease related factor dictionary is expanded by using a distance measurement method based on the word vector embedded distributed representation model, and the replacement word list is established.

The standard disease-disease related factor dictionary is constructed and maintained by medical experts for the specific prediction task. It is a collection of diseases and disease-related factors that medical experts compile, correct, and maintain with reference to authoritative textbooks and established standards. As a set of standard terms for diseases and disease-related factors, it must be collated and maintained in light of the specific diseases to be predicted and related information such as disease symptoms, disease history, age, disease signs, and sex. For example, heart disease and depression may be two elements of the standard disease dictionary (set), while insomnia and a history of diabetes may be two elements of the standard disease-related factor dictionary (set).

Distance measurement methods include, but are not limited to, cosine distance, Euclidean distance, or other distance measurement methods.

For each element in the standard disease-disease related factor dictionary, a distance measurement method is used to find the k nearest words or phrase expressions in the word vector vocabulary, and these are recorded as replacements for that element. This creates a replacement word list that maps non-standard expressions to standard expressions and, at the same time, the expanded disease-disease related factor dictionary of the knowledge base: each replacement is added to the original standard disease-disease related factor dictionary to form the expanded dictionary. Here, k is a parameter that can be adjusted for the specific task and data.
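The expansion procedure can be sketched in Python with cosine similarity as the distance measure. The function names, the plain dictionary of vectors, and the choice to exclude standard terms from the candidate neighbors are assumptions made for illustration; the source does not prescribe them.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_dictionary(standard_terms, vectors, k):
    """Return (expanded_dictionary, replacement_list).

    replacement_list maps each non-standard expression to the set of
    standard terms for which it is one of the k nearest neighbors.
    Assumption: standard terms are never used as replacements.
    """
    standard = set(standard_terms)
    expanded = set(standard)
    replacement_list = {}
    for term in sorted(standard):
        if term not in vectors:
            continue  # no embedding was trained for this standard term
        candidates = [w for w in vectors if w not in standard]
        candidates.sort(
            key=lambda w: cosine_similarity(vectors[term], vectors[w]),
            reverse=True,
        )
        for word in candidates[:k]:
            expanded.add(word)
            replacement_list.setdefault(word, set()).add(term)
    return expanded, replacement_list
```

With larger k, one expression may end up as a replacement for several standard terms, which is why the replacement list stores a set of standard terms per expression.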

The following describes in detail the process of obtaining the expanded disease-disease related factor dictionary and the replacement word list, taking the standard disease-related factor "fever" as an example in a preferred embodiment. The process specifically includes steps A1 to A3.

Step A1: The words or phrases closest to "fever" are computed using the cosine distance with the parameter k set to 2, yielding two expressions, both given as "high fever".

Step A2: Both expressions are added to the expanded disease-disease related factor dictionary, and the replacement word list of the standard disease-related factor "fever" is recorded as containing both of them.

Step A3: Using the trained medical-field word vector embedding distributed representation model, the same operation is performed on each element in the standard disease-disease related factor dictionary to obtain the expanded disease-disease related factor dictionary and the replacement word list.

S120: Whether the extracted words and expressions are in the standard disease-disease related factor dictionary is detected.

In this step, if the extracted words and expressions are found in the standard disease-disease related factor dictionary, they are not processed; if not, they are normalized to the corresponding standard expressions according to the replacement word list to obtain standardized disease-related factors. The replacement word list is built by expanding the standard disease-disease related factor dictionary with a distance measurement method, based on the word vector embedding distributed representation model.

"Not processed" here means that the subsequent processing uses the standard disease-related factors already present in the standard disease-disease related factor dictionary.
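The detection and normalization in step S120 can be sketched in Python as follows. For simplicity the sketch assumes that each non-standard expression maps to exactly one standard expression and that expressions found in neither structure are dropped; both are illustrative assumptions, and the function name `normalize` is hypothetical.

```python
def normalize(expressions, standard_dictionary, replacement_list):
    """Map extracted expressions to standard disease-related factors.

    Assumption: replacement_list maps a non-standard expression to a single
    standard expression, and unknown expressions are silently dropped.
    """
    normalized = []
    for expression in expressions:
        if expression in standard_dictionary:
            normalized.append(expression)  # already standard: left as-is
        elif expression in replacement_list:
            normalized.append(replacement_list[expression])
    return normalized
```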

S130: Based on the detection result, the score of the disease is calculated in combination with the relevance score of the disease-related factor with respect to the disease, obtained from the expanded disease-disease related factor dictionary.

In this embodiment, when the extracted words and expressions are not detected in the standard disease-disease related factor dictionary, the extracted words and expressions are normalized to the corresponding standard expressions according to the replacement word list to obtain the standardized disease related factors; calculating a score for the disease based on the normalized disease-related factors in combination with a relevance score for the disease based on the disease-related factors derived from the expanded disease-related factor dictionary. When the extracted words and expressions are detected in the standard disease-related factor dictionary, scores of diseases are calculated using the standardized disease-related factors in the standard disease-related factor dictionary in combination with the correlation scores of the disease-related factors corresponding to the diseases obtained from the expanded disease-related factor dictionary.
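The scoring and ranking can be sketched in Python under a clearly stated assumption: the disease score is taken to be a weighted sum of the relevance scores of the patient's standardized factors. The source's own disease score formula is not reproduced in this text, so the weighted sum, the default weight of 1.0, and the function name `rank_diseases` are illustrative choices only.

```python
def rank_diseases(patient_factors, diseases, factors, score, weight=None):
    """Return (disease, total) pairs sorted from highest to lowest total.

    score[i][j] is the relevance score of factors[j] for diseases[i].
    Assumption: the total is a weighted sum over the patient's factors,
    with weight 1.0 for any factor that has no explicit weight.
    """
    weight = weight or {}
    column = {factor: j for j, factor in enumerate(factors)}
    totals = []
    for i, disease in enumerate(diseases):
        total = sum(
            weight.get(f, 1.0) * score[i][column[f]]
            for f in patient_factors
            if f in column  # factors without a score column are skipped
        )
        totals.append((disease, total))
    return sorted(totals, key=lambda pair: pair[1], reverse=True)
```

The first element of the returned list is the highest-scoring disease, which corresponds to determining the disease from the ranking result.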

The relevance score of a disease-related factor with respect to a disease is determined through steps S132 to S134.

S132: Based on the word vector embedding distributed representation model, the standard disease-disease related factor dictionary is expanded by using a distance measurement method, and a replacement word list is established.

S134: matching the disease-disease related factors in the medical information using the expanded disease-disease related factor dictionary and the replacement vocabulary, and calculating a relevance score of the disease related factors corresponding to the disease.

Specifically, step S134 may include:

S1341: Keyword matching is performed on the doctor-patient question-answer records by using the expanded disease-disease related factor dictionary, and medically related words and expressions in the doctor-patient question-answer records are extracted.

In a preferred embodiment, the step may use the expanded disease-disease related factor dictionary to perform keyword matching on the disease description and diagnosis result in the doctor-patient question-answer library, and extract medically related words and expressions in the doctor-patient question-answer record.

S1342: Whether the medically related words and expressions extracted from the doctor-patient question-answer records are in the standard disease-disease related factor dictionary is detected. If yes, go to step S1343; otherwise, go to step S1344.

The step detects whether the extracted related words and expressions are in a standard disease-disease related factor dictionary one by one, and if so, special treatment is not carried out; if not, normalizing to the corresponding standard expression according to the replacement word list.

S1343: No processing is performed.

This step means that the subsequent processing uses the standard expression in the standard disease-disease related factor dictionary.

S1344: The medically related words and expressions extracted from the doctor-patient question-answer records are normalized into the corresponding standard expressions according to the replacement word list.

Step S1344 may further include: performing the normalization of medically related words and expressions when a word or expression corresponds to a plurality of standard diseases or disease-related factors.

Specifically, when an expression corresponds to a plurality of standard diseases or disease-related factors, the standard related factor closest to the expression is determined and replaces it, yielding the standard disease-related factors corresponding to the patient's description.

As an example, when a word or expression corresponds to more than one standard disease or disease-related factor, a distance such as (but not limited to) the cosine distance or the Euclidean distance is used to find the standard concept closest to it, which then replaces the current expression; this completes the normalization of the medically related words and expressions.

For one patient input, this operation yields Q normalized disease-related factors: {F_1, F_2, ..., F_j, ..., F_Q}.
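The normalization in steps S1342 to S1344 can be sketched as follows, assuming the replacement word list maps a non-standard expression to one or more candidate standard terms and that the Euclidean distance between word vectors decides among several candidates. The names `normalize` and `candidates` and the toy vectors are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalize(expressions, standard_terms, candidates, vectors):
    """Return the standard factors for a list of extracted expressions.

    candidates maps a non-standard expression to a list of standard terms.
    """
    normalized = []
    for expr in expressions:
        if expr in standard_terms:
            normalized.append(expr)  # already standard: nothing to do (S1343)
            continue
        options = candidates.get(expr, [])
        if not options:
            continue  # no known standard form; dropped in this sketch
        # Several candidates: keep the one closest in vector space (S1344).
        best = min(options,
                   key=lambda s: euclidean_distance(vectors[expr], vectors[s]))
        normalized.append(best)
    return normalized

# Toy vectors (hypothetical values).
vectors = {
    "stuffy nose": [0.2, 0.9],
    "nasal obstruction": [0.25, 0.95],
    "rhinitis": [0.6, 0.6],
}
standard = {"fever", "nasal obstruction", "rhinitis"}
candidates = {"stuffy nose": ["nasal obstruction", "rhinitis"]}
print(normalize(["fever", "stuffy nose"], standard, candidates, vectors))
# ['fever', 'nasal obstruction']
```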

S1345: Based on the standard expressions, count how often each disease co-occurs with each of its related factors to obtain the co-occurrence frequency recording matrix of disease-related factors and diseases.

The standard disease-disease related factor dictionary contains two kinds of elements: diseases and disease-related factors. For example, the m diseases are defined as {D_1, D_2, ..., D_i, ..., D_m} and the n disease-related factors as {F_1, F_2, ..., F_j, ..., F_n}, and every N_ij is initialized to zero. For the P records {R_1, R_2, ..., R_s, ..., R_P} in the question-answer library, if D_i and F_j both appear in record R_s, N_ij is increased by 1; that is, one co-occurrence of that disease and that disease-related factor is recorded. Counting over all P records yields the m × n co-occurrence frequency recording matrix of disease-related factors and diseases.

Here, P denotes the number of records in the question-answer library; R_1, R_2, ..., R_s, ..., R_P denote the question-answer library records; and N_ij denotes the recorded co-occurrence frequency.
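The counting procedure can be illustrated with a short sketch. It assumes each record has already been reduced to a set of standard diseases and a set of standard disease-related factors; the function name and the two sample records are hypothetical.

```python
def cooccurrence_matrix(records, diseases, factors):
    """Build the m x n matrix N, where N[i][j] counts the records in which
    disease i and factor j appear together.

    records: list of (set_of_diseases, set_of_factors), already normalized.
    """
    d_index = {d: i for i, d in enumerate(diseases)}
    f_index = {f: j for j, f in enumerate(factors)}
    N = [[0] * len(factors) for _ in diseases]  # every N_ij starts at zero
    for rec_diseases, rec_factors in records:
        for d in rec_diseases:
            for f in rec_factors:
                if d in d_index and f in f_index:
                    N[d_index[d]][f_index[f]] += 1
    return N

diseases = ["cold", "acute pharyngitis"]
factors = ["sore throat", "fever", "nasal obstruction and discharge"]
records = [
    ({"cold"}, {"sore throat", "fever", "nasal obstruction and discharge"}),
    ({"acute pharyngitis"}, {"sore throat", "fever"}),
]
N = cooccurrence_matrix(records, diseases, factors)
print(N)  # [[1, 1, 1], [1, 1, 0]]
```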

S1346: Based on the co-occurrence frequency recording matrix of disease-related factors and diseases, obtain the relevance score of each disease-related factor with respect to each disease using a nonlinear transformation method.

In a specific implementation, the following is considered: given that the disease-related factor F_j is present in a record, the conditional probability of having disease D_i is P(D_i|F_j) = N_ij / ∑_i N_ij. This conditional probability reflects, to some extent, how strongly the factor points to the disease, but it is easily skewed by the cumulative effect of high-frequency common diseases: common diseases that appear in many records obtain very high conditional probabilities. The final scoring function should therefore also include a control parameter related to N_i = ∑_j N_ij. This is similar to the inverse document frequency idea used in the field of document classification.
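A small worked example makes the bias toward common diseases concrete. The sketch below computes P(D_i|F_j) = N_ij / ∑_i N_ij and N_i = ∑_j N_ij from a co-occurrence matrix; the counts are hypothetical.

```python
def conditional_probability(N):
    """Return P[i][j] = N[i][j] / sum_i N[i][j]; empty columns give 0.0."""
    m, n = len(N), len(N[0])
    col_sums = [sum(N[i][j] for i in range(m)) for j in range(n)]
    return [[N[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)]
            for i in range(m)]

def disease_totals(N):
    """Return N_i = sum_j N[i][j] for every disease i."""
    return [sum(row) for row in N]

# Hypothetical counts: row 0 is a common disease, row 1 a rare one;
# column 0 is "fever", column 1 is "rash".
N = [[90, 10],
     [10, 10]]
P = conditional_probability(N)
print(P[0][0], P[1][0])   # 0.9 0.1
print(disease_totals(N))  # [100, 20]
```

In this toy matrix the common disease dominates the "fever" column simply because it appears in far more records, which is the effect the control parameter related to N_i is meant to counteract.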

Preferably, the relevance score of a disease-associated factor to a disease can be determined by the following formula:

The above equation combines the conditional probability with a nonlinear transformation of the inverse disease frequency. As a result, each disease-related factor corresponds to at least one related disease, and the corresponding score is denoted Score(i, j).
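To illustrate the idea of combining a conditional probability with a nonlinear transformation of the inverse disease frequency, the following sketch uses an assumed stand-in form, P(D_i|F_j) · log(1 + N_total / N_i). This particular form is an assumption chosen for illustration by analogy with inverse document frequency; it is not the scoring formula of the invention, and the counts are hypothetical.

```python
import math

def relevance_scores(N):
    """Illustrative score: conditional probability times an IDF-like damping.

    The damping term log(1 + total / N_i) is an assumed stand-in, not the
    formula of the embodiment.
    """
    m, n = len(N), len(N[0])
    total = sum(sum(row) for row in N)
    col_sums = [sum(N[i][j] for i in range(m)) for j in range(n)]
    scores = [[0.0] * n for _ in range(m)]
    for i in range(m):
        n_i = sum(N[i])
        if n_i == 0:
            continue  # disease never observed: all scores stay 0.0
        damping = math.log(1.0 + total / n_i)  # larger for rarer diseases
        for j in range(n):
            if col_sums[j]:
                scores[i][j] = (N[i][j] / col_sums[j]) * damping
    return scores

# Hypothetical counts: row 0 is a common disease, row 1 a rare one.
N = [[90, 10],
     [10, 10]]
scores = relevance_scores(N)
# Both diseases have P = 0.5 for factor 1, but the rarer disease scores higher.
print(scores[1][1] > scores[0][1])  # True
```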

Through the above steps, the disease-disease related factors in the medical information are matched using the expanded disease-disease related factor dictionary, and the relevance scores of disease-related factors with respect to diseases are calculated and stored in a knowledge graph, so that a knowledge graph for predicting diseases is learned and constructed automatically.

In a preferred embodiment, after step S1346, the method further comprises: periodically testing the scoring function by an A/B test method and updating the relevance scores of the disease-related factors with respect to the diseases.

This step takes into account that the quality and quantity of the data in the original question-answer library affect the relevance scores of the disease-related factors, and that an online medical consultation platform generates a large number of new records every day. The scoring functions for the disease-related factors are therefore stored in an offline knowledge base, and the version that performs better in online A/B testing is periodically selected and brought online.

The training data of each version independently forms a versioned set of relevance scores of disease-related factors with respect to diseases. Because a relevance score is not exactly equal to the prior probability of a disease given a factor, a medical expert's evaluation of the scores is for reference only. The final evaluation index, and the basis for deciding whether a version replaces the others, is whether it improves the accuracy and user-friendliness of disease determination.
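One way such a version comparison could be sketched is shown below. It assumes that each version is a table of relevance scores, that each version is evaluated on the cases it served, and that the metric is a top-k hit rate against a confirmed diagnosis. The metric, the function names and the sample data are all assumptions; the embodiment only states that accuracy and user-friendliness of disease determination are the final criteria.

```python
def top_k_hit_rate(score_table, cases, k=3):
    """Fraction of cases whose true disease is among the k best-scored ones.

    score_table: {factor: {disease: score}}
    cases: list of (list_of_factors, true_disease)
    """
    hits = 0
    for factors, truth in cases:
        totals = {}
        for factor in factors:
            for disease, score in score_table.get(factor, {}).items():
                totals[disease] = totals.get(disease, 0.0) + score
        ranked = sorted(totals, key=totals.get, reverse=True)[:k]
        hits += truth in ranked
    return hits / len(cases) if cases else 0.0

def select_version(version_a, version_b, cases_a, cases_b, k=3):
    """Return the label and hit rate of the better-performing version."""
    rate_a = top_k_hit_rate(version_a, cases_a, k)
    rate_b = top_k_hit_rate(version_b, cases_b, k)
    return ("A", rate_a) if rate_a >= rate_b else ("B", rate_b)

# Hypothetical score tables and evaluation cases.
version_a = {"fever": {"cold": 0.5, "acute pharyngitis": 0.2}}
version_b = {"fever": {"cold": 0.1, "acute pharyngitis": 0.6}}
cases = [(["fever"], "acute pharyngitis"),
         (["fever"], "acute pharyngitis"),
         (["fever"], "cold")]
print(select_version(version_a, version_b, cases, cases, k=1)[0])  # B
```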

In constructing the knowledge base, the existing knowledge base is incorporated, so that the condition and basic information described by a patient can be analyzed and possible diseases can be suggested.

The process of obtaining the relevance score of a disease-related factor with respect to a disease is described in detail below in a preferred embodiment, in which "swelling and pain of throat", "cold" and "nasal obstruction and discharge" are in the standard dictionary. The process of obtaining the relevance score may include steps B1 through B5.

Step B1: Obtain from the original question-answer library the question-answer pair "My throat has been swollen and painful for the past few days, and I have a high fever, a stuffy nose and a runny nose. Doctor, what disease do I have?" and "Possibly a cold."

Step B2: Process the question-answer pair to match "swelling and pain of throat", "high fever", "nasal obstruction and discharge" and "cold".

Step B3: According to steps S121 and S122, replace "high fever" with "fever" using the replacement word list.

Step B4: Match the 3 disease-related factors and the 1 disease one by one, and count how often the disease co-occurs with its related factors to obtain the co-occurrence frequency recording matrix of disease-related factors and diseases.

Step B5: Determine the relevance score of each disease-related factor with respect to the disease according to the formula:

In a preferred embodiment, the score of the disease can be obtained by the following formula:

For example, for each factor F_j in {F_1, F_2, ..., F_j, ..., F_Q} associated with the patient's description, taking into account the category of the standardized disease-related factor, the scores of each disease D_i associated with that factor are accumulated using the following formula:

In the above formula, different factors carry different confidence for disease prediction, so different disease category mapping weights are assigned to different factors. The mapping relationship can be formulated by an expert according to the category attributes. For example, "smoking habit" is a disease-related factor in the lifestyle category, while "fever" is a disease-related factor in the disease-symptom category; during calculation, the category of each factor is determined and the corresponding disease category mapping weight is used.
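The accumulation described here can be sketched as a weighted sum, assuming the score of disease D_i is the sum over the patient's factors F_j of a category weight times Score(i, j). Treating the weight as depending only on the factor's category is a simplification, and the weights, categories and scores below are hypothetical.

```python
def disease_scores(patient_factors, score_table, factor_category, category_weight):
    """Accumulate weighted relevance scores per disease.

    score_table: {factor: {disease: relevance score}}
    factor_category: {factor: category name}
    category_weight: {category name: weight}; unknown categories weigh 1.0
    """
    totals = {}
    for factor in patient_factors:
        weight = category_weight.get(factor_category.get(factor), 1.0)
        for disease, score in score_table.get(factor, {}).items():
            totals[disease] = totals.get(disease, 0.0) + weight * score
    return totals

# Hypothetical expert-defined categories and weights.
factor_category = {"fever": "symptom", "smoking habit": "lifestyle"}
category_weight = {"symptom": 1.0, "lifestyle": 0.5}
score_table = {
    "fever": {"acute pharyngitis": 0.2, "cold": 0.1},
    "smoking habit": {"acute pharyngitis": 0.1, "tracheal disease": 0.3},
}
totals = disease_scores(["fever", "smoking habit"], score_table,
                        factor_category, category_weight)
for disease in sorted(totals):
    print(disease, round(totals[disease], 3))
```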

S140: Rank the scores of the diseases.

S150: Determine the disease according to the ranking result.
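Steps S140 and S150 amount to sorting the disease scores and reading off the leading entries. A minimal sketch, assuming the top `top_k` diseases are reported (the cut-off and the scores are hypothetical):

```python
def rank_diseases(totals, top_k=3):
    """Sort diseases by descending score and keep the top_k entries."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical disease scores.
totals = {"cold": 0.10, "acute pharyngitis": 0.25,
          "tracheal disease": 0.15, "gastritis": 0.01}
print(rank_diseases(totals))
# [('acute pharyngitis', 0.25), ('tracheal disease', 0.15), ('cold', 0.1)]
```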

The scoring of suspected diseases using an embodiment of the invention is described in detail below in a preferred embodiment. The patient's description is "I have had a persistent high fever for the past few days and I have a smoking habit; what disease do I have?". The ranking process may include steps C1 through C7.

Step C1: Perform keyword matching on "I have had a persistent high fever for the past few days and I have a smoking habit; what disease do I have?" using the expanded disease-disease related factor dictionary, and extract "high fever" and "smoking habit".

Step C2: It is detected that "high fever" is in the expanded disease-disease related factor dictionary but not in the standard disease-disease related factor dictionary.

Step C3: According to the replacement word list, "high fever" is replaced with "fever".

Step C4: Determine the mapping weights according to the categories of "fever" and "smoking habit" respectively.

Step C5: Determine the patient's score for each disease according to the formula:

Step C6: Rank the scores of the different diseases.

Step C7: Output the three top-ranked diseases: acute pharyngitis, 0.143531; acute tonsil enlargement, 0.129281; tracheal diseases, 0.062088.
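Steps C1 to C3 can be sketched as dictionary-based keyword matching followed by replacement. The sketch assumes simple longest-first substring matching on an English stand-in for the patient's text; the matching strategy, the function name and the dictionary contents are hypothetical.

```python
def extract_terms(text, expanded_dictionary):
    """Longest-first substring matching; returns terms in order of appearance.

    A shorter term is skipped when it is contained in a longer term that has
    already matched (for example "fever" inside "high fever"). This is a
    crude rule: a separate occurrence of the shorter term is skipped too.
    """
    found = []
    for term in sorted(expanded_dictionary, key=len, reverse=True):
        pos = text.find(term)
        if pos != -1 and not any(term in longer for _, longer in found):
            found.append((pos, term))
    return [term for _, term in sorted(found)]

expanded_dictionary = {"fever", "high fever", "smoking habit", "cold"}
replacement = {"high fever": "fever"}  # replacement word list

text = ("I have had a persistent high fever for the past few days "
        "and I have a smoking habit; what disease do I have?")
terms = extract_terms(text, expanded_dictionary)       # step C1
factors = [replacement.get(t, t) for t in terms]       # steps C2 and C3
print(terms)    # ['high fever', 'smoking habit']
print(factors)  # ['fever', 'smoking habit']
```

The resulting standard factors are what steps C4 to C7 weight, score and rank.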

Although the foregoing embodiments describe the steps in the above sequential order, those skilled in the art will understand that, to achieve the effect of the present embodiments, the steps need not be executed in this order; they may be executed simultaneously (in parallel) or in reverse order, and such simple variations are within the scope of the present invention.

Based on the same technical concept as the method embodiment, an embodiment of the invention provides a system for automatically constructing a knowledge base based on word vectors to realize auxiliary diagnosis and treatment. The system can execute the method embodiment described above. As shown in fig. 2, the system 20 may include: an acquisition module 21, an extraction module 22, a detection module 23, a calculation module 24, a ranking module 25 and a determination module 26. The acquisition module 21 is used for obtaining a patient description. The extraction module 22 is used for performing keyword matching on the patient description using the expanded disease-disease related factor dictionary established based on word vectors, and extracting the medically related words and expressions in the patient description. The detection module 23 is used for detecting whether the extracted words and expressions are in the standard disease-disease related factor dictionary. The calculation module 24 is used for calculating the score of a disease based on the detection result, in combination with the relevance scores of the disease-related factors with respect to the disease obtained from the expanded disease-disease related factor dictionary. The ranking module 25 is used for ranking the scores of the diseases. The determination module 26 is used for determining the disease according to the ranking result.

In a preferred embodiment, the extraction module may specifically include: a word vector model building unit and an extended dictionary establishing unit. The word vector model building unit is used for training, on the medical information, the word-vector embedding distributed representation model related to the disease-disease related factors. The extended dictionary establishing unit is used for expanding the standard disease-disease related factor dictionary by a distance measurement method based on the word-vector embedding distributed representation model, and establishing the expanded disease-disease related factor dictionary.

In a preferred embodiment, the word vector model building unit may specifically include: an acquisition unit, a cleaning unit, a first statistical unit and a generation unit. The acquisition unit is used for acquiring the medical information training corpus. The cleaning unit is used for cleaning the medical information training corpus. The first statistical unit is used for counting the high-frequency expression patterns that appear in the question-answer library records, increasing the weight of these patterns in the word segmentation model, and performing Chinese word segmentation to obtain the training text. The generation unit is used for training on the training text and generating the word-vector embedding distributed representation model.

In a preferred embodiment, the calculation module may specifically include: a first replacement word list establishing unit and a correlation score calculation unit. The first replacement word list establishing unit is used for expanding the standard disease-disease related factor dictionary by a distance measurement method based on the word-vector embedding distributed representation model, and establishing a replacement word list. The correlation score calculation unit is used for matching the disease-disease related factors in the medical information using the expanded disease-disease related factor dictionary and the replacement word list, and calculating the relevance score of each disease-related factor with respect to the disease.

In a preferred embodiment, the correlation score calculation unit may specifically include: an extraction unit, a detection unit, a first normalization unit, a second statistical unit and a nonlinear transformation unit. The extraction unit is used for performing keyword matching on the doctor-patient question-answer records using the expanded disease-disease related factor dictionary, and extracting the medically related words and expressions in those records. The detection unit is used for detecting whether the extracted medically related words and expressions are in the standard disease-disease related factor dictionary. The first normalization unit is used for normalizing the extracted medically related words and expressions into the corresponding standard expressions according to the replacement word list when they are not in the standard disease-disease related factor dictionary. The second statistical unit is used for counting, based on the standard expressions, how often each disease co-occurs with its related factors to obtain the co-occurrence frequency recording matrix of disease-related factors and diseases. The nonlinear transformation unit is used for obtaining the relevance score of each disease-related factor with respect to the disease by a nonlinear transformation method based on that co-occurrence frequency recording matrix.

In a preferred embodiment, the system may further comprise a second replacement word list establishing unit, which is used for expanding the standard disease-disease related factor dictionary by a distance measurement method based on the word-vector embedding distributed representation model, and establishing a replacement word list. The detection module may further include a second normalization unit, which is used for normalizing the extracted words and expressions to the corresponding standard expressions according to the replacement word list when they are not in the standard disease-disease related factor dictionary, so as to obtain the standardized disease-related factors. The calculation module may further include a disease score calculation unit, which is used for calculating the score of a disease based on the standardized disease-related factors and on the relevance scores of the disease-related factors with respect to the disease obtained from the expanded disease-disease related factor dictionary.

For the specific working process and related description of the system described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.

Those skilled in the art will appreciate that the above system for automatically constructing a knowledge base based on word vectors to realize assisted diagnosis and treatment may further include some other known structures, such as a processor, a controller, a memory, a bus, and the like, wherein the memory includes, but is not limited to, a random access memory, a flash memory, a read only memory, a programmable read only memory, a volatile memory, a non-volatile memory, a serial memory, a parallel memory, a register, and the like, the processor includes, but is not limited to, a single-core processor, a multi-core processor, a processor based on an X86 architecture, a CPLD/FPGA, a DSP, an ARM processor, an MIPS processor, and the like, and the bus may include a data bus, an address. Such well-known structures are not shown in fig. 2 in order to not unnecessarily obscure embodiments of the present disclosure. It should also be noted that the number of individual modules in fig. 2 is merely illustrative. The number of modules may be any according to actual needs.

It should be noted that the division of the modules is only an example, and in practical applications, another division manner may be provided. In addition, each module can be decomposed into other modules again, which is not described herein again. Each module can be implemented by hardware, software, or a combination of hardware and software. In practical applications, the modules may be implemented by a central processing unit, a microprocessor, a digital signal processor, a field programmable gate array, or the like. Exemplary hardware platforms for implementing the various modules may include platforms such as Intel x86 based platforms with compatible operating systems, Mac platforms, MACOS, iOS, Android OS, and the like.

It should be noted that the terms "first", "second", etc. used herein should not be construed as limiting the scope of the present invention in various forms.

The above-mentioned embodiments and experimental examples describe the technical solutions, implementation details and algorithm effectiveness of the present invention in detail. It should be understood that the above description is only exemplary of the present invention, and is not intended to limit the present invention, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A system for automatically constructing a knowledge base to realize auxiliary diagnosis and treatment based on word vectors, characterized by comprising:

an acquisition module for acquiring a patient description;

the extraction module is used for performing keyword matching on the patient description by utilizing an expanded disease-disease related factor dictionary established based on word vectors, and extracting words and expressions which are related to medicine in the patient description;

a detection module for detecting whether the extracted words and expressions are in a standard disease-disease related factor dictionary;

a calculation module for calculating a score of a disease based on the detection result in combination with a correlation score of a disease-related factor obtained from the expanded disease-related factor dictionary with respect to the disease;

a ranking module for ranking the scores of the diseases;

a determining module for determining the disease according to the sorting result;

the calculation module specifically comprises a first replacement word list establishing unit and a correlation score calculation unit,

the first replacement word list establishing unit is used for expanding the standard disease-disease related factor dictionary by a distance measurement method based on a pre-established word-vector embedding distributed representation model, and establishing a replacement word list;

the correlation score calculating unit specifically includes:

a first normalization unit for normalizing the medically relevant words and expressions in the extracted patient description into corresponding standard expressions according to the replacement vocabulary when the words and expressions are not in the standard disease-related factor dictionary;

the nonlinear transformation unit is used for obtaining a correlation score of the disease-related factor corresponding to the disease by using a nonlinear transformation method based on the disease-related factor and a co-occurrence frequency recording matrix of the disease;

wherein the relevance score of the disease-related factor with respect to the disease is determined by the following formula:

P(D_i|F_j) = N_ij / ∑_i N_ij

wherein Score(i, j) denotes the relevance score of the j-th disease-related factor with respect to the i-th disease; P(D_i|F_j) denotes the conditional probability of having the disease; D_i denotes the i-th disease; F_j denotes the j-th disease-related factor; N_i denotes the frequency with which the i-th disease co-occurs with its related factors, N_i = ∑_j N_ij; and N_ij denotes the co-occurrence frequency of the i-th disease and the j-th related factor.

2. The system according to claim 1, wherein the extraction module specifically comprises:

an extended dictionary establishing unit, used for expanding the standard disease-disease related factor dictionary by a distance measurement method based on the word-vector embedding distributed representation model, and establishing the expanded disease-disease related factor dictionary.

3. The system according to claim 2, wherein the word vector model building unit specifically includes:

the cleaning unit is used for cleaning the medical information training corpus;