CN113764112A

CN113764112A - Online medical question and answer method

Info

Publication number: CN113764112A
Application number: CN202111085061.9A
Authority: CN
Inventors: 王成伟; 高中霞; 艾延永
Original assignee: Second Hospital of Shandong University
Current assignee: Second Hospital of Shandong University
Priority date: 2021-09-16
Filing date: 2021-09-16
Publication date: 2021-12-07

Abstract

The invention belongs to the technical field of internet medical treatment and relates to an online medical question and answer method. After a patient question is received, long-sentence compression is performed to obtain a rewritten question; the question is segmented into a word set; a neural language model is trained and used to vectorize the word set and, in turn, the question; entity words are extracted and converted into standard words; the patient's intention type is identified according to a feature word library; a question analysis result is generated and a knowledge graph is constructed; and the analysis result is converted into a query statement, which is run against the knowledge graph to obtain an answer. The invention effectively addresses the pain point that patients describe their conditions in non-professional terms that a medical system cannot recognize: by accurately identifying the patient's inquiry intention, it uses a self-constructed knowledge graph to recommend medicines, videos, treatments and the like for the patient's disease symptoms.

Description

Online medical question and answer method

Technical Field

The invention belongs to the technical field of internet medical treatment, and relates to an online medical question and answer method.

Background

With the rapid development of the internet, demand in the medical and health field has grown quickly, and in recent years the application of artificial intelligence technology in the medical field has been of great help to people. The medical question-answering system is one of the important applications in the medical and health field; it has permeated daily life and is an important way for people to learn medical knowledge. However, patients often lack medical knowledge and cannot accurately describe their own condition, and their questions tend to be colloquial and confusingly worded, so the online question-answering systems currently used in the industry cannot provide appropriate answers.

The existing medical question-answering technical scheme is basically divided into the following three types:

The first is based on information extraction. This approach mainly extracts answers by matching keywords and rules, and then ranks them by computed similarity.

The second is based on knowledge graph technology. This approach builds entity relationship edges between the entities of medical domain knowledge to form a knowledge graph of the vertical field; its core idea is to get from a question to an answer through knowledge reasoning.

The third is based on deep learning. In recent years, deep learning has developed rapidly as computer hardware has improved, and it has achieved good results in fields such as computer vision and natural language processing. In research on medical question-answering systems, deep learning can be used to train on medical data and to learn complex network models that solve key problems in the question-answering process, such as named entity recognition for medical terminology and classification of chief complaint texts.

Disclosure of Invention

In the prior art, a medical question-answering system cannot provide proper answers when a patient who lacks professional knowledge cannot describe the problem accurately. To address this, the invention provides a novel online medical question-answering method and system that can provide accurate medical question-answering service for the patient in real time.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

An online medical question-answering method comprises the following steps:

(1) after receiving the questions of the patient, carrying out long and short sentence compression processing to obtain rewritten question sentences;

(2) performing word segmentation on the rewritten question sentence, and dividing the sentence into word sets;

(3) pre-training a neural language model, vectorizing the word set, and further vectorizing the question;

(4) extracting entity words related to disease symptoms;

(5) converting the extracted entity words into corresponding standard words;

(6) identifying the intention type of the patient according to the feature word bank;

(7) generating a question analysis result according to the standard words and the intention types of the patients;

(8) constructing a knowledge graph according to a medical disease knowledge base;

(9) converting the question analysis result into a query statement, and querying the knowledge graph to obtain an answer.

Preferably, in the step (1), the patient question, once received, is divided into a plurality of short sentences at punctuation marks, and the short sentences are further classified into sentences relevant to medical treatment and spoken sentences irrelevant to medical treatment;

the short sentences are further classified by a TextCNN neural network model, which is constructed as follows:

S1, input layer: medical question and answer data are collected online from medical websites as a data set; question sentences relevant to medical treatment are labeled 1 and spoken sentences irrelevant to medical treatment are labeled 0; the data set is divided into a training set and a test set in a ratio of 7:3; and the training set data are converted into embedding word vectors that serve as the input for training the TextCNN neural network model;

S2, convolution layer: features of the embedding word vectors from the input layer are extracted through convolution operations, each convolution kernel outputting a one-dimensional feature vector;

S3, pooling layer: a pooling operation is performed on the one-dimensional feature vectors output by the convolution layer to extract more abstract features; the maximum value of each one-dimensional feature vector is taken, and the maxima of all the feature vectors are concatenated and output as a single one-dimensional feature vector;

S4, output layer: the probability of each category for the question is mapped into (0, 1), and the category to which the question belongs is determined by the maximum probability;

the parameters of the TextCNN neural network model include the number of training iterations and the learning rate;

the number of training iterations is the number of iterative passes computed over the whole training set;

the learning rate is the coefficient used in the gradient descent algorithm with which the model updates its parameters;

after the TextCNN neural network model has been constructed, the short sentences obtained in the step (1) are input into the constructed model to obtain the category to which each short sentence belongs, the short sentences of category 0 are removed, and the question sentences related to medical treatment are retained;

in the step (2), coarse word segmentation is performed on the patient's chief complaint information using different symbols, and fine word segmentation is then performed using the Chinese word segmentation tool jieba in combination with a custom disease symptom word bank, to obtain a chief complaint word set;

the different symbols include any one or more of: comma, colon, semicolon, &, percent sign, equal sign and space;

the custom disease symptom word bank is a word bank built from disease symptom words obtained from medical websites;

jieba is combined with the disease symptom word bank through the domain-specific word bank interface that jieba provides: the custom disease symptom word bank is added through this interface to ensure that professional disease and symptom terms in the patient's chief complaint information are not split incorrectly;

the step (3) is specifically operated as follows: a language model is trained with the word2vec algorithm; its input is the rewritten question and its output is a vector whose dimension equals the vocabulary size, the value of each dimension being the predicted probability of the word that follows the current input word; the collected patient question and answer data from medical websites are used as the training data set of the model; the parameters of the model, including the number of training iterations and the dimensionality of the word vectors, are set; after the word vector of each word is obtained, the sentence vector is generated: the word set is obtained according to the step (2), the number of words is denoted $n$, the word vectors of the words are denoted $[v_1, v_2, v_3, \ldots, v_n]$, and the sentence vector is denoted $s$, so that the sentence vector is expressed as

$s = \frac{1}{n}\sum_{i=1}^{n} v_i$

wherein $v_n$ denotes the numerical values of the dimensions of the word vector;

the step (4) adopts any one of a deep learning algorithm, a dictionary-based rule matching algorithm and a template-based matching algorithm;

in the step (5), professional disease symptom terms and their corresponding synonyms are crawled from the knowledge bases of medical websites; identical words are de-duplicated, and words with the same meaning but different expressions are merged as synonyms using the Python Chinese synonym toolkit Synonyms, forming a disease symptom knowledge base; entity words extracted in the step (4) that are already standard terms are kept, and each remaining non-standard word is compared with the standard words in the knowledge base by a similarity calculation method to obtain the standard term with the highest similarity;

in the step (6), feature word banks for the different query types are defined exhaustively according to knowledge of the medical vertical field, and the patient question is matched against them with a string matching algorithm to obtain the patient's intention type;

the step (8) comprises a data collection stage, entity relationship definition and knowledge graph construction, wherein the data collection stage collects data sources such as medical record evaluation data and medical books published by medical websites, Baidu Encyclopedia, authoritative medical institutions and research units; the knowledge graph is constructed by taking entities as the nodes of the graph and entity relationships as the edges connecting them, and a neo4j graph database is used to store the knowledge graph;

in the step (9), the patient question analysis result obtained in the step (7) is converted into the query language of the neo4j graph database, a Cypher MATCH statement is used to match and search the graph stored in neo4j, and an answer is assembled from the data returned by the query and returned to the patient.

Preferably, the convolution layer uses convolution kernels of three different sizes, 2 × 2, 3 × 3 and 4 × 4, with 128 kernels of each size; the output layer adopts any one of the softmax, sigmoid and SVM classification algorithms as a classifier; softmax is calculated as

$p_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$

wherein $p_i$ is the predicted probability value for each department category, and $x_j$ is the value of the $j$-th dimension of the vector at the softmax layer.

Preferably, the deep learning algorithm in the step (4) is the sequence labeling model BiLSTM-CRF, whose input is the patient consultation question represented as a sentence vector and whose output is the labeling result of the sentence; the first layer of the model is the vector representation of the consultation question; the second layer uses a bidirectional LSTM neural network to extract the sequential features of the question, and when the data volume is too large or the network is too slow, the LSTM is replaced by a variant of the LSTM network; the third layer, the output layer, is a conditional random field layer, which takes the features extracted by the LSTM as the observation sequence and outputs the most probable state sequence; and entities and their entity types are extracted from the output state sequence according to the BIO rule for complete entities.

Preferably, the similarity calculation method in the step (5) adopts any one of word2vec similarity, cosine similarity calculation and the TF-IDF algorithm; and the string matching in the step (6) adopts any one of the Python ahocorasick package and the forward or reverse maximum matching algorithm.

Compared with the prior art, the invention has the advantages and positive effects that:

the invention effectively addresses the pain point that patients describe their own condition in non-professional terms that a medical system cannot recognize; it accurately identifies the patient's inquiry intention and, using a self-constructed knowledge graph, recommends medicines, videos, treatments and the like according to the patient's disease symptoms.

Drawings

FIG. 1 is a schematic diagram of the on-line medical question-answering process of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention may be more clearly understood, the present invention will be further described with reference to specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments of the present disclosure.

Example 1

This embodiment provides the specific steps of an online medical question answering method, as shown in FIG. 1:

1. Preprocessing the received patient question: compressing the patient's question to obtain a rewritten question sentence;

1.1 to compress a long, difficult sentence input by the patient (a relatively complex sentence containing many spoken words irrelevant to medical treatment) into a medical consultation question, the long sentence is first divided into several short sentences at punctuation marks;
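The splitting in step 1.1 can be sketched as follows. This is only an illustration: the patent does not list which punctuation marks are used, so the character set in `_CLAUSE_BREAKS` and the function name `split_into_clauses` are assumptions.

```python
import re

# Assumed set of clause-ending punctuation (full-width and ASCII).
# The ASCII period is left out so that numbers such as "38.5" stay intact.
_CLAUSE_BREAKS = re.compile(r"[，。！？；,!?;]+")


def split_into_clauses(question: str) -> list[str]:
    """Split a long patient question into short clauses at punctuation marks."""
    parts = _CLAUSE_BREAKS.split(question)
    # Drop empty fragments produced by leading or trailing punctuation.
    return [p.strip() for p in parts if p.strip()]


clauses = split_into_clauses("你好，我有高血压。最近胸闷！")
```

Each resulting clause would then be passed to the classifier described in step 1.2.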

1.2 a text classification model is trained to classify the segmented short sentences into two classes: the first class is sentences related to medical treatment, and the second class is spoken sentences unrelated to medical treatment;

optionally, a TextCNN neural network may be used for this two-class classification, where the input of the network is the vectorized representation of the question and the output is the category to which the question belongs, as described in detail below;

1.2.1 the first layer of the model is the input layer; this step requires the neural language model constructed in step 3 in order to convert the patient's question into embedding word vectors, so the training of the neural language model in step 3 must be carried out in advance; the implementation steps are explained in detail below;

medical question and answer data are collected from medical websites such as Chunyu Doctor (Spring Rain Doctor) and Haodaifu Online (Good Doctor Online) as the data set of the TextCNN text classification model; for example, the collected data set may contain hundreds of thousands of entries, in which question sentences related to medical treatment are labeled 1 and spoken sentences unrelated to medical treatment are labeled 0; the data set is divided into a training set and a test set in a ratio of 7:3, and the training set data are converted into embedding word vectors and used as the input for TextCNN model training, that is, the first layer of the neural network model;

1.2.2 the second layer of the model is the convolution layer, which extracts features from the sentence vectors of the input layer through convolution operations, each convolution kernel outputting a one-dimensional feature vector;

optionally, the convolution layer may use convolution kernels of three different sizes, 2 × 2, 3 × 3 and 4 × 4, with 128 kernels of each size;

1.2.3 the third layer of the model is the pooling layer, which performs a pooling operation on the sentence feature vectors output by the convolution layer to extract more abstract features of the sentence;

optionally, max pooling may be used: the maximum value of each feature vector is taken, and the maxima of all feature vectors are then concatenated to output a single one-dimensional feature vector;
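The convolution and max pooling of steps 1.2.2 and 1.2.3 can be illustrated in plain Python. This is a sketch under assumptions, not the patented model: it follows the common TextCNN formulation in which a kernel covers a window of h consecutive words across the full embedding width, applies no activation function, and uses hypothetical function names.

```python
def conv_feature_map(embeddings, kernel, bias=0.0):
    """Slide one kernel over a sentence and return a 1-D feature vector.

    embeddings: n word vectors of width d (list of lists).
    kernel:     h rows of d weights, covering h consecutive words (assumption).
    """
    h = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        total = bias
        for weights, vector in zip(kernel, window):
            total += sum(w * x for w, x in zip(weights, vector))
        feature_map.append(total)
    return feature_map


def max_pool_and_concat(feature_maps):
    """Take the maximum of each feature map and concatenate the maxima."""
    return [max(fm) for fm in feature_maps]


sentence = [[1, 0], [0, 1], [2, 2]]          # three words, 2-D toy embeddings
kernels = [[[1, 1], [1, 1]], [[1, -1], [0, 0]]]  # two toy kernels with h = 2
pooled = max_pool_and_concat([conv_feature_map(sentence, k) for k in kernels])
```

With 128 kernels for each of three window sizes, the pooled vector would have 384 entries, one per kernel.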

1.2.4 the last layer of the model is the output layer;

optionally, a softmax classifier may be used as the output layer; the classifier maps the probability of each category for a question into (0, 1), so that the category to which the question belongs is determined by the maximum probability; besides softmax, classification algorithms such as sigmoid and SVM may also be used as the classifier of the output layer;

the softmax layer is calculated as

$p_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \qquad (1)$

wherein $p_i$ is the predicted probability value for each department category and $x_j$ is the value of the $j$-th dimension of the vector at the softmax layer; the formula is used to calculate the predicted probability of a department recommendation.
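A numerical illustration of the softmax calculation is given below. The function name `softmax` and the example scores are assumptions; subtracting the maximum score before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities in (0, 1) that sum to 1."""
    highest = max(scores)
    # Shifting by the maximum avoids overflow and leaves the ratios unchanged.
    exps = [math.exp(x - highest) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([0.3, 2.1])   # toy scores for category 0 and category 1
predicted_category = probabilities.index(max(probabilities))
```

The predicted category is the index with the largest probability, which matches the "maximum probability" rule of step 1.2.4.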

1.2.5 the parameters of the model, including the number of training iterations and the learning rate, are set;

the learning rate is the coefficient used in the gradient descent algorithm (differentiation) when the model updates its parameters.

1.2.6 the short sentences obtained by splitting the current patient's question are each input into the constructed TextCNN neural network model to obtain their categories; the short sentences of category 0 are removed and the question sentences relevant to medical treatment are retained, which completes the compression of the long sentence into short sentences related only to medical treatment;

it should be noted that step 1.2.6 must be combined with the key dictionary of the medical field (the disease knowledge base in step 8.1) to ensure that the keywords in the question are preserved.
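The filtering of step 1.2.6, including the keyword safeguard, might look like the sketch below. The `classify` callable stands in for the trained TextCNN model and is hypothetical, as are the function name and the choice to rejoin the kept clauses with a full-width comma.

```python
def compress_question(clauses, classify, medical_keywords):
    """Keep clauses labeled 1 (medical) or containing a known medical keyword."""
    kept = []
    for clause in clauses:
        has_keyword = any(keyword in clause for keyword in medical_keywords)
        # A clause survives if the classifier accepts it, or if dropping it
        # would lose a keyword from the medical dictionary.
        if classify(clause) == 1 or has_keyword:
            kept.append(clause)
    return "，".join(kept)


def toy_classifier(clause):
    """Toy stand-in for the trained classifier (assumption): 1 = medical."""
    return 1 if "药" in clause else 0


rewritten = compress_question(
    ["你好", "我有高血压", "吃什么药"], toy_classifier, {"高血压"}
)
```

Here the greeting is dropped, while the clause naming the disease is kept through the dictionary check even though the toy classifier rejects it.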

2. Performing word segmentation on the patient's question, segmenting it into a word set;

2.1 word segmentation is performed on the patient question obtained from step 1: coarse word segmentation is first performed on the patient's chief complaint information using special symbols, and fine word segmentation is then performed using the Chinese word segmentation tool jieba in combination with a custom disease symptom word bank, to obtain a chief complaint word set;

the special symbols include: comma, colon, semicolon, &, percent sign, equal sign and space;

the custom disease symptom word bank is a word bank built from disease symptom words obtained from professional medical websites; the sources include Haodaifu (Good Doctor), Chunyu Doctor (Spring Rain Doctor), Quick Ask Doctor, 99 Health Net and Baidu Encyclopedia, as well as other disease symptom data published online.

The jieba word segmentation tool is combined with the disease symptom word bank as follows: jieba provides an interface for domain-specific word banks, through which the custom disease symptom word bank is added, ensuring that professional disease and symptom terms in the patient's chief complaint information are not split incorrectly.
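The coarse pass of step 2.1 can be sketched with the standard library alone. The fine pass with jieba and its custom word bank is not reproduced here. Treating both full-width and ASCII forms of the comma, colon and semicolon as separators is an assumption, as is the function name.

```python
import re

# Special symbols from step 2.1: comma, colon, semicolon, &, percent sign,
# equal sign and space (full-width variants included as an assumption).
_COARSE_SEPARATORS = re.compile(r"[,，:：;；&%=\s]+")


def coarse_segment(chief_complaint: str) -> list[str]:
    """Split chief complaint text into coarse chunks at the special symbols."""
    return [chunk for chunk in _COARSE_SEPARATORS.split(chief_complaint) if chunk]


chunks = coarse_segment("头痛，发热;咳嗽 3天")
```

Each chunk would then go through the fine segmentation pass so that multi-character disease and symptom terms stay intact.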

3. Vectorizing the word set obtained by segmenting the patient's question according to a pre-trained neural language model, and then vectorizing the sentence;

3.1 optionally, the language model is trained with the word2vec algorithm;

word2vec is a neural network with a single hidden layer; the input of the model is the sentence asked by the patient, and the output is a vector whose dimension equals the vocabulary size, the value of each dimension being the predicted probability of the word that follows the current input word;

3.2 as in step 1.2.1, the collected patient question and answer data from medical websites are used as the training data set of the model, the data being original text that has been cleaned;

optionally, the cleaning constructs a stop word list and removes the stop words in each sentence, such as 'hello' and 'I';

3.3 the parameters of the model, including the number of training iterations and the dimensionality of the word vectors, are set;

optionally, the word vector dimension is set to 300; the dimension may be chosen according to the size of the data and may also be 50, 200 and so on;

3.4 after the word vector of each word is obtained, the sentence vector is generated;

the sentence vector of the patient consultation question is obtained by adding the word vectors and averaging them, so the resulting sentence vector is likewise 300-dimensional;

in the add-and-average method, the word set of the segmented patient question is first obtained according to step 2; the number of words is denoted $n$, the word vectors of the words are denoted $[v_1, v_2, v_3, \ldots, v_n]$, and the sentence vector is denoted $s$; the sentence vector is then given by

$s = \frac{1}{n}\sum_{i=1}^{n} v_i$

wherein $v_n$ denotes the numerical values of the dimensions of the word vector; the patient's question, represented numerically by this sentence vector method, is used as the input of the disease entity recognition model.
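The add-and-average computation of step 3.4 is small enough to show directly. The function name and the two-dimensional toy vectors are assumptions; in the embodiment the vectors would be 300-dimensional word2vec outputs.

```python
def sentence_vector(word_vectors):
    """Average n word vectors dimension by dimension to get the sentence vector."""
    n = len(word_vectors)
    if n == 0:
        raise ValueError("at least one word vector is required")
    dimension = len(word_vectors[0])
    return [sum(v[k] for v in word_vectors) / n for k in range(dimension)]


s = sentence_vector([[1.0, 2.0], [3.0, 4.0]])   # two toy word vectors
```

The result has the same dimensionality as the word vectors, which is why the sentence vector in the embodiment is also 300-dimensional.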

4. The patient question vectorized in step 3 is used as the input of a pre-constructed disease entity recognition model, which extracts the entity words related to disease symptoms from the patient question;

4.1 optionally, entities are extracted with a deep learning algorithm; alternatively, a dictionary-based rule matching algorithm, a template-based matching algorithm or the like may be used;

in the invention, the deep learning algorithm is the sequence labeling model BiLSTM-CRF; the input of the model is the patient consultation question represented as a sentence vector, the output is the labeling result of the sentence, and the label types are disease, symptom and body part;

optionally, entities may be labeled with BIO triple tags; for example, from the sentence "I have hypertension and have recently felt chest tightness", entity words such as "hypertension" and "chest tightness" can be extracted, and the output labeling result, character by character in the original Chinese, is ("I": O, "have": O, "high": DISEASE-B, "blood": DISEASE-I, "pressure": DISEASE-I, "most": O, "recent": O, "feel": O, "to": O, "chest": SYMPTOM-B, "tight": SYMPTOM-I), wherein DISEASE-B marks the first character of a disease, DISEASE-I marks a non-first character of a disease, and the letter O marks a non-entity character; the BIO triple is the tag set used in the present invention, and an alternative tag set may also be used;

the following steps explain the construction of the model in detail;

4.2 the input layer of the model is the same as in step 1.2.1 and takes the vector representation of the question as input; the second layer uses a bidirectional LSTM neural network to extract the sequential features of the question, which has the advantage of capturing both the preceding and the following semantic features of the sentence; the number of LSTM network units can be set flexibly as required;

optionally, if the data volume is too large or the network is too slow, a variant of the LSTM network, such as the bidirectional gated recurrent unit network BiGRU, may be used instead of the LSTM;

4.3 the third layer of the model, the output layer, is a conditional random field (CRF) layer, which takes the features extracted by the LSTM as the observation sequence and outputs the most probable state sequence, namely the required label sequence;

4.4 entities and their entity types are extracted from the output labeling result according to the BIO rule for complete entities.
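The decoding in step 4.4 can be sketched as follows, using the tag spelling from the example above (entity type first, then `-B` or `-I`). The function name is hypothetical, and the tag sequence is written by hand here rather than produced by a trained BiLSTM-CRF model.

```python
def extract_entities(chars, tags):
    """Collect (text, type) pairs from a character sequence with BIO tags."""
    entities = []
    current_chars, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_chars:
            entities.append(("".join(current_chars), current_type))

    for ch, tag in zip(chars, tags):
        if tag.endswith("-B"):
            flush()
            current_chars, current_type = [ch], tag[:-2]
        elif tag.endswith("-I") and current_type == tag[:-2]:
            current_chars.append(ch)
        else:
            # "O" or an ill-formed "-I" tag ends the current entity.
            flush()
            current_chars, current_type = [], None
    flush()
    return entities


chars = list("我有高血压最近感到胸闷")
tags = ["O", "O", "DISEASE-B", "DISEASE-I", "DISEASE-I",
        "O", "O", "O", "O", "SYMPTOM-B", "SYMPTOM-I"]
found = extract_entities(chars, tags)
```

An entity is considered complete only when it starts with a `-B` tag, which is one reading of the "complete entity" rule in the text.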

5. Converting the extracted disease symptom entity words into the corresponding standard words according to the constructed synonym mapping table;

this step converts the entities extracted in step 4 into standard words according to a pre-established standard knowledge base, which contains the professional expressions of disease symptoms and their corresponding synonyms; the specific steps are as follows:

5.1 professional disease symptom terms and their corresponding synonyms are crawled from the knowledge bases of the major medical websites, including Chunyu Doctor, Haodaifu, 39 Health Net and others; for example, for the crawled disease word "pneumonia", the synonyms include "lower respiratory tract infection", "infectious lung disease" and the like;

5.2 the crawled professional knowledge is merged; because professional knowledge such as disease symptoms is obtained from several source websites, it contains duplicates as well as terms with the same meaning but different expressions; identical words are therefore de-duplicated, and terms with the same meaning but different expressions are merged as synonyms using a Chinese synonym toolkit such as the Python package Synonyms, to form a usable disease symptom knowledge base;

5.3 entity words extracted in step 4 that are already standard terms are kept, and each non-standard word is compared with the standard words in the knowledge base by a similarity calculation method to obtain the standard term with the highest similarity;

the similarity may be calculated with word2vec; optionally, cosine similarity, the TF-IDF algorithm and the like may also be used;

5.4 through the above steps, the standard disease symptom entity words in the patient's question are obtained.
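Step 5.3 can be illustrated with cosine similarity over word vectors. The vectors below are made-up toy values, and the function names and the `None` fallback for unknown words are assumptions rather than details from the patent.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def to_standard_term(entity, vectors, standard_terms):
    """Return the entity if already standard, else the most similar standard term."""
    if entity in standard_terms:
        return entity
    if entity not in vectors:
        return None  # no vector available, so no similarity can be computed
    candidates = [term for term in standard_terms if term in vectors]
    return max(candidates, key=lambda t: cosine(vectors[entity], vectors[t]),
               default=None)


toy_vectors = {"肺部感染": [0.9, 0.1], "肺炎": [1.0, 0.0], "高血压": [0.0, 1.0]}
standard = ["肺炎", "高血压"]
normalized = to_standard_term("肺部感染", toy_vectors, standard)
```

In practice the vectors would come from the word2vec model of step 3, and a minimum similarity threshold could be added to reject poor matches.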

6. Identifying the patient's intention type according to the constructed feature word banks of the different question types in the medical vertical field;

this step identifies the type of the patient's consultation intention, which may be 'disease-related symptom consultation', 'disease medication consultation', 'suitable/unsuitable food consultation', 'prevention consultation' and so on; the specific steps are as follows:

6.1 medical question answering is a relatively specialized field in which patient consultations all revolve around disease symptoms and the body of knowledge is bounded; therefore, based on knowledge of the medical vertical field, a feature word bank can be defined exhaustively for each consultation type, for example a query feature word bank for the 'symptom' intention, ['symptoms', 'representations', 'phenomena', 'manifestations', 'reactions', 'symptoms', …], and a query feature word bank for the 'symptom cause' intention, ['how did this happen', 'reason', 'why', 'what causes', …];

6.2 the patient question is matched against the defined query-type feature word banks with a string matching algorithm to obtain the patient's intention type;

optionally, string matching can be implemented with the Python ahocorasick package, which builds a trie over the feature words and turns it into an Aho-Corasick automaton whose failure links, in the manner of the KMP algorithm, allow multiple pattern strings to be matched in one pass; alternatively, algorithms such as forward or reverse maximum matching can be used.
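To make the matching in step 6.2 concrete, the sketch below hand-builds a small Aho-Corasick automaton in pure Python. It is a stand-in for illustration only and does not use or mirror the API of the ahocorasick package; the feature words, intent labels and function names are assumptions.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and outputs for word -> intent."""
    goto, fail, output = [{}], [0], [[]]
    for word, intent in patterns.items():
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append((word, intent))
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_intents(text, automaton):
    """Scan the text once and return (feature word, intent) hits in order."""
    goto, fail, output = automaton
    state, hits = 0, []
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend(output[state])
    return hits


feature_words = {"症状": "symptom", "表现": "symptom",
                 "为什么": "cause", "什么药": "medication"}
automaton = build_automaton(feature_words)
hits = find_intents("高血压为什么会头晕，吃什么药", automaton)
```

A question can hit several word banks at once, so a real system would need a rule for choosing or combining intents; the patent text does not specify one.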

7. Generating a question analysis result according to the standard disease symptom words and the patient's intention type; this step parses the patient's consultation question by combining the entity identified in step 5 with the patient's intention type obtained in step 6. For example, for the patient question 'I have hypertension; what foods are recommended?', the entity 'hypertension' is extracted through step 5 and the intention type 'food suitable for eating' is obtained through step 6, so the generated analysis result is 'hypertension' plus 'food suitable for eating', that is, a search for foods suitable for patients with hypertension; the process then proceeds to step 8.

8. Constructing a knowledge graph from a pre-constructed medical disease knowledge base; this step builds the medical knowledge graph in which answers are searched according to the parsed patient question, and the specific steps are as follows:

8.1 Data collection stage. Data can be collected from sources such as medical record evaluation data and medical books published by medical websites, Baidu Encyclopedia, authoritative medical institutions and research units. The related entities of each item must be collected at the same time, to serve as the entity connections from which the knowledge graph is built. For example, in the disease encyclopedia of a medical consultation website, each disease page carries related disease knowledge: the page for hepatitis B also lists its etiology, prevention, complications, symptoms, examination items, susceptible population, medicines and so on, and these related entities are collected together with the current disease entity and stored in a database;

8.2 Defining entity relationships. The knowledge graph consists of entities and entity relationships, and each entity has its own attributes. The entity relationships are defined over the entity data obtained in step 8.1; for example, the relationship between the disease entity 'hepatitis B' and the department entity 'infectious diseases' is 'belongs to', meaning that hepatitis B is treated in the infectious diseases department, and the relationship between the disease entity 'hepatitis B' and the entity 'supportive treatment' is 'treatment mode', meaning that hepatitis B can be treated with supportive treatment;

8.3 Knowledge graph construction. Using the entities obtained in step 8.1 and the entity relationships defined in step 8.2, the entities become the nodes of the graph and the entity relationships become the edges connecting them; a neo4j graph database can be used to store the knowledge graph.
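One way to load such entity relationships into neo4j is to turn each triple into a parameterized Cypher MERGE statement, as sketched below. The node labels (`Disease`, `Department`, `Treatment`), relationship types and the `name` property are hypothetical schema choices, not taken from the patent, and the sketch only builds the statements without connecting to a database.

```python
import re

# Labels and relationship types cannot be passed as Cypher parameters,
# so they are validated before being interpolated into the statement.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def merge_statement(head_label, rel_type, tail_label):
    """Build a Cypher statement that creates both nodes and the edge if absent."""
    for identifier in (head_label, rel_type, tail_label):
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"unsafe identifier: {identifier!r}")
    return (f"MERGE (h:{head_label} {{name: $head}}) "
            f"MERGE (t:{tail_label} {{name: $tail}}) "
            f"MERGE (h)-[:{rel_type}]->(t)")


def triples_to_cypher(triples):
    """Map (head_label, head, rel_type, tail_label, tail) to (statement, params)."""
    return [(merge_statement(hl, rel, tl), {"head": h, "tail": t})
            for hl, h, rel, tl, t in triples]


statements = triples_to_cypher([
    ("Disease", "hepatitis B", "BELONGS_TO", "Department", "infectious diseases"),
    ("Disease", "hepatitis B", "TREATMENT_MODE", "Treatment", "supportive treatment"),
])
```

Using MERGE rather than CREATE keeps the load idempotent, so the same disease collected from two websites produces a single node.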

9. The question analysis result is converted into a corresponding query statement, and the corresponding answer is queried from the knowledge graph.

The specific implementation process comprises the following steps:

The analysis result of the patient question obtained in step 7 is converted into the query language of the neo4j graph database: a MATCH statement of Cypher is used to search the graph stored in neo4j, and the answer is assembled from the data returned by the query and returned to the patient. For example, for the patient question "What medicine can be taken for hypertension", the returned answer is "Medicines suggested for hypertension include: indapamide dripping pills, amlodipine besylate capsules";
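The conversion from a parsed question to a Cypher MATCH query might look like the following sketch. The shape of the parse result (an `entity` and an `intent` field), the `INTENT_MAP` table, and the relation and label names are all assumptions for illustration; the rows passed to `assemble_answer` stand in for data that a neo4j client would return.

```python
# Hypothetical mapping from intent type to (relation, target label, answer template).
# None of these names come from the source.
INTENT_MAP = {
    "recommend_drug": ("RECOMMENDED_DRUG", "Drug",
                       "Medicines suggested for {entity} include: {items}"),
    "diet": ("RECOMMENDED_FOOD", "Food",
             "Foods suggested for {entity} include: {items}"),
}


def build_query(parse_result):
    """Return a parameterized Cypher MATCH query and its parameters."""
    relation, label, _ = INTENT_MAP[parse_result["intent"]]
    query = (
        f"MATCH (d:Disease {{name: $name}})-[:{relation}]->(t:{label}) "
        "RETURN t.name AS answer"
    )
    return query, {"name": parse_result["entity"]}


def assemble_answer(parse_result, rows):
    """Fill the intent's answer template with the names returned by the query."""
    _, _, template = INTENT_MAP[parse_result["intent"]]
    if not rows:
        return f"No answer was found for {parse_result['entity']}."
    return template.format(entity=parse_result["entity"], items=", ".join(rows))


parse_result = {"entity": "hypertension", "intent": "recommend_drug"}
query, params = build_query(parse_result)
print(query, params)
print(assemble_answer(parse_result,
                      ["indapamide dripping pills", "amlodipine besylate capsules"]))
```

Passing the entity name as the `$name` parameter, instead of formatting it into the query text, avoids quoting problems when a disease name contains special characters.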

It should be particularly noted that, after the first round of questioning, the patient may omit the central entity in subsequent questions. For example, in the first round the patient asks "I have hypertension, and recently I have felt chest tightness and palpitations; what foods should I pay attention to in my daily diet?" After the question-answering system returns the recommended foods, the patient may continue with "Are there any dietary restrictions for that?", where what the patient really wants to ask is "What are the dietary restrictions for hypertension?" At this point the system should perform coreference resolution: "that" in the second-round question actually refers to the entity "hypertension", so the question is rewritten as "What are the dietary restrictions for hypertension?" and the second round of question answering is then carried out. Optionally, the entity from the first round can be saved in memory: if no new entity is extracted from the second-round question, coreference resolution is performed; if a new entity is extracted, the second round of question answering is carried out directly without this processing.
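The optional memory-based handling described above can be sketched as follows. This is a simplified illustration, not the source's implementation: the class name `DialogueMemory`, the English pronoun list and the fallback of prefixing the remembered entity are assumptions, and entity extraction (step 4) is assumed to have been run already.

```python
import re


class DialogueMemory:
    """Remember the last entity and substitute it for a pronoun when none is given."""

    def __init__(self, pronouns=("that", "it", "this")):
        self.last_entity = None
        # Whole-word, case-insensitive match for any of the assumed pronouns.
        self._pattern = re.compile(r"\b(" + "|".join(pronouns) + r")\b", re.IGNORECASE)

    def resolve(self, question, entities):
        """Return the question, rewritten only if no new entity was extracted."""
        if entities:
            # A new entity was extracted: remember it and answer directly.
            self.last_entity = entities[0]
            return question
        if self.last_entity is None:
            return question
        rewritten, count = self._pattern.subn(
            lambda match: self.last_entity, question, count=1
        )
        if count == 0:
            # No pronoun found: prefix the remembered entity (illustrative fallback).
            rewritten = f"{self.last_entity}: {question}"
        return rewritten


memory = DialogueMemory()
memory.resolve("I have hypertension, what should I eat?", ["hypertension"])
print(memory.resolve("Are there foods to avoid with that?", []))
```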

Description and explanation of some of the terms:

1> Long question sentence

This term appears in step 1.1. A long question (long, difficult sentence) means a patient question that is relatively colloquial and long, which makes it inconvenient for a program to process; compression is therefore applied. A question that exceeds a character-count threshold (set in the program; 25 characters is currently used) is divided into several short sentences at punctuation marks. The short sentences are classified in step 1.2: colloquial short sentences irrelevant to medical treatment are deleted, short sentences containing content such as the patient's disease are retained, and the compressed patient question is finally formed.
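The splitting and filtering described above can be sketched as follows. The 25-character threshold comes from the text; the punctuation set, the keyword-based `is_medical` stand-in (the source uses a trained classifier in step 1.2) and the fallback to the original question are assumptions for illustration.

```python
import re

LENGTH_THRESHOLD = 25  # character threshold stated in the description

# Assumed punctuation set (Western and Chinese marks) used to split clauses.
PUNCTUATION = r"[,.;!?，。；！？、]+"

# Hypothetical keyword list standing in for the step 1.2 classifier.
MEDICAL_KEYWORDS = ("hypertension", "pain", "cough")


def is_medical(clause: str) -> bool:
    """Stand-in for the medical / non-medical short-sentence classifier."""
    return any(keyword in clause.lower() for keyword in MEDICAL_KEYWORDS)


def split_long_question(question: str, threshold: int = LENGTH_THRESHOLD):
    """Split a question into short sentences only if it exceeds the threshold."""
    if len(question) <= threshold:
        return [question]
    parts = re.split(PUNCTUATION, question)
    return [part.strip() for part in parts if part.strip()]


def compress_question(question: str, classifier=is_medical) -> str:
    """Keep only the short sentences the classifier marks as medically relevant."""
    clauses = split_long_question(question)
    if len(clauses) == 1:
        return question
    kept = [clause for clause in clauses if classifier(clause)]
    # Fall back to the original question if nothing is kept (illustrative choice).
    return ", ".join(kept) if kept else question


print(compress_question(
    "Hello doctor, I have had hypertension for years, thank you very much"
))
```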

2> convolution operation

This term appears in step 1.2.2 and is a mathematical operation. The calculation is as follows: a convolution kernel (a 1 x N-dimensional vector) is moved from top to bottom over the input-layer vector (the vectorized patient question), covering a window equal to the kernel's length at each position; each dimension of the input window is multiplied by the corresponding dimension of the kernel, and the products are summed to give one convolution value. After the kernel has moved over the complete input layer, the convolution values form a vector (the feature vector).
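A minimal sketch of this sliding multiply-and-sum operation is shown below. It assumes a stride of 1 and no padding, which the text does not state, and the function name `convolve_1d` is illustrative.

```python
def convolve_1d(inputs, kernel):
    """Slide the kernel over the input and sum the element-wise products.

    Assumes stride 1 and no padding, so the output has
    len(inputs) - len(kernel) + 1 convolution values.
    """
    width = len(kernel)
    return [
        sum(x * w for x, w in zip(inputs[start:start + width], kernel))
        for start in range(len(inputs) - width + 1)
    ]


feature_vector = convolve_1d([1, 2, 3, 4], [1, 0, -1])
print(feature_vector)
```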

3> splicing

This term appears in step 1.2.3: the 1 x 1 values are spliced into a 1 x N-dimensional vector. The mathematical expression is [x1, x2, x3, …, xn], where each xn is a 1 x 1 value produced by the pooling computation.
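The splicing step can be illustrated with the sketch below, in which each feature vector is reduced to a single value and the values are concatenated. Max pooling is assumed here as the pooling computation; this passage does not say which pooling is used, and the function names are illustrative.

```python
def pool(feature_vector):
    """Reduce one feature vector to a single 1 x 1 value (max pooling assumed)."""
    return max(feature_vector)


def splice(feature_vectors):
    """Concatenate the pooled values into one 1 x N vector [x1, x2, ..., xn]."""
    return [pool(feature_vector) for feature_vector in feature_vectors]


spliced = splice([[1, 5, 2], [0, -1], [3, 3, 4]])
print(spliced)
```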

4> Neural language model: the model uses the word2vec algorithm published by the Google team in 2013 as its training model. (1) The model is used in step 3; its function is to vectorize the patient question so that a computer program can process it, and the result also serves as the input of the next step. (2) The model is a neural network with a single hidden layer, formally consisting of an input layer, a hidden layer and an output layer. The input layer is a numerical high-dimensional vector of the patient's Chinese question. The hidden layer is an array of neurons, each of which performs a differential (derivative) operation between the output value and the input value; for example, if 100 iterations are set, a parameter value is obtained after 100 derivative computations, and this parameter value is multiplied by the input vector to give the numerical output of the output layer. The output layer is also a high-dimensional vector whose dimension equals the size of the vocabulary described in point (3); it is the numerical vector that represents the patient question. (3) Source of the vocabulary used in model training: as in step 1-2-1, online patient consultation data collected from medical websites such as Chunyu Doctor (Spring Rain Doctor) and Haodf (Good Doctor) are used as the data set; each piece of data is segmented with the Chinese word segmentation tool jieba (an existing tool is reused) and deduplicated to obtain the vocabulary. The vocabulary assigns each word an ordered label from 1 to N, so that an input patient question can be converted into numbers according to each word's position in the vocabulary.
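Point (3) can be illustrated with a small sketch that builds a deduplicated vocabulary with labels 1 to N and converts a segmented question into numbers. Word segmentation (done with jieba in the text) is assumed to have been performed already, and the function names and the use of 0 for unknown words are assumptions for illustration.

```python
def build_vocabulary(segmented_sentences):
    """Assign each distinct word a label from 1 to N in order of first appearance."""
    vocabulary = {}
    for words in segmented_sentences:
        for word in words:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def to_indices(words, vocabulary, unknown=0):
    """Convert a segmented question to vocabulary positions (0 for unknown words)."""
    return [vocabulary.get(word, unknown) for word in words]


# Already-segmented sample data; a real data set would come from consultation records.
corpus = [
    ["hypertension", "eat", "what", "medicine"],
    ["hepatitis B", "what", "symptom"],
]
vocabulary = build_vocabulary(corpus)
print(vocabulary)
print(to_indices(["hypertension", "what", "symptom", "fever"], vocabulary))
```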

5> feature word bank

This term appears in step 6. The feature word bank can be constructed from: (1) disease knowledge bases on medical websites such as Haodf (Good Doctor); (2) the data set described in step 1-2-1, from which disease inquiry terms and disease terminology are extracted on the basis of existing medical knowledge. The two are combined to construct the feature word bank, whose content includes, for example, "symptom", "characterization" and "response". Each word is called a feature word. The word bank is used to match feature words in the patient question so that the program can output the patient's inquiry intention; the matching is done by the character string matching method specified in step 6-2.
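How a feature word bank could drive intent identification is sketched below with plain substring matching. The intent names and word lists are hypothetical, and the simple scan stands in for the character string matching method of step 6-2.

```python
# Hypothetical feature word bank: intent type -> feature words.
FEATURE_WORDS = {
    "symptom": ["symptom", "characterization", "response"],
    "drug": ["medicine", "drug"],
}


def identify_intents(question, feature_words=FEATURE_WORDS):
    """Return every intent type that has at least one feature word in the question."""
    text = question.lower()
    return [
        intent
        for intent, words in feature_words.items()
        if any(word in text for word in words)
    ]


print(identify_intents("What medicine can be taken for hypertension"))
```

For a large word bank, a multi-pattern matcher would avoid rescanning the question once per feature word.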

6> medical disease knowledge base

This term appears in step 8. The knowledge base can be constructed as described in step 8-1: disease and symptom terminology collected from medical websites (Chunyu Doctor, Haodf, etc.), encyclopedias and the like, by means such as web crawlers (implemented in Python) and manual recording, is stored and used as the knowledge base.

Parts of the embodiment that are not specifically described are implemented using models or algorithms commonly used in the computer field.

The above description is only a preferred embodiment of the present invention and does not limit the invention to other forms. Any person skilled in the art may use the technical content disclosed above to make modifications or variations that result in equivalent embodiments. Any simple modification, equivalent change or variation made to the above embodiments according to the technical essence of the present invention, without departing from its technical content, still falls within the protection scope of the technical solution of the present invention.

Claims

1. An online medical question-answering method, characterized in that the steps are as follows:

(4) extracting entity words related to disease symptoms;

(5) converting the extracted entity words into corresponding standard words;

2. The online medical question-answering method according to claim 1, characterized in that: after the patient question is received in step (1), it is divided into several short sentences at punctuation marks, and the short sentences are further classified into sentences relevant to medical treatment and colloquial sentences irrelevant to medical treatment;

step (3) is specifically performed as follows: a language model is trained using the word2vec algorithm; the rewritten question is input, and a vector of vocabulary size is output, in which the value of each dimension is the predicted probability of the next word given the current input word; the acquired patient question-and-answer data from medical websites is used as the training data set of the model; the parameters of the model are set, including the number of training iterations and the dimensionality of the word vectors; after the word vector of each word is obtained, a sentence vector is generated: the word set is obtained according to step (2), the number of words is denoted n, the word vectors are denoted [v₁, v₂, v₃, …, vₙ], and the sentence vector is denoted s, whereby the sentence vector is obtained;

wherein vₙ represents the value of each dimension of the vector;

professional disease symptom expression words and their corresponding synonyms are crawled from the knowledge bases of medical websites, and identical words are deduplicated; words with the same meaning but different expressions are merged as synonyms using the Python Chinese synonym toolkit Synonyms, forming a disease symptom knowledge base; among the entity words extracted in step (4), those that are already standard expression words are retained, and for the remaining non-standard words a similarity calculation method is applied against the standard words to obtain the standard word expression with the highest similarity;

step (6) defines feature word banks for the different question types by exhaustive enumeration, and uses a character string matching algorithm for matching to obtain the patient's intention type;

step (8) comprises data collection, entity relation definition and knowledge graph construction, wherein the data are collected from medical record evaluation data and medical book data sources published by medical websites, Baidu Baike, medical institutions and research units; the knowledge graph is constructed by taking entities as the nodes of the graph and entity relations as the edges connecting entities, and a neo4j graph database is used to store the graph;

step (9) converts the question analysis result obtained in step (7) into the query language of the neo4j graph database, uses a MATCH statement of Cypher to search the graph stored in neo4j, and assembles the answer from the data returned by the query.

3. The online medical question-answering method according to claim 2, characterized in that: the convolution kernels of the convolution layer have three sizes, 2 x 2, 3 x 3 and 4 x 4, and the number of convolution kernels of each size is 128; the output layer uses any one of the classification algorithms softmax, sigmoid and svm; the calculation formula is

wherein p_i is the predicted probability value for each department category, and e^j is the value of each dimension of the output vector.

4. The online medical question-answering method according to claim 2, characterized in that: the deep learning algorithm in step (4) is the sequence labeling model BiLSTM-CRF, which takes as input the patient consultation question expressed as a sentence vector and outputs a labeling result; the first layer, the input layer, is the vector representation of the consultation question; the second layer uses a bidirectional LSTM neural network to extract the sequential features of the question, and where the data volume is too large or the efficiency too low, a variant of the LSTM network is used instead; the third layer, the output layer, is a conditional random field layer, which takes the features extracted by the LSTM as the observation sequence and outputs the state sequence with the maximum probability; complete entities and their corresponding entity types are extracted from the output state sequence according to the BIO rule.

5. The online medical question-answering method according to claim 2, characterized in that: the similarity calculation method in step (5) adopts any one of word2vec, cosine similarity calculation and the TF-IDF algorithm; and the character string matching in step (6) adopts any one of the ahocorasick package of Python and the forward or reverse maximum matching algorithm.