CN113284627A

CN113284627A - Medication recommendation method based on patient characterization learning

Info

Publication number: CN113284627A
Application number: CN202110406631.3A
Authority: CN
Inventors: 朱振峰; 徐慕豪; 刘俊秀; 葛欣宜
Original assignee: Beijing Jiaotong University; Peking University Third Hospital Peking University Third Clinical Medical College
Current assignee: Beijing Jiaotong University; Peking University Third Hospital Peking University Third Clinical Medical College
Priority date: 2021-04-15
Filing date: 2021-04-15
Publication date: 2021-08-20
Anticipated expiration: 2041-04-15
Also published as: CN113284627B

Abstract

The invention provides a medication recommendation method based on patient characterization learning, which comprises the following steps: extracting data from electronic medical records and representing the unstructured chief complaint text in the data as structured data; performing characterization learning on the structured data with a stacked sparse autoencoder to obtain a low-dimensional symptom vector for the patient symptom data and a low-dimensional medication vector for the medication information data; analyzing the low-dimensional representations of the patient symptom data and the medication information data with a clustering algorithm to obtain the symptom characteristics and the medication characteristics of the patients in each cluster; performing canonical correlation analysis on the symptom characteristics and the medication characteristics of the patients in each cluster to obtain the association relationship between them; and predicting the recommended medication according to the association relationship with a distance-weighted-average K-nearest-neighbor algorithm. The method can accurately recommend medication for a patient according to the electronic medical record and improves the working efficiency of doctors.

Description

Medication recommendation method based on patient characterization learning

Technical Field

The invention relates to the technical field of medical informatization, in particular to a medication recommendation method based on patient characterization learning.

Background

In recent years, with the continuous development of computers and information technology, the medical information industry of China is gradually built and perfected, wherein the treatment records of patients are changed from original paper materials to digital electronic medical records. Compared with other countries in the world, China starts to build electronic medical records later. However, as the medical and health system is receiving more attention, in recent years, the government of China has developed a plurality of policies to support the construction and development of medical informatization.

The electronic medical record is an important source of data in medical informatization. It covers a large amount of medical and health information from all of a patient's medical activities and therefore has great research value. First, for patients, mining the information in electronic medical records helps them manage their own health. The electronic medical record holds the patient's past diagnoses and health condition; if this information can be extracted and analyzed, it can provide reference and prediction for the patient's physical condition. Analyzing and mining a patient's electronic medical record data also makes it possible to find similar patients in large data sets and to use the condition of patients with similar symptoms as a reference. Second, for doctors, mining the information in electronic medical records can improve medical efficiency. Computers can process large numbers of electronic medical records with methods such as natural language processing and machine learning and, in particular, can use the text in the records to assist medical staff in diagnosis and treatment, improving doctors' decision-making and the efficiency of patient treatment.

Electronic medical records contain not only structured data but also a large amount of unstructured image, signal and text information, and this unstructured data holds the most valuable information in the record. Current medication recommendation systems are generally limited to the numerical and structured data in patients' electronic medical records. However, recommending medication from structured data alone yields low recommendation accuracy and makes it difficult to meet patients' individual medication needs. In addition, traditional manual feature extraction not only consumes a great deal of manpower but also demands considerable professional knowledge.

Therefore, a medication recommendation method that addresses the insufficient use of unstructured information is needed.

Disclosure of Invention

The invention provides a medication recommendation method based on patient characterization learning, which aims to overcome the defects in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme.

A medication recommendation method based on patient characterization learning, comprising:

extracting data from the electronic medical record, wherein the data comprises unstructured complaint text information and structured data;

representing unstructured complaint text information in the data as structured data;

performing characterization learning on the structured data with a stacked sparse autoencoder to obtain a low-dimensional symptom vector representing the patient symptom data and a low-dimensional medication vector representing the medication information data;

analyzing the low-dimensional representation of the patient symptom data and the low-dimensional representation of the medication information data with a clustering algorithm to obtain the symptom characteristics and the medication characteristics of the patients in each cluster;

performing canonical correlation analysis on the symptom characteristics and the medication characteristics of the patients in each cluster to obtain the association relationship between them;

and predicting the recommended medication according to the association relationship with a distance-weighted-average K-nearest-neighbor algorithm.

Preferably, the method for representing unstructured complaint text information in the data as structured data comprises the following steps:

performing word segmentation on the unstructured chief complaint text information: processing the text with a word segmentation tool, calculating the mutual information values between words, identifying fixed collocations in the chief complaint text according to the mutual information values, and constructing a custom dictionary, thereby completing the word segmentation of the chief complaint;

and extracting information from the segmentation result: comparing the segmented words one by one with the standard texts in the hospital's symptom library; if a word matches, the extraction is complete; if it does not match, searching for the symptom word text corresponding to the segmented word with a search-engine-based word similarity calculation method, to obtain structured data.

Preferably, searching for the symptom word text corresponding to the segmented word with the search-engine-based word similarity calculation method comprises the following specific steps:

for a processed word p and a symptom library standard text q, where Q is the set of standard texts in the symptom library related to p, using a crawler to obtain the number of search results returned by the page when the two words are searched separately and together, recorded as N(p), N(q) and N(p∧q);

calculating the word similarity according to the following formula (1):

and sequentially calculating the similarity between the word p and every standard text in Q; if the maximum similarity exceeds a first set threshold, assigning the word p to the corresponding standard text to obtain structured data.

Preferably, the stacked sparse autoencoder is formed by connecting two layers of simple autoencoders, and the hidden layer dimensions of the two layers of simple autoencoders are 8 dimensions and 4 dimensions respectively.

Preferably, performing characterization learning on the structured data with the stacked sparse autoencoder comprises:

the loss function of the stacked sparse autoencoder is the mean squared error, and the sparsity constraint is introduced by adding an L2 regularization term, as shown in the following formula (2):

the hidden layer activation function is the ReLU function shown in the following formula (3):

$f(x) = \max(0, x)$ (3)

the reconstruction layer activation function is the Softplus function shown in the following formula (4):

$f(x) = \log(1 + e^{x})$ (4)

where J is the loss function of the model, $x_i$ is the i-th input vector, N is the number of input samples, f and g are the deep neural networks of the encoding stage and the decoding stage of the autoencoder, respectively, α is the regularization coefficient, and w denotes each parameter in the model.
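The image for formula (2) is not reproduced in this text. One form consistent with the symbol definitions above (mean squared reconstruction error plus an L2 penalty) would be the following; the $1/N$ normalization and the sum-of-squares form of the penalty are assumptions rather than the patent's exact expression.

```latex
J = \frac{1}{N} \sum_{i=1}^{N} \left\lVert x_i - g\bigl(f(x_i)\bigr) \right\rVert_2^{2} + \alpha \sum_{w} w^{2}
```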

Preferably, the low dimensional representation of the patient symptom data and the low dimensional representation of the medication information data are analyzed using a clustering algorithm to obtain the symptom characteristic and the medication characteristic of the patient within each population cluster, including:

taking the sum of squared errors SSE as a core index, taking symptom vectors of all patients as a training set, and obtaining the optimal clustering number by using a heuristic elbow rule;

concatenating each patient's low-dimensional symptom vector and low-dimensional medication vector into a combined vector; treating the combined vectors of all patients as a single cluster and dividing it into two clusters with the K-Means clustering algorithm; calculating the SSE values of the two clusters; and repeatedly dividing the cluster with the larger SSE value into two clusters with the K-Means algorithm until the optimal cluster number is reached;

and counting the obtained original data information of the patients in each cluster group to obtain the symptom characteristics and the corresponding medication characteristics of the patients in each cluster group.
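The splitting procedure described above corresponds to what is commonly called bisecting K-Means. A minimal NumPy sketch follows, under stated assumptions: the function names (`sse`, `two_means`, `bisecting_kmeans`), the deterministic initialization of the two centers, and the fixed iteration count are all illustrative choices and are not taken from the patent.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def two_means(points, n_iter=20):
    """Split one cluster into two with plain K-Means (k = 2).

    Assumed deterministic init: the point farthest from the mean,
    and the point farthest from that first center.
    """
    d0 = ((points - points.mean(axis=0)) ** 2).sum(axis=1)
    c0 = points[d0.argmax()]
    c1 = points[((points - c0) ** 2).sum(axis=1).argmax()]
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def bisecting_kmeans(points, n_clusters):
    """Repeatedly split the cluster with the largest SSE until n_clusters exist."""
    points = np.asarray(points, dtype=float)
    clusters = [np.arange(len(points))]          # start with one cluster of all samples
    while len(clusters) < n_clusters:
        worst = max(range(len(clusters)), key=lambda i: sse(points[clusters[i]]))
        idx = clusters.pop(worst)
        labels = two_means(points[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    out = np.empty(len(points), dtype=int)
    for k, idx in enumerate(clusters):
        out[idx] = k
    return out
```

In use, `points` would hold one combined symptom-plus-medication vector per patient and `n_clusters` would be the cluster number chosen by the elbow rule.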

Preferably, the method uses a heuristic elbow rule to obtain the optimal cluster number by taking the sum of squared errors as a core index and the symptom vectors of all patients as a training set, and specifically comprises the following steps:

taking the sum of squared errors in the following formula (5) as the core index, clustering the symptom vectors of all patients with different cluster numbers, with each patient's symptom vector as a sample point; calculating the SSE value for each cluster number; plotting the SSE value against the cluster number; and taking the cluster number at the elbow of the curve, namely the position of highest curvature, as the optimal cluster number:

$$SSE = \sum_{i=1}^{c} \sum_{u \in C_i} \lVert u - m_i \rVert^{2} \tag{5}$$

where u is a sample point, C is the set of clusters in the partition, c is the number of clusters in the partition, $C_i$ denotes the i-th cluster, and $m_i$ is the mean of all samples in $C_i$.

Preferably, performing canonical correlation analysis on the symptom characteristics and the medication characteristics of the patients in each cluster to obtain the association relationship between them comprises:

normalizing the symptom-characteristic sample set $X \in R^{r \times n}$ and the medication-characteristic sample set $Y \in R^{s \times n}$ of the same cluster so that the data have mean 0 and variance 1, where r and s are the dimensions of each symptom characteristic and each medication characteristic, respectively;

selecting several groups of linearly uncorrelated projection vectors in the two sample sets, the vectors in each group being $a \in R^{r}$ and $b \in R^{s}$, and projecting X and Y to X′ and Y′, respectively, i.e. $X' = a^{T}X$, $Y' = b^{T}Y$; taking as the optimization objective the maximization of the correlation coefficient ρ of X′ and Y′, and solving this constrained optimization problem according to the Lagrangian function shown in the following formula (6) to obtain several groups of linear combinations and the corresponding correlation coefficients as the association relationship between the symptom characteristics and the medication characteristics of the patients in each cluster, where $X' \in R^{v \times n}$, $Y' \in R^{v \times n}$, and v is the number of linear combinations:

where $S_{XY} = \mathrm{cov}(X, Y)$.
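The canonical correlation step can be sketched in NumPy as follows. This is a minimal illustration under assumptions: it solves the problem through a singular value decomposition of the whitened cross-covariance, a standard equivalent of the Lagrangian route, and the function names and the `eps` floor on eigenvalues are hypothetical choices, not the patent's.

```python
import numpy as np

def standardize(M):
    """Scale each row (feature) to mean 0 and variance 1; constant rows stay 0."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=1, keepdims=True)
    std = centered.std(axis=1, keepdims=True)
    return centered / np.where(std == 0, 1.0, std)

def inv_sqrt(S, eps=1e-8):
    """Inverse square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.maximum(vals, eps)      # assumed floor to avoid dividing by zero
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca(X, Y, v):
    """X is (r, n), Y is (s, n); columns are patients of one cluster.

    Returns A (r, v), B (s, v) and the v largest correlation coefficients.
    """
    Xs, Ys = standardize(X), standardize(Y)
    n = Xs.shape[1]
    Sxx, Syy, Sxy = Xs @ Xs.T / n, Ys @ Ys.T / n, Xs @ Ys.T / n
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, rho, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    A = Kx @ U[:, :v]                 # columns are the projection vectors a
    B = Ky @ Vt[:v].T                 # columns are the projection vectors b
    return A, B, rho[:v]
```

The projected data are then `A.T @ standardize(X)` and `B.T @ standardize(Y)`, each of shape (v, n), matching the layout of X′ and Y′ in the text.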

Preferably, according to the association relationship, a weighted distance average K-nearest neighbor algorithm is used to predict recommended medication, including:

finding, with the K-nearest-neighbor algorithm, the K groups of projection vectors nearest to the sample within the cluster to which the sample belongs, using the distance formula of the following formula (7):

obtaining the k groups of projection vectors adjacent to the patient sample $x_a$ to be confirmed, $X' = \{x_1', x_2', \ldots, x_k'\}$, $X' \in R^{v \times k}$, where v, the number of linear combinations, is the dimensionality of the canonical correlation vectors; obtaining through X′ the corresponding original chief complaint data X and medication data Y that have not been processed by the autoencoder; and calculating the average medication value by inverse-distance weighting of the nearest neighbors according to the following formula (8):

determining whether the medication data Y is recommended according to the relationship between the average medication value and a second set threshold,

wherein the coefficients are processed so that they sum to 1, each weight is the inverse-distance weight of the i-th projection vector, $D_i$ is the distance between the patient sample $x_a$ and the i-th projection vector, and $Y_i$ is the i-th original medication data.

Preferably, determining whether the medication data Y is recommended according to the relationship between the average medication value and the second set threshold comprises: judging whether the average medication value is greater than the second set threshold; if so, recommending the medication data Y and setting the medication result to 1; if not, not recommending the medication data Y and setting the medication result to 0, wherein the second set threshold is obtained through training on a training set.

It can be seen from the above technical scheme that the medication recommendation method based on patient characterization learning extracts features from electronic medical record data, structures the data and reduces its dimensionality; analyzes the representations of patient symptoms and medication information with a clustering algorithm; establishes the association relationship between chief complaint symptoms and medication with canonical correlation analysis; and predicts medication recommendations for patients with a distance-weighted-average K-nearest-neighbor algorithm, providing reference opinions for doctors during diagnosis.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a flow chart of a medication recommendation method based on patient characterization learning according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a medication recommendation method based on patient characterization learning according to this embodiment;

FIG. 3 is a schematic diagram illustrating a medication recommendation method based on patient characterization learning according to this embodiment;

FIG. 4 is a diagram illustrating a structure of a stacked sparse self-encoder according to the present embodiment;

FIG. 5 is a schematic diagram of a sample information extraction;

FIG. 6 is a graph of symptom characteristics and medication characteristics results for patient clusters;

FIG. 7 is a graph showing how the accuracy, precision and F1 value of the method on the allergic rhinitis test set vary with the threshold.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

For the convenience of understanding of the embodiments of the present invention, the following description will be further explained by taking specific embodiments as examples with reference to the drawings, and the embodiments of the present invention are not limited thereto.

Examples

Fig. 1 is a schematic flow chart of a medication recommendation method based on patient characterization learning according to an embodiment of the present invention, fig. 2 and fig. 3 are schematic diagrams of a medication recommendation method based on patient characterization learning according to an embodiment of the present invention, and referring to fig. 1, fig. 2 and fig. 3, the method includes the following steps:

S1 extracts data from the electronic medical record.

The extracted data includes unstructured complaint text information and structured data.

S2 represents the unstructured complaint text information in the data as structured data.

Since the electronic medical record contains a large amount of unstructured text information, in order to mine the value of the unstructured text information, the unstructured text information needs to be converted into structured data which can be utilized by a computer.

Fairly mature word segmentation tools already exist in the field of Chinese natural language processing, such as Jieba, THULAC (THU Lexical Analyzer for Chinese, Tsinghua University) and ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System, Chinese Academy of Sciences). However, because medical vocabulary is specialized, these general-purpose tools cannot segment chief complaint text accurately. In this embodiment, word segmentation of the unstructured chief complaint text information specifically comprises: processing the text with a word segmentation tool, calculating the mutual information values between words, identifying fixed collocations in the chief complaint text according to the mutual information values, and constructing a custom dictionary, thereby completing the word segmentation of the chief complaint. Specifically, after records with missing chief complaint information are removed from the original data set, the basic Jieba tool is first used in precise mode to segment the chief complaint information as accurately as possible; then the mutual information value of each pair of adjacent words is calculated and the pairs are sorted in descending order, suitable fixed-collocation new words are selected by setting a threshold, the new words are added to a custom dictionary, and the chief complaint text is segmented again. The operation process is as follows:

cleaning the original electronic medical record data set and segmenting the chief complaint text with the Jieba word segmentation tool, so that each chief complaint text becomes a list of words;

constructing nodes from the segmentation results with a 2-gram (bigram) model, storing the results in a Trie (dictionary tree) and counting word frequencies;

for two words X and Y, where P(X) and P(Y) are the probabilities of the two words and P(X, Y) is the probability of the two words appearing adjacently, calculating the mutual information value according to the following formula (1):

sorting the mutual information values in descending order, setting a threshold and selecting suitable fixed-collocation new words; then constructing a custom dictionary from the new words and segmenting the chief complaint text again.
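The mutual-information step can be illustrated with a short Python sketch. Several things here are assumptions: the image for formula (1) is not reproduced in this text, so the usual pointwise form $\log\frac{P(X,Y)}{P(X)P(Y)}$ is used; plain counters replace the Trie for brevity; and the input is assumed to be already segmented (for example by Jieba), so the function names `adjacent_pmi` and `new_words` are hypothetical.

```python
import math
from collections import Counter

def adjacent_pmi(token_lists):
    """Pointwise mutual information for every pair of adjacent words."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))   # 2-gram counts
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

def new_words(token_lists, threshold):
    """Candidate fixed collocations, sorted by descending mutual information."""
    ranked = sorted(adjacent_pmi(token_lists).items(),
                    key=lambda item: item[1], reverse=True)
    # Chinese words are joined without a separator
    return [x + y for (x, y), score in ranked if score > threshold]
```

The returned candidates would then be added to the segmenter's custom dictionary before the chief complaint text is segmented again.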

In this embodiment, the search-engine-based word similarity calculation uses Baidu search to find the symptom word text corresponding to each segmented word. The specific operation steps are as follows:

for the processed word p and the symptom library standard text q,

the word similarity is calculated according to the following formula (2):

S3, characterization learning is performed on the structured data with a stacked sparse autoencoder to obtain a low-dimensional symptom vector representing the patient symptom data and a low-dimensional medication vector representing the medication information data.

The loss function of the stacked sparse autoencoder is the mean squared error, and the sparsity constraint is introduced by adding an L2 regularization term, as shown in the following formula (3):

the hidden layer activation function is the ReLU function shown in the following formula (4):

$f(x) = \max(0, x)$ (4)

the reconstruction layer activation function is the Softplus function shown in the following formula (5):

$f(x) = \log(1 + e^{x})$ (5)

An autoencoder reconstructs the original input data through its hidden layer and, in doing so, learns a compressed low-dimensional representation of the input while minimizing the error between the input and the output. The stacked sparse autoencoder improves on the basic autoencoder: several autoencoders are stacked and learn layer by layer, so that more complex encodings and deeper features of the original input can be learned. A regularization term is added to suppress neurons and to prevent the network from memorizing the data and overfitting. Fig. 4 is a schematic structural diagram of the stacked sparse autoencoder of this embodiment. Referring to fig. 4, the stacked sparse autoencoder is formed by connecting two simple autoencoders whose hidden layers have 8 and 4 dimensions, respectively.
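The 8- and 4-dimensional structure can be sketched as a forward pass in NumPy. This is an illustration under assumptions, not the patent's implementation: the two stacked autoencoders are unrolled into one 16-8-4-8-16 network, ReLU is applied to every hidden layer and Softplus only to the final reconstruction layer, the loss uses a $1/N$ mean of squared errors, and training (gradient updates) is omitted. All function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    return np.logaddexp(0.0, x)       # log(1 + e^x), numerically stable

def init_params(dims, rng):
    """dims such as [16, 8, 4]: encoder 16->8->4, mirrored decoder 4->8->16."""
    layer_dims = dims + dims[-2::-1]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(layer_dims, layer_dims[1:])]

def forward(params, X, n_enc):
    """Return the low-dimensional code and the reconstruction of X."""
    h, code = X, None
    for i, (W, b) in enumerate(params):
        z = h @ W + b
        is_last = i == len(params) - 1
        h = softplus(z) if is_last else relu(z)
        if i == n_enc - 1:
            code = h                  # output of the last encoder layer
    return code, h

def loss(params, X, alpha, n_enc):
    """Mean squared reconstruction error plus an L2 penalty on the weights."""
    _, recon = forward(params, X, n_enc)
    mse = np.mean(np.sum((X - recon) ** 2, axis=1))
    l2 = sum((W ** 2).sum() for W, _ in params)
    return mse + alpha * l2
```

With `dims = [16, 8, 4]` and `n_enc = 2`, `forward` maps a batch of 16-dimensional 0/1 symptom rows to 4-dimensional codes; the medication data would use a separate network with its own input dimension.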

S4, the low-dimensional representation of the patient symptom data and the low-dimensional representation of the medication information data are analyzed with a clustering algorithm to obtain the symptom characteristics and the medication characteristics of the patients in each cluster.

After the autoencoder processing, cluster analysis is performed on the low-dimensional symptom data and the low-dimensional medication data, and the clustering result is used to draw a portrait of each patient group. The original data of the patients in each cluster are then analyzed according to the clustering result, mainly in terms of the symptom characteristics and the medication characteristics of the patients in each cluster; this group portrait also provides a reference and basis for the subsequent medication recommendation.

Taking the sum of squared errors (SSE) as the core index and the symptom vectors of all patients as the training set, the optimal cluster number is obtained with the heuristic elbow rule, which specifically comprises the following steps:

taking the sum of squared errors in the following formula (6) as the core index, clustering the symptom vectors of all patients with different cluster numbers, with each patient's symptom vector as a sample point; calculating the SSE value for each cluster number; plotting the SSE value against the cluster number; and taking the cluster number at the elbow of the curve, namely the position of highest curvature, as the optimal cluster number:

$$SSE = \sum_{i=1}^{c} \sum_{u \in C_i} \lVert u - m_i \rVert^{2} \tag{6}$$

where u is a sample point, c is the number of clusters, $C_i$ denotes the i-th cluster, and $m_i$ is the mean of all samples in $C_i$.
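The elbow rule is described as a visual inspection of the SSE curve. A minimal Python sketch of one way to automate it follows; using the largest discrete second difference as a stand-in for "highest curvature" is an assumption made here, as are the function names, and it presumes equally spaced cluster numbers.

```python
import numpy as np

def sse(points, labels):
    """Sum of squared distances of every sample to its own cluster mean."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

def elbow(ks, sse_values):
    """Cluster number where the SSE curve bends most sharply.

    The bend is measured by the discrete second difference, an assumed
    proxy for curvature; ks must be equally spaced.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = sse_values[i - 1] - 2 * sse_values[i] + sse_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k
```

Here `sse_values[i]` would be the SSE obtained after clustering the symptom vectors into `ks[i]` clusters.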

S5, canonical correlation analysis is performed on the symptom characteristics and the medication characteristics of the patients in each cluster to obtain the association relationship between them.

The method specifically comprises the following steps:

normalizing the symptom-characteristic sample set $X \in R^{r \times n}$ and the medication-characteristic sample set $Y \in R^{s \times n}$ of the same cluster so that the data have mean 0 and variance 1, where r and s are the dimensions of each symptom characteristic and each medication characteristic, respectively;

selecting several groups of linearly uncorrelated projection vectors in the two sample sets, the vectors in each group being $a \in R^{r}$ and $b \in R^{s}$, and projecting X and Y to X′ and Y′, respectively, i.e. $X' = a^{T}X$, $Y' = b^{T}Y$; the optimization objective is to maximize the correlation coefficient ρ of X′ and Y′. Letting $S_{XY} = \mathrm{cov}(X, Y)$, the criterion function can be written as:

$$\rho = \frac{a^{T} S_{XY} b}{\sqrt{a^{T} S_{XX} a}\,\sqrt{b^{T} S_{YY} b}}$$

In this formula the result does not change when the numerator and the denominator are scaled by the same factor, so the problem is converted into maximizing the numerator with the denominator fixed, as shown in the following formula (7):

$$\max_{a,b}\; a^{T} S_{XY} b \quad \text{s.t.}\quad a^{T} S_{XX} a = 1,\; b^{T} S_{YY} b = 1 \tag{7}$$

solving this constrained optimization problem for the maximum correlation coefficient ρ according to the Lagrangian function shown in the following formula (8) yields several groups of linear combinations and the corresponding correlation coefficients as the association relationship between the symptom characteristics and the medication characteristics of the patients in each cluster, where $X' \in R^{v \times n}$, $Y' \in R^{v \times n}$, and v is the number of linear combinations:

where $S_{XY} = \mathrm{cov}(X, Y)$.
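The image for formula (8) is not reproduced in this text. For the constrained problem (7), the Lagrangian usually takes the form below; the multiplier names $\lambda$ and $\theta$ and the factor $1/2$ are conventions assumed here, not taken from the patent.

```latex
L(a, b, \lambda, \theta) = a^{T} S_{XY} b
  - \frac{\lambda}{2}\left(a^{T} S_{XX} a - 1\right)
  - \frac{\theta}{2}\left(b^{T} S_{YY} b - 1\right)
```

Under this form, setting the gradients with respect to a and b to zero gives $S_{XY} b = \lambda S_{XX} a$ and $S_{XY}^{T} a = \theta S_{YY} b$; left-multiplying by $a^{T}$ and $b^{T}$ and applying the constraints shows $\lambda = \theta = a^{T} S_{XY} b = \rho$.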

S6, the recommended medication is predicted according to the association relationship with the distance-weighted-average K-nearest-neighbor algorithm.

The K groups of projection vectors nearest to the sample within the cluster to which the sample belongs are found with the K-nearest-neighbor algorithm, using the distance formula of the following formula (9):

The k groups of projection vectors adjacent to the patient sample $x_a$ to be confirmed are obtained, $X' = \{x_1', x_2', \ldots, x_k'\}$, $X' \in R^{v \times k}$, where v, the number of linear combinations, is the dimensionality of the canonical correlation vectors; through X′, the corresponding original chief complaint data X and medication data Y that have not been processed by the autoencoder are obtained, and the average medication value is calculated by inverse-distance weighting of the nearest neighbors according to the following formula (10):

Whether the medication data Y is recommended is determined according to the relationship between the average medication value and a second set threshold, wherein the coefficients are processed so that they sum to 1.

The specific judging steps are as follows: judge whether the average medication value is greater than the second set threshold; if so, recommend the medication data Y and set the medication result to 1; if not, do not recommend the medication data Y and set the medication result to 0. The second set threshold is obtained through training on a training set.
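The prediction step can be sketched in NumPy as below. The images for formulas (9) and (10) are not reproduced in this text, so two assumptions are made and should be read as such: the distance is taken to be Euclidean, and each weight is $1/D_i$ normalized so that the weights sum to 1. The function name `recommend`, the row-wise data layout, and the small constant guarding against a zero distance are also illustrative.

```python
import numpy as np

def recommend(x_a, neighbors_proj, neighbors_med, k, threshold):
    """Inverse-distance-weighted K-nearest-neighbor medication vote.

    x_a            : (v,)   projected vector of the patient to be confirmed
    neighbors_proj : (n, v) projected vectors of patients in the same cluster
    neighbors_med  : (n, m) original 0/1 medication rows of those patients
    threshold      : scalar or (m,) array, the second set threshold
    """
    x_a = np.asarray(x_a, dtype=float)
    proj = np.asarray(neighbors_proj, dtype=float)
    med = np.asarray(neighbors_med, dtype=float)
    dist = np.linalg.norm(proj - x_a, axis=1)     # assumed Euclidean distance
    nearest = np.argsort(dist)[:k]
    weights = 1.0 / (dist[nearest] + 1e-12)       # inverse-distance weights
    weights = weights / weights.sum()             # normalized to sum to 1
    mean_med = weights @ med[nearest]             # average medication value
    return mean_med, (mean_med > threshold).astype(int)
```

For example, with neighbors at distances 1 and 2 the normalized weights are 2/3 and 1/3, so a drug taken only by the closer neighbor gets an average value of 2/3 and is recommended at a threshold of 0.5.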

A concrete example using real data on allergic rhinitis diagnoses from the otorhinolaryngology outpatient department of a hospital in Beijing is given below:

The electronic medical records used in this embodiment are real records of allergic rhinitis diagnoses from the otorhinolaryngology outpatient department of a hospital in Beijing and consist of three major parts: basic information, chief complaint information and outpatient information. The basic information includes the patient number, the number of visits and the patient's gender. The chief complaint information is the doctor's concise summary of the patient's symptoms and signs, including their nature, duration and severity. The outpatient information includes the visit time, the doctor's diagnosis and the medication advice. Allergic rhinitis is a common respiratory disease that is usually diagnosed from the patient's symptoms and medical history, with allergen testing performed only in a few cases. Therefore, the part of the electronic medical record data set of greatest interest is the chief complaint text, which contains the key information on which the doctor bases the diagnosis and the medication orders.

Because the data set contains only patients diagnosed with allergic rhinitis, the medical vocabulary used in the chief complaints is relatively fixed. After the text is processed by a general-purpose word segmentation tool, the mutual information between words is calculated, fixed collocations in the chief complaint text are identified from the mutual information values, and a custom dictionary is constructed, which completes the word segmentation of the chief complaints.
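One way to score candidate collocations is pointwise mutual information (PMI) over adjacent tokens. The sketch below assumes that form of mutual information and simple frequency estimates; the text does not specify the exact variant. The function names, the toy English tokens and the `min_pmi` cutoff are hypothetical.

```python
import math
from collections import Counter

def adjacent_pmi(sentences):
    """PMI of every adjacent token pair in pre-segmented sentences."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (left, right), count in bigrams.items():
        p_pair = count / n_bi
        p_left = unigrams[left] / n_uni
        p_right = unigrams[right] / n_uni
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores

def fixed_collocations(sentences, min_pmi):
    """Pairs whose PMI reaches min_pmi; candidates for a custom dictionary."""
    scores = adjacent_pmi(sentences)
    return sorted(pair for pair, value in scores.items() if value >= min_pmi)

# Toy output of a general-purpose segmenter (illustrative tokens only).
segmented = [
    ["nasal", "obstruction", "sneezing"],
    ["nasal", "obstruction", "headache"],
    ["sneezing", "headache"],
]
print(fixed_collocations(segmented, min_pmi=2.0))
```

Pairs that pass the cutoff would be added to the custom dictionary so that a second segmentation pass keeps them as single words. PMI is known to favor rare pairs, so a minimum count filter may also be useful in practice.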

After word segmentation, information is extracted from the chief complaint text using the 16 symptom types provided by the hospital as standard texts. The chief complaint is a doctor's concise description of the patient's symptoms and is highly specialized, but because doctors have different habits, descriptions of the same thing can differ, ranging from slightly different wording to completely different expressions. For example, "epistaxis" and "nasal bleeding" have the same meaning, as do "thin nasal discharge", "white nasal discharge" and "watery nasal discharge". To extract the information in the chief complaints more accurately and comprehensively, words with similar meanings are identified through similarity calculation and grouped together. Because the method deals with Chinese words, a word similarity calculation method based on Baidu search is used, which treats the web as a corpus updated in real time and emphasizes the relatedness of word pairs. The method relies mainly on the number of query results returned by the search engine.

The 16 symptom types provided by the hospital and used as standard texts are: nasal obstruction, watery nasal discharge, purulent nasal discharge, watery nasal discharge, nasal itching, nasal dryness, sneezing, postnasal operation, headache, nasal hemorrhage, hyposmia, common cold, ear itching, postnasal drip, bloody nasal discharge, and itchy eyes. During information extraction, the segmented words of the chief complaint text are compared with the standards one by one. If a word matches one of the 16 symptom standards, extraction for that word is complete; if it does not, its similarity to each of the 16 symptom standards is calculated in turn, and if the maximum similarity exceeds a set threshold, the word is assigned to the corresponding standard. An example of information extraction is shown in Fig. 5. The patients' symptom information and medication information are all converted into structured information represented by the values 0 or 1. In total there are 3731 patient records, each containing 16-dimensional symptom information and 23-dimensional medication information.
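The matching procedure above (exact match first, then the most similar standard if it exceeds the threshold) can be sketched as follows. The similarity function is passed in as a parameter; the character-set Jaccard measure used here is only a stand-in, not the search-engine-based similarity of the method, and the three-item standard list and all names are hypothetical.

```python
def char_jaccard(a, b):
    """Stand-in similarity: Jaccard overlap of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

def match_symptom(word, standards, similarity, threshold):
    """Return the standard symptom for a word, or None if nothing is close enough."""
    if word in standards:
        return word  # exact match: extraction is finished for this word
    best = max(standards, key=lambda s: similarity(word, s))
    return best if similarity(word, best) > threshold else None

def encode_symptoms(words, standards, similarity, threshold):
    """Convert segmented chief-complaint words into a 0/1 symptom vector."""
    vector = [0] * len(standards)
    for word in words:
        matched = match_symptom(word, standards, similarity, threshold)
        if matched is not None:
            vector[standards.index(matched)] = 1
    return vector

# Shortened, illustrative standard list (the embodiment uses 16 types).
standards = ["nasal obstruction", "sneezing", "headache"]
words = ["sneezing", "headaches", "cough"]
print(encode_symptoms(words, standards, char_jaccard, threshold=0.5))
```

Here "sneezing" matches exactly, "headaches" is assigned to "headache" by similarity, and "cough" is discarded because no standard is similar enough.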

All the structured chief complaint symptom data and medication data were split 8:2 into a training set and a test set. Representation learning was performed on the training set data using a stacked sparse autoencoder to obtain a low-dimensional symptom vector for the patient symptom data and a low-dimensional medication vector for the medication information data, with each patient's symptom vector set to 4 dimensions and the medication vector set to 5 dimensions. Cluster analysis was then performed on the low-dimensional symptom data and low-dimensional medication data, and the clustering result was used to build patient population profiles. In total, 3 patient groups were obtained; the symptom characteristics and medication characteristics of the 3 clusters are shown in Fig. 6. Finally, canonical correlation analysis was performed on the patient symptom data and medication data processed by the autoencoder to obtain the association between the patients' symptom characteristics and medication characteristics, and the first 3 pairs of canonical variables were taken to form a 3-dimensional canonical correlation vector.
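The canonical correlation step can be illustrated with a standard whitening-plus-SVD solution in NumPy. This is a sketch of the general technique under stated assumptions, not the implementation used in the embodiment: the small ridge term `reg`, the function name and the synthetic data are hypothetical, and the 4-dimensional and 5-dimensional inputs only mirror the vector sizes mentioned above.

```python
import numpy as np

def cca(X, Y, n_pairs=3, reg=1e-8):
    """Canonical correlation analysis.

    X: (n, r) symptom vectors, Y: (n, s) medication vectors.
    Returns projection matrices A (r, n_pairs), B (s, n_pairs) and the
    corresponding canonical correlations in descending order.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Cholesky factors Sxx = Lx Lx^T and Syy = Ly Ly^T whiten the two views.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # M = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    U, rho, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U[:, :n_pairs])
    B = np.linalg.solve(Ly.T, Vt.T[:, :n_pairs])
    return A, B, rho[:n_pairs]

# Synthetic stand-in for autoencoder outputs: 4-d symptoms, 5-d medication.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = X @ rng.normal(size=(4, 5)) + 0.01 * rng.normal(size=(200, 5))
A, B, rho = cca(X, Y, n_pairs=3)
# 3-dimensional canonical correlation vector of the first patient.
print((X[0] - X.mean(axis=0)) @ A, rho)
```

A new patient's low-dimensional symptom vector would be centered with the training mean and multiplied by `A` to obtain its 3-dimensional canonical correlation vector.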

Medication recommendations were then made on the test set. The chief complaint of one patient was described as "intermittent bilateral nasal obstruction, clear watery nasal discharge, nasal itching, continuous sneezing and hyposmia for several months". Through information extraction, the values of nasal obstruction, clear watery nasal discharge, nasal itching, sneezing and hyposmia were set to 1 and the values of the other symptoms to 0. The patient's actual medications were Renokote, Compound Xanthium sibiricum tablets, cis-Er Ning and Aiseiping. The symptom data were processed by the autoencoder to obtain the low-dimensional symptom vector (1.850080848, 2.189118862, 3.87420392, 2.021232367). Using the projection vectors obtained on the training set, the canonical correlation vector was calculated as (-3.701395327, -0.676103465, -3.337884659). A K-nearest-neighbor algorithm was then used to find the K groups of symptom vectors in the training set results adjacent to this canonical correlation vector, and from them the K corresponding groups of original sample information, with K set to 24. The mean of the 24 groups of medication data was calculated with nearest-neighbor inverse-distance weighting, giving (0.6, 0.067, 1, 0, 0.2, 0.2, 0.533, 0, 0, 0, 0.133, 0.2, 0.4, 0, 0, 0, 0, 0, 0.067) as the medication scores of the test sample. If the score of a medication is greater than or equal to the set threshold, the medication is recommended and its result is set to 1; otherwise it is not recommended and its result is set to 0. With the threshold set to 0.4, four medications satisfy the condition, so the recommended medications are Renokote, Compound Xanthium sibiricum tablets, cis-Er Ning and Aiseiping, which is consistent with the actual prescription.
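The thresholding step of this worked example can be checked directly. The score vector below is copied as printed in the text, which lists 19 values; the mapping from positions to drug names is not given, so the sketch reports positions only. The variable names are hypothetical.

```python
# Medication scores of the test sample, as printed in the example.
scores = [0.6, 0.067, 1, 0, 0.2, 0.2, 0.533, 0, 0, 0,
          0.133, 0.2, 0.4, 0, 0, 0, 0, 0, 0.067]
threshold = 0.4

# A medication is recommended when its score is greater than or equal to the threshold.
result = [1 if s >= threshold else 0 for s in scores]
recommended_positions = [i for i, flag in enumerate(result) if flag == 1]

print(recommended_positions)  # zero-based positions of recommended medications
print(sum(result))            # number of recommended medications
```

Four scores (0.6, 1, 0.533 and 0.4) reach the threshold, which agrees with the four medications recommended in the example.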

The recommendation results were compared with the real results, and the variation of the accuracy (acc), precision (p) and F1 value on the test set with the threshold is shown in Fig. 7. When the threshold is 0.5, the accuracy on the test set is 90.15%, the precision is 84.86% and the F1 value is 0.4823, which shows that the method can accurately recommend most medications for allergic rhinitis patients.
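Accuracy, precision and F1 for 0/1 medication matrices can be computed as in the sketch below. It assumes the metrics are pooled over every patient-medication cell (micro-averaging); the text does not state which averaging the embodiment uses. The function name and the toy matrices are hypothetical.

```python
def multilabel_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 pooled over all patient-medication cells."""
    tp = fp = fn = tn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy example: two patients, four medications.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0]]
print(multilabel_metrics(y_true, y_pred))
```

Sweeping the recommendation threshold and recomputing these values would give curves of the kind described for Fig. 7.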

From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, a magnetic disk or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods of the embodiments or of parts of the embodiments.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A medication recommendation method based on patient characterization learning, comprising:

2. The method according to claim 1, wherein the representing unstructured complaint text information in the data as structured data comprises:

3. The method according to claim 2, wherein searching for the symptom word text corresponding to the segmented word by the search-engine-based word similarity calculation method comprises searching for the symptom word text corresponding to the segmented word by a word similarity calculation method based on Baidu search, with the following specific steps:

for the processed word p and the symptom-bank standard text q, the word similarity is calculated according to the following formula (1):

4. The method according to claim 1, wherein the stacked sparse autoencoder is formed by connecting two simple autoencoders, whose hidden layer dimensions are 8 and 4 respectively.

5. The method of claim 4, wherein the learning of the characterization of the structured data by using the stacked sparse self-encoder comprises:

f(x) = max(0, x) (3)

f(x) = log(1 + e^x) (4)

6. The method of claim 1, wherein analyzing the low dimensional representation of the patient symptom data and the low dimensional representation of the medication information data using a clustering algorithm to obtain the symptom characteristic and the medication characteristic of the patient within each cluster of the population comprises:

7. The method according to claim 6, wherein the method uses a heuristic elbow rule to obtain the optimal cluster number by using the sum of squared errors as a core index and the symptom vectors of all patients as a training set, and specifically comprises:

8. The method of claim 1, wherein the performing canonical correlation analysis on the symptom characteristic and the medication characteristic of the patients in each cluster of groups to obtain the association between the symptom characteristic and the medication characteristic of the patients in each cluster of groups comprises:

selecting a plurality of sets of linearly uncorrelated projection vectors in the two sample sets, the vectors in each set being a ∈ R^r and b ∈ R^s, and projecting X and Y onto X′ and Y′ respectively, i.e. X′ = a^T X, Y′ = b^T Y; the optimization objective is to maximize the correlation coefficient ρ between X′ and Y′; the constrained optimization problem of maximizing ρ is solved according to the Lagrangian function shown in the following formula (6), yielding a plurality of linear combinations and the corresponding correlation coefficients, which serve as the association between the symptom characteristics and the medication characteristics of the patients in each cluster; at this point X′ ∈ R^{v×n}, Y′ ∈ R^{v×n}, and v is the number of linear combinations:

wherein S_XY = cov(X, Y).

9. The method of claim 1, wherein said predicting recommended medication using a weighted distance average K-nearest neighbor algorithm based on said correlations comprises:

obtaining the k projection vectors X′ = {x₁′, x₂′, ..., x_k′}, X′ ∈ R^{v×k}, adjacent to the patient sample x_a to be assessed, where v is the number of linear combinations, i.e. the dimensionality of the canonical correlation vector; obtaining, through X′, the corresponding groups of original chief complaint data X and medication data Y that have not been learned by the autoencoder; and calculating the medication mean by nearest-neighbor inverse-distance weighting according to the following formula (8);

determining, according to the relationship between the medication mean and a second set threshold, whether the medication data Y is recommended;

wherein the coefficients are processed so that they sum to 1.

10. The method of claim 1, wherein determining whether the medication data Y is recommended according to the medication mean comprises: judging whether the medication mean is larger than a second set threshold; if so, recommending the medication data Y and setting the medication result to 1; if not, not recommending the medication data Y and setting the medication result to 0; the second set threshold being obtained by training on a training set.