CN109599185B

CN109599185B - Disease data processing method and device, electronic equipment and computer readable medium

Info

Publication number: CN109599185B
Application number: CN201811351658.1A
Authority: CN
Inventors: 丁浩洋
Original assignee: Golden Panda Ltd
Current assignee: Golden Panda Ltd
Priority date: 2018-11-14
Filing date: 2018-11-14
Publication date: 2021-05-25
Anticipated expiration: 2038-11-14
Also published as: CN109599185A

Abstract

The disclosure relates to a disease data processing method, a disease data processing device, an electronic device and a computer readable medium, and to the field of medical big data processing. The method comprises the following steps: obtaining disease data, wherein the disease data comprises at least one disease symptom label; performing word segmentation on the disease data to generate a vocabulary set; constructing a symptom set from the vocabulary set, the symptom set comprising at least one disease symptom tag; and inputting the symptom set into a diagnosis model to obtain a disease classification identifier, wherein the diagnosis model is an artificial neural network model. The disease data processing method, device, electronic equipment and computer readable medium can improve disease prediction accuracy and provide better decision support for a clinician's diagnosis.

Description

Disease data processing method and device, electronic equipment and computer readable medium

Technical Field

The present disclosure relates to the field of computer information processing, and in particular, to a disease data processing method and apparatus, an electronic device, and a computer readable medium.

Background

Hospital clinical data contain a large amount of disease data, which typically include the diagnosis of a patient's disease and the symptoms the patient presents. Disease data can reveal, from various aspects, the relationship between a patient's disease and disease symptom labels. How to mine disease data to provide decision support for clinicians' disease diagnosis is a current research topic.

At present, a common data mining method for disease data is the naive Bayes method: by counting the frequency of each disease diagnosis and the frequency of each single symptom in large-scale sample data, the prior probability of each diagnosis and the conditional probability of each single symptom are calculated, which yields the model parameters. When a symptom combination is input into the model, the distribution of disease diagnoses given that symptom combination can be predicted, which helps a clinician make decisions on the disease diagnosis. However, disease diagnosis by naive Bayes generally has low accuracy.

Therefore, a new disease data processing method, device, electronic device and computer readable medium are needed.

The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.

Disclosure of Invention

In view of this, the present disclosure provides a disease data processing method, a disease data processing apparatus, an electronic device, and a computer readable medium, which can improve disease prediction accuracy and provide better decision support for a clinician's diagnosis.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.

According to an aspect of the present disclosure, a disease data processing method is provided, which includes: obtaining disease data, wherein the disease data comprises at least one disease symptom label; performing word segmentation on the disease data to generate a vocabulary set; constructing a symptom set from a vocabulary set, the symptom set comprising at least one disease symptom tag; and inputting the symptom set into a diagnosis model to obtain a disease classification identifier, wherein the diagnosis model is an artificial neural network model.

In an exemplary embodiment of the present disclosure, the method further comprises: constructing the diagnosis model from historical disease data and an artificial neural network model.

In an exemplary embodiment of the present disclosure, constructing the diagnosis model from historical disease data and an artificial neural network model comprises: constructing a first data pair and a second data pair from historical disease data, the first data pair including at least one disease symptom tag and a diagnosis, the second data pair including a single disease symptom tag and a single diagnosis; generating a word embedding vector from the second data pair; and inputting the first data pair, the second data pair and the word embedding vector into an artificial neural network model, and obtaining the diagnosis model after training.

In an exemplary embodiment of the present disclosure, constructing the first data pair and the second data pair from the historical disease data comprises: generating a first data pair by performing word segmentation on historical disease data; and decomposing the first data pair to generate at least one second data pair.

In an exemplary embodiment of the present disclosure, generating a word embedding vector from the second data pair includes: constructing a diagnostic network from the second data pair, wherein objects in the data pair are used as vertices of the diagnostic network, and the relations between the objects are used as edges of the diagnostic network; and generating a word embedding vector from the diagnostic network through a network embedding technique.

In an exemplary embodiment of the disclosure, inputting the first data pair, the second data pair, and the word embedding vector into an artificial neural network model and obtaining a diagnosis model after training includes: taking the first data pair as training data of the artificial neural network model; using the second data pair as a label set of the artificial neural network model; taking the word embedding vector as a parameter of the embedding layer of the artificial neural network; and training the artificial neural network model with these settings to obtain the diagnosis model.

In an exemplary embodiment of the present disclosure, the artificial neural network model includes at least: a symptom embedding layer, a maximum pooling layer, and an affine transformation layer.

According to an aspect of the present disclosure, there is provided a disease data processing apparatus, the apparatus including: the data module is used for acquiring disease data, and the disease data comprises at least one disease symptom label; the word segmentation module is used for carrying out word segmentation on the disease data to generate a word set; a data pair module for constructing a symptom set from a vocabulary set, the symptom set including at least one disease symptom tag; and a result module, which is used for inputting the symptom set into a diagnosis model to obtain a disease classification identifier, wherein the diagnosis model is an artificial neural network model.

According to an aspect of the present disclosure, an electronic device is provided, the electronic device including: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as above.

According to an aspect of the disclosure, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.

According to the disease data processing method, device, electronic equipment and computer readable medium of the present disclosure, the artificial neural network can directly fit the conditional probability of a disease diagnosis given a symptom combination and use it for judgment. The first data pair is input into the model and a maximum pooling operation is performed over the symptom combination, so that the diagnosis model can effectively learn the dependence between symptom combinations and disease diagnoses and discard the unreasonable conditional independence assumption. In addition, semantic information of symptoms can be effectively captured through the symptom embedding layer in the diagnosis model, so that disease prediction accuracy is improved and better decision support is provided for a clinician's diagnosis.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.

Fig. 1 is a system block diagram illustrating a disease data processing method and apparatus according to an exemplary embodiment.

Fig. 2 is a flow chart illustrating a disease data processing method according to an example embodiment.

Fig. 3 is a flowchart illustrating a disease data processing method according to another exemplary embodiment.

Fig. 4 is a schematic diagram illustrating an artificial neural network in a disease data processing method according to another exemplary embodiment.

Fig. 5 is a block diagram illustrating a disease data processing apparatus according to an exemplary embodiment.

Fig. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.

Fig. 7 is a schematic diagram illustrating a computer-readable storage medium according to an example embodiment.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.

The inventors of the present application have found that naive Bayes is a classical statistical learning method. Its calculation principle is as follows:

P(Y | x₁, x₂, …) ∝ P(x₁, x₂, … | Y) · P(Y) = P(x₁ | Y) · P(x₂ | Y) · … · P(Y)

where Y represents a disease diagnosis and x₁, x₂, … represent a combination of symptoms.

According to Bayes' theorem, the naive Bayes method converts the problem of solving the posterior probability of a diagnosis given a known symptom combination into solving the prior probability of the diagnosis and the joint conditional probability of the symptom combination. The naive Bayes method further assumes that the symptoms are conditionally independent given the diagnosis, and decomposes the joint conditional probability of a symptom combination into the product of the conditional probabilities of the individual symptoms.

By counting the frequency of each disease diagnosis and the frequency of each single symptom in large-scale sample data, the prior probability of each diagnosis and the conditional probability of each single symptom are calculated, which yields the model parameters. By inputting a symptom combination into the model, the distribution of disease diagnoses given that symptom combination can be predicted.
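The counting procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method of any particular system: the function names, the toy records, and the add-alpha smoothing are assumptions introduced here, and only the symptoms that are present contribute factors, matching the product in the formula above.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(pairs):
    """Count diagnoses and, per diagnosis, how many records mention each symptom.

    pairs: iterable of (symptoms, diagnosis); symptoms is a list of symptom labels.
    """
    diagnosis_counts = Counter()
    symptom_counts = defaultdict(Counter)
    for symptoms, diagnosis in pairs:
        diagnosis_counts[diagnosis] += 1
        for symptom in set(symptoms):
            symptom_counts[diagnosis][symptom] += 1
    return diagnosis_counts, symptom_counts

def predict_naive_bayes(model, symptoms, alpha=1.0):
    """Return the normalized posterior P(Y | symptoms) as a dict."""
    diagnosis_counts, symptom_counts = model
    total = sum(diagnosis_counts.values())
    log_scores = {}
    for diagnosis, n in diagnosis_counts.items():
        log_p = math.log(n / total)  # log P(Y)
        for symptom in symptoms:
            # P(x_i | Y) with add-alpha smoothing (an assumption, avoids log(0))
            p = (symptom_counts[diagnosis][symptom] + alpha) / (n + 2 * alpha)
            log_p += math.log(p)
        log_scores[diagnosis] = log_p
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    unnormalized = {d: math.exp(s - top) for d, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {d: v / z for d, v in unnormalized.items()}

# Hypothetical toy records, for illustration only.
records = [
    (["abdominal pain", "vomiting", "anuria"], "calculus"),
    (["abdominal pain", "anuria"], "calculus"),
    (["cough", "fever"], "influenza"),
]
model = train_naive_bayes(records)
posterior = predict_naive_bayes(model, ["abdominal pain", "anuria"])
```

Because each symptom contributes an independent factor, the model cannot represent interactions between symptoms, which is the limitation discussed next.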

However, disease diagnosis based on naive Bayes suffers from the following drawbacks:

Naive Bayes is a generative model in machine learning terms: the conditional probability of a disease diagnosis given a symptom combination is calculated indirectly, and in practice the model accuracy is generally lower than that of a discriminative model;

The naive Bayes method assumes conditional independence between symptoms when calculating the joint conditional probability of a symptom combination. However, this conditional independence assumption does not hold; rather, there are strong dependencies between disease symptom labels;

The naive Bayes method cannot learn the semantic information of disease symptom labels.

For the above reasons, the inventors of the present application propose a disease data processing method that makes use of the massive number of < symptom combination, disease diagnosis > data pairs in clinical data and models them with an artificial neural network. The model can predict the distribution of disease diagnoses from a patient's symptom combination and provides decision support for clinicians.

Compared with the naive Bayes method, the artificial neural network directly fits the conditional probability of a disease diagnosis given a symptom combination, and is therefore a discriminative model; a maximum pooling operation is performed over the symptom combination in the model, so that the model can effectively learn the dependence between symptom combinations and disease diagnoses and the unreasonable conditional independence assumption is discarded; and a symptom embedding layer is added to the model, so that semantic information of symptoms can be effectively captured and the prediction capability of the model is greatly improved.

The following is a detailed description of the present application:

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a disease data analysis application, a web browser application, a search-type application, an instant messaging tool, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 105 may be a server that provides various services, such as a background management server that supports a disease analysis website browsed by a user using the terminal devices 101, 102, 103. The background management server may analyze and process received data such as a disease symptom tag analysis request, and feed back a processing result (e.g., disease probability, disease diagnosis analysis) to the terminal device.

Server 105 may, for example, obtain disease data including at least one disease symptom tag; the server 105 may, for example, perform word segmentation on the disease data to generate a vocabulary set; server 105 may construct a symptom set, e.g., from a vocabulary set, the symptom set including at least one disease symptom tag; the server 105 may, for example, input the symptom set into a diagnostic model, which is an artificial neural network model, to obtain the disease classification identification.

The server 105 may be a single physical server or may be composed of a plurality of servers. It should be noted that the disease data processing method provided by the embodiment of the present disclosure may be executed by the server 105, and accordingly the disease data processing apparatus may be disposed in the server 105. The web page front end provided for the user to perform data queries is generally located in the terminal devices 101, 102, 103.

Fig. 2 is a flow chart illustrating a disease data processing method according to an example embodiment. The disease data processing method 20 includes at least steps S202 to S208.

As shown in fig. 2, in S202, disease data including at least one disease symptom tag is acquired. Disease data may be obtained, for example, from a physician's input, and may include symptoms reported by the patient as well as symptoms confirmed by the physician after examination.

In S204, the disease data is subjected to word segmentation processing to generate a vocabulary set. For example, after word segmentation of a physician's diagnostic phrase to be processed, such as "lower ureter stone with obstruction", the generated vocabulary set may be: calculus, right side, ureter, lower segment, obstruction.

Chinese word segmentation refers to the segmentation of a sequence of Chinese characters into individual words. Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. Existing word segmentation algorithms can be divided into three major categories: methods based on character string matching, methods based on understanding, and methods based on statistics. Depending on whether they are combined with part-of-speech tagging, they can also be divided into pure word segmentation methods and integrated methods that combine word segmentation and tagging.

The character string matching method, also called the mechanical word segmentation method, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to a certain strategy; if a character string is found in the dictionary, the match succeeds (a word is identified). According to the scanning direction, string matching segmentation can be divided into forward matching and reverse matching; according to which length is matched preferentially, it can be divided into maximum (longest) matching and minimum (shortest) matching.
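As an illustration of the string matching approach, the following is a minimal sketch of forward maximum matching. The function name, the tiny dictionary, and the Chinese phrase (an assumed rendering of the ureteral stone example, with English glosses in the comments) are hypothetical; a real machine dictionary would be far larger.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary match.

    Characters that match no dictionary entry are emitted as single-character tokens.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical dictionary: right side, ureter, lower segment, calculus, obstruction.
dictionary = {"右侧", "输尿管", "下段", "结石", "梗阻"}
tokens = forward_max_match("右侧输尿管下段结石伴梗阻", dictionary)
# The connective "伴" ("with") is not in the dictionary and falls out as a single
# character; a downstream step could filter such tokens when building the vocabulary set.
```

Reverse matching follows the same idea but scans from the end of the string, and minimum matching would try the shortest candidate first.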

In the understanding-based method, the computer simulates a person's understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. Such a system generally comprises three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the word segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences to resolve segmentation ambiguity; that is, it simulates the process by which people understand sentences. This method requires a large amount of linguistic knowledge and information. Because of the generality and complexity of Chinese linguistic knowledge, it is difficult to organize the various kinds of language information into a form that a machine can read directly, so existing understanding-based word segmentation systems are still at an experimental stage.

In terms of form, a word is a stable combination of characters; therefore, in context, the more often adjacent characters appear together, the more likely they are to constitute a word. The frequency or probability with which characters co-occur with their neighbors can thus reflect fairly well how credible a candidate word is. The frequency of each combination of adjacent co-occurring characters in a corpus can be counted to calculate their co-occurrence information. A practical statistical word segmentation system uses a basic segmentation dictionary (a dictionary of common words) for string matching segmentation and at the same time uses statistical methods to identify new words. That is, string frequency statistics and string matching are combined, which retains the speed and efficiency of matching-based segmentation while gaining the ability of dictionary-free segmentation to recognize new words from context and resolve ambiguity automatically.
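The co-occurrence statistics mentioned above can be sketched as follows. The text does not specify which statistic is used; this example counts adjacent character pairs and, as one possible choice assumed here, also computes pointwise mutual information (PMI). The function name and the three-phrase corpus are hypothetical.

```python
import math
from collections import Counter

def adjacent_pair_statistics(corpus):
    """Count adjacent character pairs and compute a PMI score for each pair.

    corpus: iterable of strings. Returns (pair_counts, pmi_scores).
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())
    pmi_scores = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_pairs
        p_a = char_counts[a] / n_chars
        p_b = char_counts[b] / n_chars
        # PMI is one possible co-occurrence measure (an assumption of this sketch).
        pmi_scores[a + b] = math.log(p_ab / (p_a * p_b))
    return pair_counts, pmi_scores

# Hypothetical corpus: three short phrases that all contain "结石" (calculus).
corpus = ["结石伴梗阻", "输尿管结石", "肾结石"]
pair_counts, pmi_scores = adjacent_pair_statistics(corpus)
most_frequent_pair = pair_counts.most_common(1)[0][0]
```

On a corpus this small the raw pair frequency is the more informative signal; PMI becomes useful on larger corpora where frequent characters would otherwise dominate.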

In one embodiment, the disease data may be segmented, for example, by a string matching method to generate a plurality of segmented words, and the vocabulary set is generated from the plurality of segmented words. Word segmentation in the present application may also be performed using the statistical method or the understanding-based method described above, or using a combination of one or more of the string matching method, the understanding-based method, and the statistical method, which is not limited in the present application. The machine dictionary in the string matching method comprises: standard terms in the ICH international medical terminology dictionary; and medical professional vocabulary.

The ICH international medical terminology dictionary (MedDRA) was created under the initiative of the ICH and is a standard term set used by government and pharmaceutical regulatory departments and the biopharmaceutical industry in the various stages of clinical research before and after the administration of new drugs. The term set supports the encoding, retrieval and analysis of various clinical data, such as adverse events, medical and social history, indications and clinical examinations. Background information such as the reason for and history of the creation of MedDRA, the hierarchy of MedDRA terms, the rules and conventions of MedDRA, the application of MedDRA in data encoding and analysis, and the regulatory requirements of the governments of ICH participating countries and regions for the use of MedDRA is described herein. MedDRA is also called the "medical term dictionary for drug registration".

In S206, a symptom set is constructed from the vocabulary set, the symptom set including at least one disease symptom tag. The data in the vocabulary set is extracted to form a first data pair, which may for example be in the form of < symptom combination, disease diagnosis >, wherein the symptom combination includes at least one disease symptom tag.

The first data pair may for example be < (abdominal pain, vomiting, anuria), (calculus) >, where the symptom combination is abdominal pain, vomiting and anuria, and the disease diagnosis is calculus.

In S208, the symptom set is input into a diagnosis model to obtain a disease classification identifier, wherein the diagnosis model is an artificial neural network model. The diagnosis model is a model trained by an artificial neural network method, and a specific training method will be described later. For example, the disease symptom labels abdominal pain, vomiting and anuria are input into the diagnosis model, and the diagnosis model gives an auxiliary diagnosis result.

An Artificial Neural Network (ANN) is a complex Network structure formed by connecting a large number of processing units (neurons), and is an abstraction, simplification, and simulation of a human brain organization structure and an operation mechanism.

An artificial neural network may be multi-layer or single-layer. Each layer comprises a number of neurons, and the neurons are connected by directed arcs with variable weights. Through repeated learning and training on known information, the network gradually adjusts the connection weights of the neurons, and in this way processes information and models the relation between input and output. It does not need to know the exact relation between input and output and does not need a large number of parameters; it only needs to know the non-constant factors that cause the output to change, namely the non-constant parameters. Therefore, compared with traditional data processing methods, neural network technology has obvious advantages in processing fuzzy data, random data and nonlinear data, and is particularly suitable for systems that are large in scale, complex in structure and ambiguous in information.

For the disease diagnosis scenario in the embodiment of the application, the artificial neural network can directly fit the conditional probability of a disease diagnosis given a symptom combination and use it for judgment. The first data pair is input into the model and a maximum pooling operation is performed over the symptom combination, so that the diagnosis model can effectively learn the dependence between symptom combinations and disease diagnoses and discard the unreasonable conditional independence assumption. In addition, semantic information of symptoms can be effectively captured through the symptom embedding layer in the diagnosis model, and the prediction capability of the model is greatly improved.

According to the disease data processing method disclosed by the invention, the disease prediction accuracy can be improved, and a better auxiliary decision can be made for the diagnosis of a clinician.

It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.

As shown in fig. 3, in S302, a first data pair including at least one disease symptom tag and a diagnosis and a second data pair including a single disease symptom tag and a single diagnosis are constructed from historical disease data. For example, the historical disease data is subjected to word segmentation processing to generate a first data pair; and decomposing the first data pair to generate at least one second data pair.

As described above, disease data often includes the diagnosis of the patient's disease and the symptoms the patient presents. Disease data can reveal, from various aspects, the relationship between a patient's disease and disease symptom labels. Disease data can be obtained, for example, from electronic medical records, and diagnostic data can also be obtained, for example, from diagnostic reports.

Electronic medical record data is acquired directly from clinical diagnosis and treatment activities, and the acquisition process is integrated with the clinical diagnosis and treatment process, so that no additional burden is placed on medical staff. By using the data to analyze the onset characteristics and the diagnosis and treatment characteristics of a large number of diseases, the working efficiency of the medical personnel concerned can be greatly improved, and progress in disease prevention and treatment can be promoted efficiently and at low cost.

In one embodiment, the < symptom combination, disease diagnosis > data pairs present in the hospital's historical clinical data are extracted by word segmentation to construct first data pairs, and a plurality of the first data pairs may form a data set A. The data in the vocabulary set is extracted to form a first data pair, which may for example be in the form of < symptom combination, disease diagnosis >, wherein the symptom combination includes at least one disease symptom tag.

Each first data pair is further decomposed into a plurality of < single symptom, disease diagnosis > data pairs to construct second data pairs, and the plurality of second data pairs may form a data set B.

For the example first data pair above, there are three corresponding second data pairs: < abdominal pain, calculus >; < vomiting, calculus >; < anuria, calculus >.
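The decomposition of first data pairs into second data pairs can be sketched as below. The tuple representation and the names `decompose`, `dataset_a` and `dataset_b` are assumptions made for this illustration.

```python
def decompose(first_pair):
    """Split one (symptom combination, diagnosis) pair into (single symptom, diagnosis) pairs."""
    symptoms, diagnosis = first_pair
    return [(symptom, diagnosis) for symptom in symptoms]

# Data set A: first data pairs (only the example from the text is shown).
dataset_a = [
    (("abdominal pain", "vomiting", "anuria"), "calculus"),
]

# Data set B: all second data pairs obtained by decomposing data set A.
dataset_b = [pair for first_pair in dataset_a for pair in decompose(first_pair)]
```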

In S304, a word embedding vector is generated from the second data pair. This step comprises: constructing a diagnostic network from the second data pair, wherein objects in the data pair are used as vertices of the diagnostic network, and the relations between the objects are used as edges of the diagnostic network; and generating a word embedding vector from the diagnostic network through a network embedding technique.

In one embodiment, a single symptom and disease diagnosis graph is constructed using data set B, where the set of vertices is all single symptoms and disease diagnoses and the set of edges is an association of all single symptoms and corresponding disease diagnoses present in data set B. The word embedding vectors of the single symptom are learned by using the existing network embedding technology (such as LINE and the like).
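A minimal sketch of constructing such a graph from data set B follows. The text only states that an edge represents an association between a single symptom and a diagnosis; using the co-occurrence count as the edge weight, tagging vertices with their type, and the name `build_diagnostic_network` are assumptions of this example.

```python
from collections import Counter

def build_diagnostic_network(dataset_b):
    """Build a symptom-diagnosis graph from (single symptom, diagnosis) pairs.

    Vertices are tagged with their type so that a symptom and a diagnosis with the
    same name stay distinct. Edge weights count how often a pair occurs (assumed).
    """
    edge_weights = Counter()
    for symptom, diagnosis in dataset_b:
        edge_weights[(("symptom", symptom), ("diagnosis", diagnosis))] += 1
    vertices = set()
    for source, target in edge_weights:
        vertices.update((source, target))
    return vertices, edge_weights

# Hypothetical data set B; the repeated pair shows how weights accumulate.
example_pairs = [
    ("abdominal pain", "calculus"),
    ("vomiting", "calculus"),
    ("anuria", "calculus"),
    ("abdominal pain", "calculus"),
]
vertices, edge_weights = build_diagnostic_network(example_pairs)
```

The resulting weighted edge list is the kind of input that network embedding tools typically accept.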

Word embedding is a generic term for a set of language modeling and feature learning techniques in natural language processing, in which words in a vocabulary are mapped to real-valued vectors in a space whose dimension is low relative to the size of the vocabulary.

Network embedding is a technique that, when used for word embedding, learns the embedding vectors of words from the connection relations among the words.

LINE is an algorithm for embedding a large-scale information network into a low-dimensional vector space, and the algorithm is very effective in many fields such as visualization, node classification, and link prediction. This method is applicable to any type of information network, whether directed, undirected, or weighted. This method uses an optimized objective function for preserving both global and local network structure.

LINE uses an edge sampling algorithm that addresses the limitations of classical stochastic gradient descent and improves both the effectiveness and the efficiency of inference. Experiments demonstrate the effectiveness of the LINE algorithm on various real-world information networks, including language networks, social networks, and citation networks. The algorithm is very efficient and can learn the embedding of a network with millions of vertices and billions of edges in a few hours on a single machine.
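The idea of edge sampling can be sketched as follows. This is a heavily simplified illustration and not the LINE reference implementation: it models only first-order proximity, draws edges with `random.choices` instead of an alias table, and draws negative samples uniformly. The function name and all hyperparameters are assumptions chosen for the example.

```python
import math
import random


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow.
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def train_first_order(edge_weights, dim=8, steps=2000, lr=0.05, negatives=2, seed=0):
    """Learn node vectors by sampling edges in proportion to their weight."""
    rng = random.Random(seed)
    nodes = sorted({node for edge in edge_weights for node in edge})
    emb = {n: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for n in nodes}
    edges = list(edge_weights)
    weights = [edge_weights[e] for e in edges]
    for _ in range(steps):
        # Sampling an edge by weight lets every update use a unit-weight
        # gradient, regardless of how large the original weights are.
        u, v = rng.choices(edges, weights=weights, k=1)[0]
        samples = [(v, 1.0)]  # the observed edge is the positive example
        for _ in range(negatives):
            noise = rng.choice(nodes)
            if noise != u and noise != v:
                samples.append((noise, 0.0))  # random node as negative example
        for target, label in samples:
            score = sum(a * b for a, b in zip(emb[u], emb[target]))
            step = lr * (label - sigmoid(score))
            old_u = emb[u]
            emb[u] = [a + step * b for a, b in zip(old_u, emb[target])]
            emb[target] = [b + step * a for a, b in zip(old_u, emb[target])]
    return emb


edge_weights = {
    ("abdominal pain", "calculus"): 1,
    ("vomiting", "calculus"): 1,
    ("cough", "bronchitis"): 2,
}
embeddings = train_first_order(edge_weights, dim=4, steps=200)
```

In practice an existing LINE implementation would be used; the sketch only shows why sampling edges by weight removes the weight from the gradient step.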

Many network embedding techniques exist, and they continue to be improved. This scheme adopts LINE as the network embedding method; other methods may be substituted to complete the initialization of the artificial neural network.

In S306, the first data pair, the second data pair, and the word embedding vector are input into an artificial neural network model, and a diagnosis model is obtained through training. This step comprises: taking the first data pair as training data of the artificial neural network model; using the second data pair as a label set of the artificial neural network model; taking the word embedding vector as a parameter of the embedding layer of the artificial neural network; and training the artificial neural network model with these settings to obtain the diagnosis model.

The symptom word embedding vectors are learned by a network embedding method and are used as the initial values of the parameters of the symptom embedding layer of the artificial neural network.
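One way to turn the learned vectors into initial embedding-layer parameters is sketched below. The reserved `<pad>` row at index 0 and its all-zero value are assumptions made for illustration, as is the function name `init_embedding_layer`.

```python
def init_embedding_layer(symptom_vectors):
    """Build (vocab, matrix) from pre-trained symptom vectors.

    vocab maps each symptom to a row index; matrix holds the initial
    embedding-layer parameters. Row 0 is reserved for a padding token
    (a hypothetical convention, not taken from the source).
    """
    dim = len(next(iter(symptom_vectors.values())))
    vocab = {"<pad>": 0}
    matrix = [[0.0] * dim]
    for symptom in sorted(symptom_vectors):
        vocab[symptom] = len(matrix)
        matrix.append(list(symptom_vectors[symptom]))
    return vocab, matrix


pretrained = {"vomiting": [0.1, 0.2], "abdominal pain": [0.3, 0.4]}
vocab, matrix = init_embedding_layer(pretrained)
```

During training these rows remain trainable, so the network can refine them further.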

As shown in fig. 4, the artificial neural network model in this embodiment at least includes: a symptom embedding layer, a maximum pooling layer, and an affine transformation layer.

A plurality of first data pairs in data set A are input into the neural network model together; such batch training can accelerate the network training process. Inputting the first data pairs into the neural network model further includes a padding (placeholder) step, which aligns the data within each batch during batch training of the artificial neural network.
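The placeholder (padding) step can be illustrated as follows. The padding index `0` and the function name `pad_batch` are assumptions chosen for the example.

```python
PAD_ID = 0  # hypothetical index reserved for the padding token


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every symptom-id list to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


batch = [[3, 5, 2], [4]]
print(pad_batch(batch))  # [[3, 5, 2], [4, 0, 0]]
```

After padding, every sample in the batch has the same length, so the batch can be processed as one rectangular array.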

The word embedding vectors are loaded into the embedding layer of the neural network, and the plurality of first data pairs in data set A are processed by the embedding layer to generate an intermediate data set.

The intermediate data set is then processed by the maximum pooling layer of the artificial neural network to generate a symptom embedding vector.

The symptom embedding vector is mapped to the different disease diagnosis categories through an affine transformation.
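The forward computation through the three layers can be sketched in plain Python. The final softmax normalization is an assumption added so that the output reads as a distribution over diagnoses; the source names only the embedding, maximum pooling, and affine transformation layers. All function names and numeric values are illustrative.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over the symptom vectors of one sample."""
    return [max(column) for column in zip(*vectors)]


def forward(symptom_ids, embedding, weight, bias):
    vectors = [embedding[i] for i in symptom_ids]      # symptom embedding layer
    pooled = max_pool(vectors)                         # maximum pooling layer
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weight, bias)]         # affine transformation layer
    # Softmax (an assumption) turns the scores into a probability distribution.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


embedding = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # one row per symptom id
weight = [[1.0, 0.0], [0.0, 1.0]]                 # one row per diagnosis category
bias = [0.0, 0.0]
probs = forward([1, 2], embedding, weight, bias)
```

Because maximum pooling takes an element-wise maximum, the result does not depend on the order in which the symptoms are listed.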

According to the disease data processing method of the present disclosure, the artificial neural network is initialized with symptom embedding vectors that carry semantic information learned by the network embedding technique.

The structure of the artificial neural network in the present application may also be adjusted and optimized according to computational needs, but the basic framework remains consistent with the content disclosed in the present application.

According to the disease data processing method of the present disclosure, an artificial neural network is used to predict the distribution of disease diagnoses for a symptom combination. Training further optimizes the word embedding vectors so that they better capture the semantic information of symptoms. The prediction capability of the model is thereby improved, providing better decision support for the clinician's diagnosis.

There is a large amount of patient disease diagnosis and symptom combination data in hospital clinical data. The disease data processing method of the present disclosure uses <symptom combination, disease diagnosis> data pairs, which have a many-to-one mapping relation, and models them mathematically with an artificial neural network, so that the model can predict the distribution of disease diagnoses from a patient's symptom combination, thereby providing decision support for the clinician's disease diagnosis.

Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as a computer program executed by a CPU. When executed by the CPU, the program performs the functions defined by the above-described methods provided by the present disclosure. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.

Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.

Fig. 5 is a block diagram illustrating a disease data processing apparatus according to an exemplary embodiment. The disease data processing device 50 includes: a data module 502, a word segmentation module 504, a data pair module 506, and a results module 508.

The data module 502 is used for acquiring disease data, wherein the disease data comprises at least one disease symptom label and a diagnosis result. As described above, disease data often includes the patient's disease diagnosis and the symptoms the patient presents. Disease data can reveal the relationship between a patient's disease and the disease symptom labels from various aspects. Disease data can be obtained, for example, from electronic medical records, and diagnostic data can be obtained, for example, from diagnostic reports.

The word segmentation module 504 is configured to perform word segmentation on the disease data to generate a vocabulary set. The disease data can be segmented, for example, by a character string matching method to produce a plurality of segmented words, and the vocabulary set is generated from these segmented words.
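Forward maximum matching against a dictionary is one common string-matching approach to word segmentation; the source does not state which variant is used, so the sketch below is an assumption. The dictionary contents and the function name are illustrative.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    max_word_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary: abdominal pain, vomiting, anuria.
dictionary = {"腹痛", "呕吐", "无尿"}
tokens = forward_max_match("腹痛呕吐无尿", dictionary)
```

Characters that match no dictionary entry are emitted one at a time, so the function always terminates and never drops input.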

The data pair module 506 is used for constructing a first data pair from the vocabulary set, wherein the first data pair comprises at least one disease symptom label and a diagnosis result. Data in the vocabulary set is extracted to form the first data pair, which may for example take the form <symptom combination, disease diagnosis>, wherein the symptom combination includes at least one disease symptom label. The first data pair may for example be <(abdominal pain, vomiting, anuria), (calculus)>, where the symptom combination is abdominal pain, vomiting, and anuria, and the disease diagnosis is calculus.
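A minimal sketch of this extraction is shown below. Selecting words by membership in a symptom lexicon and a diagnosis lexicon is an assumption made for illustration; the source does not specify how symptom labels and diagnoses are recognized in the vocabulary set. All names are hypothetical.

```python
def build_first_pair(vocabulary, symptom_lexicon, diagnosis_lexicon):
    """Form a <symptom combination, disease diagnosis> pair from segmented words."""
    symptoms = tuple(word for word in vocabulary if word in symptom_lexicon)
    diagnoses = tuple(word for word in vocabulary if word in diagnosis_lexicon)
    return symptoms, diagnoses


vocabulary = ["patient", "abdominal pain", "vomiting", "anuria", "calculus"]
symptom_lexicon = {"abdominal pain", "vomiting", "anuria"}   # hypothetical lexicon
diagnosis_lexicon = {"calculus"}                             # hypothetical lexicon
first_pair = build_first_pair(vocabulary, symptom_lexicon, diagnosis_lexicon)
```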

The result module 508 is configured to input the first data pair into a diagnosis model to obtain a disease diagnosis result, where the diagnosis model is an artificial neural network model.

According to the disease data processing device of the present disclosure, the accuracy of disease prediction can be improved and better decision support can be provided for the clinician's diagnosis.

An electronic device 200 according to this embodiment of the present disclosure is described below with reference to fig. 6. The electronic device 200 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 6, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one storage unit 220, a bus 230 connecting different system components (including the storage unit 220 and the processing unit 210), a display unit 240, and the like.

The storage unit 220 stores program code executable by the processing unit 210, causing the processing unit 210 to perform the steps according to various exemplary embodiments of the present disclosure described in the above disease data processing method section of the present specification. For example, the processing unit 210 may perform the steps shown in fig. 2 and fig. 3.

The storage unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.

The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiments of the present disclosure.

Fig. 7 schematically illustrates a computer-readable storage medium in an exemplary embodiment of the disclosure.

Referring to fig. 7, a program product 400 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the functions of: acquiring disease data, wherein the disease data comprises at least one disease symptom label and a diagnosis result; performing word segmentation on the disease data to generate a vocabulary set; constructing a first data pair from the vocabulary set, the first data pair comprising at least one disease symptom label and a diagnosis result; and inputting the first data pair into a diagnosis model to obtain a disease diagnosis result, wherein the diagnosis model is an artificial neural network model.

Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus as described in the embodiments, or may be located, with corresponding changes, in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that the present disclosure is not limited to the precise structures, arrangements, or implementations described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

In addition, the structures, proportions, sizes, and the like shown in the drawings of the present specification are intended only to accompany the content disclosed in the specification so that those skilled in the art can understand and read it; they are not intended to limit the conditions under which the present disclosure can be implemented and therefore have no substantive technical significance. Any modification of a structure, change of a proportional relationship, or adjustment of a size should still fall within the scope covered by the technical content disclosed herein, provided it does not affect the technical effects the present disclosure can produce or the purposes it can achieve. In addition, terms such as "above", "first", "second" and "a" are used in the present specification for clarity of description only and are not intended to limit the scope of the present disclosure, and changes or adjustments of the relative relationships may be made without substantive changes to the technical content.

Claims

1. A method of disease data processing, comprising:

obtaining disease data, wherein the disease data comprises at least one disease symptom label;

performing word segmentation on the disease data to generate a vocabulary set;

constructing a symptom set from a vocabulary set, the symptom set comprising at least one disease symptom tag; and

inputting the symptom set into a diagnosis model to obtain a disease classification identifier, wherein the diagnosis model is an artificial neural network model;

the method further comprises the following steps:

constructing a first data pair and a second data pair from historical disease data, the first data pair including at least one disease symptom tag and a diagnosis, the second data pair including a single disease symptom tag and a single diagnosis; the second data pair is obtained by decomposing the first data pair;

generating a word embedding vector through the second data pair, wherein the word embedding vector is a word embedding vector of a single disease symptom; and

inputting the first data pair, the second data pair and the word embedding vector into an artificial neural network model, and obtaining the diagnosis model after training;

wherein the generating a word embedding vector by the second data pair comprises:

constructing a diagnostic network through the second data pair, wherein objects in the data pair are used as points of the diagnostic network, and the relation between the objects is used as an edge of the diagnostic network; and

generating a word embedding vector with the diagnostic network through a network embedding technique.

2. The method of claim 1, wherein constructing the first data pair and the second data pair from historical disease data comprises:

performing word segmentation processing on historical disease data according to an international medical word dictionary to generate a first data pair; and

the first data pair is decomposed according to disease symptom tags to generate at least one second data pair.

3. The method of claim 1, wherein the first data pair, the second data pair, and the word embedding vector are input into an artificial neural network model, and wherein obtaining a diagnostic model after training comprises:

taking the first data pair as training data of an artificial neural network model;

using the second data pair as a label set of an artificial neural network model;

taking the word embedding vector as a parameter of an artificial neural network embedding layer; and

training the artificial neural network model with these settings to obtain the diagnosis model.

4. The method of claim 1, in which the artificial neural network model comprises at least: a symptom embedding layer, a maximum pooling layer, and an affine transformation layer.

5. A disease data processing apparatus, characterized by comprising:

the data module is used for acquiring disease data, and the disease data comprises at least one disease symptom label;

the word segmentation module is used for carrying out word segmentation on the disease data to generate a vocabulary set;

a data pair module for constructing a symptom set from a vocabulary set, the symptom set including at least one disease symptom tag; and

a result module, configured to input the symptom set into a diagnosis model to obtain a disease classification identifier, where the diagnosis model is an artificial neural network model;

wherein the diagnostic model is obtained by:

the generating a word embedding vector by the second data pair comprises:

6. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.

7. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.