CN112017771B - Method and system for constructing disease prediction model based on semen routine inspection data - Google Patents
- Publication number
- CN112017771B (CN202010900071.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- semen
- disease
- knowledge base
- sample set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a method and a system for constructing a disease prediction model based on semen routine inspection data. The method comprises the following steps: acquiring semen biochemical examination data, immunological examination data and vital sign information of a sample population to form a first sample set; performing data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data to form a second sample set; dividing the second sample set into a training set and a verification set, and taking the training set as the input of a radial basis function neural network; and training the radial basis function neural network until the deviation between the output value and the true value is lower than a threshold value, thereby obtaining a disease prediction model. Using a sample set built from multiple data sources, the invention builds a machine learning model with a radial basis function (RBF) network to predict related diseases. It can serve as a learning and reference aid for primary-care doctors, facilitates early self-checking and prevention by patients, and has some value for popularization and application.
Description
Technical Field
The invention relates to the technical fields of intelligent medical treatment and medical information, relates to a method and a system for constructing a disease prediction model, and particularly relates to a method and a system for constructing a disease prediction model based on semen routine inspection data.
Background
Semen consists of sperm and seminal plasma; sperm accounts for 10 percent and the rest is seminal plasma. In addition to water, fructose, proteins and fats, it contains various enzymes and inorganic salts. Semen routine examination is a preliminary laboratory examination of the volume, nature and function of semen. It covers semen volume, color, viscosity, liquefaction time, sperm count, sperm motility, sperm morphology, semen cell examination, etc., and is mainly used for diagnosing male reproductive capacity and reproductive system diseases.
Immunological examination can determine whether autoimmunity is present, and chromosomal karyotyping can determine whether chromosomal abnormalities are present. Determination of serum FSH (follicle stimulating hormone), LH (luteinizing hormone), T (testosterone) and PRL (prolactin) is an important method for examining oligospermia and also helps to distinguish between primary and secondary testicular failure.
At present, diagnosing semen-related diseases relies on doctors with abundant experience and strong professional ability, together with multiple examinations, to arrive at an accurate diagnosis and treatment plan. When medical resources are scarce, a person to be tested or a patient usually has to go through a period of examination and waiting before all examination results are available, so the timeliness of the examination data is uncertain. This can delay the optimal time for diagnosis and even cause misdiagnosis, bringing psychological and economic loss to the patient.
On the other hand, primary medical institutions are in short supply, and the medical services they can provide are limited by their equipment and by the professional ability of their staff, so they cannot meet public demand.
Disclosure of Invention
In order to relieve the strain on medical resources and the physical examination burden of primary medical institutions, and to facilitate self-checking and prevention by patients and learning and reference by primary-care doctors, the invention provides a method for constructing a disease prediction model based on semen routine examination data, which comprises the following steps: acquiring semen biochemical examination data, immunological examination data and vital sign information of a sample population to form a first sample set; performing data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data to form a second sample set; dividing the second sample set into a training set and a verification set, and then taking the training set as the input of a radial basis function neural network; and training the radial basis function neural network until the deviation between the output value and the true value is lower than a threshold value, thereby obtaining a disease prediction model.
In some embodiments of the present invention, the data cleaning and standardization are performed on the first sample set according to the disease knowledge base corresponding to the semen routine inspection data, and the forming of the second sample set includes the following steps:
eliminating, according to a disease knowledge base, data in the semen biochemical examination data that do not conform to biological rules and contradictory data, normalizing the semen biochemical examination data, and mapping the semen biochemical examination data to [0,1].
In some embodiments of the present invention, the data cleaning and standardization are performed on the first sample set according to the disease knowledge base corresponding to the semen routine inspection data, and the forming of the second sample set includes the following steps:
eliminating, according to a disease knowledge base, data in the immunological examination data that do not conform to immunological rules and contradictory data, normalizing the semen biochemical examination data, and mapping the semen biochemical examination data to [0,1].
In some embodiments of the present invention, the data cleaning and standardization are performed on the first sample set according to the disease knowledge base corresponding to the semen routine inspection data, and the forming of the second sample set includes the following steps:
carrying out semantic similarity calculation on the vital sign information according to the disease knowledge base to obtain the corresponding characteristic values of the vital sign information, and eliminating data with low correlation with semen-related diseases.
In the above embodiment, the second sample set includes normalized semen biochemical test data and immunological test data, and characteristic values of vital sign information of the living body test.
In another aspect of the invention, a system for predicting a disease model based on semen routine inspection data is provided, which comprises an acquisition module, a storage module, a matching module, a calculation module and a prediction model, wherein the acquisition module is used for acquiring semen biochemical inspection data, immunological inspection data and vital sign information of a person to be tested; the storage module is used for storing a disease knowledge base corresponding to the semen routine inspection data; the calculation module is used for matching the semen biochemical examination data, the immunological examination data and the vital sign information of the living body detection of the testee with the disease knowledge base, and normalizing the semen biochemical examination data and the immunological examination data to obtain a feature vector of the testee; the prediction model is used for predicting the illness probability of the testee according to the feature vector.
In some embodiments of the present invention, the calculating module performs semantic similarity calculation on the detected sign information of the living body according to the disease knowledge base, so as to obtain a first feature vector.
Further, the calculating module calculates semantic similarity between the disease knowledge base and sign information of living body detection through Euclidean distance to obtain a second feature vector; and obtaining the feature vector of the person to be tested according to the first feature vector and the second feature vector.
In some embodiments of the present invention, the prediction model includes a model constructed by the method for constructing a disease prediction model based on semen routine inspection data provided in the first aspect of the present invention.
Further, the predictive model includes a trained radial basis function neural network.
The beneficial effects of the invention are as follows:
1. Based on a data set constructed from multiple data sources, the invention cleans and normalizes the data set and then constructs a machine learning model using a radial basis function (RBF) network; with this model, the probability that a person to be tested suffers from semen-related diseases can be predicted rapidly. The method can serve as a learning and reference aid for primary-care doctors, facilitates early prediction and prevention for patients, and has some value for popularization and application.
2. The invention adopts different screening and cleaning methods aiming at different attributes of various semen inspection data, improves the effectiveness and accuracy of the data, reduces the training error and training time of the model, and thus has better robustness.
Drawings
FIG. 1 is a basic flow chart of a method of constructing a disease prediction model based on semen routine inspection data in some embodiments of the invention;
FIG. 2 is a schematic structural diagram of a system for predicting disease based on semen routine inspection data in some embodiments of the invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings. The examples are given only to illustrate the invention and are not to be construed as limiting its scope.
The invention provides a method for constructing a disease prediction model based on semen routine inspection data, which comprises the following steps: S101, acquiring semen biochemical examination data, immunological examination data and sign information of living body detection of a sample population to form a first sample set; S102, carrying out data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data to form a second sample set; S103, dividing the second sample set into a training set and a verification set, and then taking the training set as the input of a radial basis function neural network; S104, training the radial basis function neural network until the deviation between the output value and the true value is lower than a threshold value, thereby obtaining a disease prediction model.
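Steps S103 and S104 can be illustrated with a minimal sketch. The patent does not specify the network layout or the training rule, so everything below is an assumption made for illustration: one Gaussian center per training sample, a fixed width, a linear output layer trained by per-sample gradient descent on squared error, and a toy data set already mapped to [0,1]. The names `split`, `rbf_layer`, `predict` and `train` are hypothetical.

```python
import math
import random

WIDTH = 0.3  # assumed Gaussian width; the patent gives no value

def split(samples, labels, train_fraction=0.67, seed=0):
    """S103: shuffle reproducibly, then split into training and verification sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    pick = lambda ids: ([samples[i] for i in ids], [labels[i] for i in ids])
    return pick(idx[:cut]), pick(idx[cut:])

def rbf_layer(x, centers, width=WIDTH):
    """Gaussian activation of x for every center, plus a constant bias input."""
    acts = [math.exp(-sum((a - c) ** 2 for a, c in zip(x, ctr)) / (2 * width ** 2))
            for ctr in centers]
    return acts + [1.0]

def predict(x, centers, weights):
    """Linear output layer on top of the RBF activations."""
    return sum(w * a for w, a in zip(weights, rbf_layer(x, centers)))

def train(samples, labels, lr=0.1, threshold=0.01, max_epochs=10000):
    """S104: update output weights until the mean squared deviation is below threshold."""
    centers = [list(s) for s in samples]  # assumption: one center per training sample
    weights = [0.0] * (len(centers) + 1)
    mse = float("inf")
    for _ in range(max_epochs):
        for x, y in zip(samples, labels):
            acts = rbf_layer(x, centers)
            err = sum(w * a for w, a in zip(weights, acts)) - y
            weights = [w - lr * err * a for w, a in zip(weights, acts)]
        mse = sum((predict(x, centers, weights) - y) ** 2
                  for x, y in zip(samples, labels)) / len(samples)
        if mse < threshold:
            break
    return centers, weights, mse

# Toy, made-up feature vectors (already normalized); label 1 = disease present.
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.8, 0.9], [0.9, 0.8], [0.85, 0.85]]
y = [0, 0, 0, 1, 1, 1]
(train_x, train_y), (valid_x, valid_y) = split(X, y)
centers, weights, mse = train(train_x, train_y)
```

The verification set is kept out of training and is only used afterwards, for example by checking whether `predict` falls on the correct side of 0.5 for each held-out sample.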
Specifically, the biochemical parameters of each index of semen routine examination are described below. For example, under microscopic examination: 1) white blood cells (WBC) > 5/HPF is seen in genital tract inflammation (seminal vesiculitis, prostatitis), tuberculosis, tumors, etc.; 2) red blood cells (RBC) > 5/HPF is commonly found in seminal vesicle tuberculosis, prostate cancer, and the like. As another example: 1. pH: a pH below 7.0 is mostly seen in chronic infectious diseases, seminal vesicle hypofunction, congenital seminal vesicle deficiency, vas deferens obstruction and the like; 2. a pH above 8.0 is mostly seen in acute infectious diseases of the accessory gonads or epididymis; 3. sperm motility rate: a sperm motility rate below 35% is often a cause of male infertility and is mainly found in varicocele, non-specific infection of the reproductive system, pituitary dysfunction and the like.
The characteristic information of each type of trait in semen routine examination includes the color, character, smell, quantity and the like of the semen. For example: 1. abnormal semen color: yellow or brown purulent semen is commonly seen in seminal vesiculitis or prostatitis; bloody semen that is bright red, dark red or pink is mostly seen in seminal vesiculitis, and more rarely in prostatic tuberculosis and seminal vesicle tumors; 2. abnormal semen volume: excessive semen volume is often seen in oligospermia and seminal vesiculitis, and also after prolonged abstinence; reduced semen volume is seen in oligospermia, testicular hypofunction, endocrine disturbance, seminal vesiculitis, prostatitis, genital system infection, etc.; absence of semen is commonly seen in azoospermia; 3. abnormal semen liquefaction is usually found in prostate infection or lesions, such as lesions of the seminal vesicle glands and bulbourethral glands.
The sign information of the living body detection of the subject is, for example, one or a combination of any several of testicular distending pain, vas deferens pain, urinary urgency, frequent urination, painful urination, high fever, chills, fatigue, waist soreness, spermatorrhea, premature ejaculation, thirst, emaciation, weakness, susceptibility to colds and the like. For example, if the semen is colorless, transparent and too thin and the subject has urinary urgency, frequent urination, painful urination, high fever and chills, prostatitis may be present; if the semen is weak and the subject is debilitated with waist soreness, oligospermia may be present; thin semen with distending pain in the testes, pain in the vas deferens and low back pain suggests possible blood stasis. Preferably, words or phrases are extracted as keywords and irrelevant stop words are removed; the remaining terms are the characteristic values of the sign information of the living body detection of the subject.
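The keyword extraction and stop-word removal described above might look like the following sketch. It assumes English free text and a tiny hand-written stop-word list; both are illustrative assumptions, and text in other languages (for example Chinese) would need a proper word segmenter instead of the regular expression used here. The name `extract_sign_terms` is hypothetical.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller, language-appropriate one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "with",
              "has", "have", "is", "are", "patient", "reports"}

def extract_sign_terms(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

terms = extract_sign_terms("The patient reports frequent urination, high fever and chills.")
```

The surviving terms are what later stages would weight and compare against the disease knowledge base.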
In step S102 of some embodiments of the present invention, performing data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data, to form a second sample set includes the following steps:
eliminating, according to a disease knowledge base, data in the semen biochemical examination data that do not conform to biological rules and contradictory data, normalizing the semen biochemical examination data, and mapping the semen biochemical examination data to [0,1].
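A minimal sketch of this cleaning and normalization step is given below. The field names, the plausibility bounds (pH within 0 to 14, percentages within 0 to 100) and the consistency rule (forward-motile sperm cannot exceed total motile sperm) are illustrative assumptions rather than rules taken from the patent's knowledge base, and min-max scaling over the retained records is assumed as the normalization.

```python
FIELDS = ("ph", "motility_pct", "pr_pct")  # hypothetical field names

def is_plausible(record):
    """Reject values that are biologically impossible or mutually contradictory."""
    in_range = (0 <= record["ph"] <= 14
                and 0 <= record["motility_pct"] <= 100
                and 0 <= record["pr_pct"] <= 100)
    # Forward-motile sperm (PR) are a subset of motile sperm (PR+NP).
    consistent = record["pr_pct"] <= record["motility_pct"]
    return in_range and consistent

def clean_and_normalize(records, fields=FIELDS):
    """Drop implausible records, then min-max scale each field to [0, 1]."""
    kept = [r for r in records if is_plausible(r)]
    if not kept:
        return []
    out = [dict(r) for r in kept]
    for f in fields:
        lo = min(r[f] for r in kept)
        hi = max(r[f] for r in kept)
        span = hi - lo
        for r_out, r_in in zip(out, kept):
            # A constant column carries no information; map it to 0.0.
            r_out[f] = (r_in[f] - lo) / span if span else 0.0
    return out

raw = [
    {"ph": 7.4, "motility_pct": 40, "pr_pct": 32},
    {"ph": 15.0, "motility_pct": 50, "pr_pct": 30},  # impossible pH
    {"ph": 7.0, "motility_pct": 30, "pr_pct": 50},   # PR exceeds PR+NP
    {"ph": 8.0, "motility_pct": 60, "pr_pct": 40},
]
second_sample_set = clean_and_normalize(raw)
```

Scaling against the minimum and maximum of the retained data is one common choice; scaling against fixed reference bounds would make values comparable across data sets.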
In step S102 of some embodiments of the present invention, performing data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data, to form a second sample set includes the following steps:
eliminating, according to a disease knowledge base, data in the immunological examination data that do not conform to immunological rules and contradictory data, normalizing the semen biochemical examination data, and mapping the semen biochemical examination data to [0,1].
Specifically, according to the standards of "clinical diagnostics", a standard library of biochemical parameters for each index of semen routine examination, an information library of the trait characteristics and symptoms of semen routine examination, and a knowledge base of the diseases that may correspond to them are established through normalization. For example, semen routine examination generally involves collecting semen and checking semen volume, sperm motility, sperm count, abnormal sperm volume, semen liquefaction time, semen pH, total number of sperm, sperm motility time, sperm climbing, red blood cells, white blood cells, etc.; whether an index is normal or abnormal is determined by whether its value is higher or lower than the normal value. Specifically: the normal value of semen volume is 2-6 ml; the normal value of semen liquefaction time is: self-liquefying at 37 ℃ within 525 minutes; the normal pH value is 7.2 to 7.8; sperm motility rate (WHO standard): the lower limit of the reference value for sperm motility (PR+NP) is 40%, and the lower limit of the reference value for forward motile sperm (PR) is 32%; by the WHO standard, the motility rate of grade a, grade b and grade c sperm combined is greater than or equal to 60%; sperm motility (WHO standard): within 60 minutes after ejaculation, 50% or more of sperm have forward motion (grade a + grade b), or 25% or more of sperm have rapid forward motion (grade a); microscopy: 1) white blood cell (WBC) normal value < 5/HPF; 2) red blood cell (RBC) normal value < 5/HPF; 3) sperm density: normal sperm density is around 20-60 million per milliliter. The above "clinical diagnostics" is only an example of a disease knowledge base corresponding to semen routine examination data and is not to be taken as a limitation of the present invention.
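The comparison of each index against its normal value can be sketched as a small lookup. The ranges below are the ones quoted in the paragraph above (sperm density is expressed in millions per milliliter, since 2000-6000 ten thousand per milliliter equals 20-60 million per milliliter); the field names and the function `flag_results` are hypothetical, and a real knowledge base would hold many more indices.

```python
# Reference ranges quoted in the text; keys are hypothetical field names.
REFERENCE_RANGES = {
    "volume_ml": (2.0, 6.0),
    "ph": (7.2, 7.8),
    "density_million_per_ml": (20.0, 60.0),
}
# Microscopy counts are normal when strictly below the limit (< 5/HPF).
MAX_CELLS_PER_HPF = {"wbc_per_hpf": 5, "rbc_per_hpf": 5}

def flag_results(result):
    """Label each known index as 'low', 'normal' or 'high'; unknown keys are skipped."""
    flags = {}
    for name, value in result.items():
        if name in REFERENCE_RANGES:
            lo, hi = REFERENCE_RANGES[name]
            flags[name] = "low" if value < lo else "high" if value > hi else "normal"
        elif name in MAX_CELLS_PER_HPF:
            flags[name] = "normal" if value < MAX_CELLS_PER_HPF[name] else "high"
    return flags

flags = flag_results({
    "volume_ml": 1.5,
    "ph": 7.4,
    "density_million_per_ml": 75,
    "wbc_per_hpf": 6,
    "rbc_per_hpf": 2,
})
```

Flags of this kind could then be matched against the disease associations in the knowledge base, for example an elevated white blood cell count pointing toward genital tract inflammation.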
For example, the knowledge base of diseases related to the present invention includes "immunology" and "clinical genitalia", etc.
In step S102 of some embodiments of the present invention, performing data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data, to form a second sample set includes the following steps:
carrying out semantic similarity calculation on the vital sign information according to the disease knowledge base to obtain the corresponding characteristic values of the vital sign information, and eliminating data with low correlation with semen-related diseases.
In the above embodiment, the second sample set includes normalized semen biochemical test data and immunological test data, and characteristic values of vital sign information of the living body test.
In another aspect of the present invention, a system for predicting a disease based on semen routine examination data is provided, which comprises an acquisition module 11, a storage module 12, a calculation module 13 and a prediction model 14, wherein the acquisition module 11 is used for acquiring semen biochemical examination data, immunological examination data and vital sign information of a living body detection of a person to be detected; the storage module 12 is used for storing a disease knowledge base corresponding to semen routine examination data; the computing module 13 is configured to match the semen biochemical inspection data, the immunological inspection data, and the vital sign information of the living body detection of the person to be tested with the disease knowledge base, and normalize the semen biochemical inspection data and the immunological inspection data to obtain a feature vector of the person to be tested; the prediction model 14 is used for predicting the disease probability of the testee according to the feature vector.
In some embodiments of the present invention, the calculating module 13 performs semantic similarity calculation on the detected sign information of the living body according to the disease knowledge base to obtain a first feature vector.
Further, the calculating module 13 calculates the semantic similarity between the disease knowledge base and the sign information of the living body detection through Euclidean distance to obtain a second feature vector, and obtains the feature vector of the person to be tested from the first feature vector and the second feature vector. Specifically, the characteristic information of each category of the subject's semen in routine examination (color, character, smell, quantity, etc.) is acquired, together with the subject's symptom and sign information, such as testicular distending pain, vas deferens pain, urinary urgency, urinary frequency and painful urination. Processing this information involves text feature extraction and semantic similarity calculation. Here, feature items are selected by TF-IDF, and a semen trait feature information vector set and a symptom feature information vector set are established.
The main idea of TF-IDF is as follows: if a word or phrase appears with a high term frequency (TF) in one article and rarely appears in other articles, it is considered to have good category discrimination and to be suitable for classification. The term frequency (TF) represents the frequency with which a term (keyword) appears in a text. This count is usually normalized (typically the number of occurrences of the term divided by the total number of terms in the document) to prevent a bias toward long documents. The formula is:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
namely:
TF = (number of occurrences of the term in the document) / (total number of terms in the document)
For the inverse document frequency (IDF), the fewer the documents containing the term t, the larger the IDF, and the better the category discrimination of the term. The formula is:
$$\mathrm{idf}_{i} = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing the term $t_i$ (that is, the number of documents for which $n_{i,j} \neq 0$). If the term does not appear in the corpus, the denominator would be zero, so $1 + |\{j : t_i \in d_j\}|$ is typically used, namely:
$$\mathrm{idf}_{i} = \log\frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$
where 1 is added to the denominator to prevent it from being 0.
A high term frequency within a particular document, combined with a low document frequency of that term across the whole document collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and preserve important words. The formula is:
TF-IDF = TF × IDF.
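The TF-IDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `tf_idf`, the pre-tokenized input format, and the sample symptom tokens are assumptions made for the example. It uses the smoothed IDF with 1 added to the denominator.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (tokenization is assumed to be done already).
    TF  = occurrences of the term in the document / total terms in the document
    IDF = log(|D| / (1 + number of documents containing the term))
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return weights

# Hypothetical symptom descriptions, one token list per record.
records = [
    ["pain", "urgency", "pain"],
    ["urgency", "frequency"],
    ["urgency"],
    ["odor"],
]
w = tf_idf(records)
```

With the smoothed denominator, a term that occurs in every document receives a slightly negative weight, so an implementation may choose to clip weights at zero.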
Meanwhile, the semantic similarity between the feature information vector set to be identified and the feature information vector set in the database is calculated by cosine similarity.
Given two vectors in n-dimensional space, $A = (a_1, a_2, a_3, \ldots, a_n)$ and $B = (b_1, b_2, b_3, \ldots, b_n)$, the cosine of the angle $\theta$ between them is:
$$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$
Here, one of vector A and vector B can be understood as the feature vector of the person to be tested in the previous embodiment; the other is the corresponding matching feature vector in the prediction model.
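As a small illustration of the cosine similarity between two feature vectors, the following Python sketch computes it directly from the definition. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for this example, not details taken from the patent.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors: one for the person to be tested,
# one taken from the stored feature vector set.
similarity = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate nearly orthogonal vectors.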
In some embodiments of the present invention, the prediction model includes a model constructed by the method for constructing a disease prediction model based on semen routine inspection data provided in the first aspect of the present invention.
Further, the predictive model includes a trained radial basis function neural network.
Specifically, the RBF network of the present invention non-linearly maps the data into a high-dimensional linear space through radial basis functions, and then performs fitting or regression with a linear model in that high-dimensional space. The network comprises three layers: the first layer is the input layer, which contains N nodes (namely features or data); the second layer is the hidden layer, which has M nodes in total, each node being an activation function that non-linearly maps the input layer data into the high-dimensional space; the third layer is the output layer, which outputs only one value. Here, the output of the RBF neural network is a predicted value of the synthax integral, and the possible pathological changes associated with the subject's semen abnormality are estimated based on the network output.
The specific steps are as follows. The network uses an input vector X (the vector corresponding to the second sample set), a corresponding target output vector Y (the vector corresponding to the disorder or disease), and a width vector $D_j$ of the radial basis function. When training on the l-th input sample (l = 1, 2, ..., N), the parameters are expressed and calculated as follows:
1) Determine the parameters.
(1) Determine the input vector X:
$X = [x_1, x_2, \ldots, x_n]^T$, where n is the number of input layer units;
(2) Determine the output vector Y and the desired output vector O:
$Y = [y_1, y_2, \ldots, y_q]^T$, where q is the number of output layer units;
$O = [o_1, o_2, \ldots, o_q]^T$;
(3) Initialize the connection weights from the hidden layer to the output layer:
$W_k = [w_{k1}, w_{k2}, \ldots, w_{kp}]^T$, (k = 1, 2, ..., q);
where p is the number of hidden layer units and q is the number of output layer units.
With reference to the method used for initializing the centers, the weights from the hidden layer to the output layer are initialized using $\min_k$ and $\max_k$, where $\min_k$ is the minimum of all expected outputs of the k-th output neuron in the training set and $\max_k$ is the maximum of all expected outputs of the k-th output neuron in the training set.
(4) Initialize the center parameters of the hidden layer neurons, $C_j = \{c_{j1}, c_{j2}, \ldots, c_{jn}\}^T$. The centers of different hidden layer neurons should take different values, and the width corresponding to each center can be adjusted, so that different input features are reflected as fully as possible by different hidden layer neurons. In practical applications, an input always lies within a certain range of values. Without loss of generality, the initial values of the center components of the hidden layer neurons are spaced at equal intervals from small to large, so that weaker input information produces a stronger response near the smaller centers. The size of the interval can be adjusted through the number of hidden layer neurons. The advantage of this approach is that a reasonable number of hidden layer neurons can be found by trial and error, the centers are initialized as reasonably as possible, and different input features are reflected more clearly at different centers, which reflects the characteristics of the Gaussian kernel.
Based on the above four items, the initial values of the center parameters $c_{ji}$ of the RBF neural network are computed from p, $\min_i$ and $\max_i$, where p is the total number of hidden layer neurons (j = 1, 2, ..., p), $\min_i$ is the minimum value of all input information of the i-th feature in the training set, and $\max_i$ is the maximum value of all input information of the i-th feature in the training set.
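The equal-interval initialization of the centers can be illustrated with the Python sketch below. The exact initialization formula is not reproduced in this text, so the midpoint spacing used here, $c_{ji} = \min_i + (2j - 1)(\max_i - \min_i)/(2p)$, is only one plausible choice and should be read as an assumption; the function name `init_centers` is likewise hypothetical.

```python
import numpy as np

def init_centers(X, p):
    """Place p centers per feature at equal intervals between min_i and max_i.

    X has shape (N, n): N training samples, n input features.
    Returns an array of shape (p, n) whose row j-1 holds the center C_j.
    Assumption: center j sits at the midpoint of the j-th of p equal bins.
    """
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)                      # min_i for each feature
    hi = X.max(axis=0)                      # max_i for each feature
    j = np.arange(1, p + 1)[:, None]        # j = 1, ..., p as a column
    return lo + (2 * j - 1) * (hi - lo) / (2 * p)

# Two samples, two features, two hidden neurons.
centers = init_centers([[0.0, 10.0], [1.0, 20.0]], p=2)
```

Increasing p shrinks the spacing between neighbouring centers, which matches the statement that the interval is adjusted through the number of hidden layer neurons.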
(5) Initialize the width vector $D_j = \{d_{j1}, d_{j2}, \ldots, d_{jn}\}^T$. The width vector affects the range over which a neuron acts on the input information: the smaller the width, the narrower the shape of the corresponding hidden layer neuron's action function, and the smaller the response of that neuron to information near the centers of other neurons.
The width is calculated using a width adjustment coefficient $d_f$ whose value is smaller than 1. Its purpose is to make it easier for each hidden layer neuron to sense local information, which improves the local response capability of the RBF neural network.
2) Calculate the output value $z_j$ of the j-th neuron of the hidden layer:
$$z_j = \exp\left(-\left\|\frac{X - C_j}{D_j}\right\|^2\right), \quad j = 1, 2, \ldots, p$$
$C_j$ is the center vector of the j-th hidden layer neuron, composed of the center components of the j-th hidden layer neuron corresponding to all neurons of the input layer, $C_j = \{c_{j1}, c_{j2}, \ldots, c_{jn}\}^T$; $D_j$ is the width vector of the j-th hidden layer neuron, corresponding to $C_j$, $D_j = \{d_{j1}, d_{j2}, \ldots, d_{jn}\}^T$. The larger $D_j$ is, the larger the range of influence of the hidden layer on the input vector and the smoother the transition between neurons; $\|\cdot\|$ denotes the Euclidean norm.
3) Calculate the output of the output layer neurons.
$Y = [y_1, y_2, \ldots, y_q]^T$, with $y_k = \sum_{j=1}^{p} w_{kj} z_j$ (k = 1, 2, ..., q), where $w_{kj}$ is the adjustment weight between the k-th neuron of the output layer and the j-th neuron of the hidden layer.
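Steps 2) and 3) together form the forward pass of the network. The NumPy sketch below assumes a Gaussian hidden unit with one width per input component, consistent with the Gaussian kernel and the width vector mentioned in the text, and a purely linear output layer; the function name `rbf_forward` and the array layout are assumptions made for illustration.

```python
import numpy as np

def rbf_forward(X, C, D, W):
    """Forward pass of a Gaussian RBF network.

    X: (N, n) inputs, C: (p, n) centers, D: (p, n) widths, W: (q, p) weights.
    Returns Z (N, p), the hidden layer outputs z_j, and Y (N, q), the outputs y_k.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    C = np.asarray(C, dtype=float)
    D = np.asarray(D, dtype=float)
    W = np.asarray(W, dtype=float)
    # Component-wise scaled distance of every sample to every center.
    scaled = (X[:, None, :] - C[None, :, :]) / D[None, :, :]
    Z = np.exp(-np.sum(scaled ** 2, axis=2))   # z_j = exp(-||(X - C_j) / D_j||^2)
    Y = Z @ W.T                                # y_k = sum_j w_kj * z_j
    return Z, Y

# Two hidden neurons, one output neuron, one sample located at the first center.
C = np.array([[0.0, 0.0], [1.0, 1.0]])
D = np.ones((2, 2))
W = np.array([[1.0, 2.0]])
Z, Y = rbf_forward([[0.0, 0.0]], C, D, W)
```

A sample that coincides with a center gives that hidden neuron an output of exactly 1, and the response decays as the sample moves away at a rate set by the widths.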
4) Iteratively calculate the weight parameters.
The weight parameters of the RBF neural network are trained by the gradient descent method. The center, width and adjustment weight parameters are adaptively adjusted toward their optimal values through learning.
In the iterative calculation, $w_{kj}(t)$ is the adjustment weight between the k-th output neuron and the j-th hidden layer neuron in the t-th iteration, $c_{ji}(t)$ is the center component of the j-th hidden layer neuron for the i-th input neuron in the t-th iteration, $d_{ji}(t)$ is the width corresponding to the center $c_{ji}(t)$, and $\eta$ is a learning factor.
E is the evaluation function of the RBF neural network: $E = \frac{1}{2}\sum_{l=1}^{N}\sum_{k=1}^{q}(y_{lk} - O_{lk})^2$, where $O_{lk}$ is the desired output value of the k-th output neuron for the l-th input sample and $y_{lk}$ is the network output value of the k-th output neuron for the l-th input sample.
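One gradient descent iteration over the weights, centers and widths can be sketched as follows. This is an illustration under stated assumptions rather than the patent's exact update rule: it assumes Gaussian hidden units with component-wise widths, a linear output layer, half the summed squared error as the evaluation function E, and a plain update of the form parameter(t) = parameter(t-1) - η ∂E/∂parameter, without any momentum term. All function names are hypothetical.

```python
import numpy as np

def forward(X, C, D, W):
    """Hidden outputs Z (N, p) and network outputs Y (N, q)."""
    scaled = (X[:, None, :] - C[None, :, :]) / D[None, :, :]
    Z = np.exp(-np.sum(scaled ** 2, axis=2))
    return Z, Z @ W.T

def loss(X, O, C, D, W):
    """Assumed evaluation function E: half the summed squared error."""
    _, Y = forward(X, C, D, W)
    return 0.5 * np.sum((Y - O) ** 2)

def gradients(X, O, C, D, W):
    """Analytic gradients of E with respect to centers, widths and weights."""
    Z, Y = forward(X, C, D, W)
    err = Y - O                                   # (N, q): y_lk - O_lk
    grad_W = err.T @ Z                            # (q, p)
    back = (err @ W) * Z                          # (N, p): dE/dz_j times z_j
    delta = X[:, None, :] - C[None, :, :]         # (N, p, n): x_i - c_ji
    grad_C = np.sum(back[:, :, None] * 2.0 * delta / D[None, :, :] ** 2, axis=0)
    grad_D = np.sum(back[:, :, None] * 2.0 * delta ** 2 / D[None, :, :] ** 3, axis=0)
    return grad_C, grad_D, grad_W

def gradient_step(X, O, C, D, W, eta=0.01):
    """One plain gradient descent update with learning factor eta."""
    grad_C, grad_D, grad_W = gradients(X, O, C, D, W)
    return C - eta * grad_C, D - eta * grad_D, W - eta * grad_W

# Small synthetic problem: 6 samples, 3 features, 4 hidden neurons, 2 outputs.
rng = np.random.default_rng(0)
X = rng.random((6, 3))
O = rng.random((6, 2))
C = rng.random((4, 3))
D = np.full((4, 3), 0.8)
W = rng.random((2, 4))
C_new, D_new, W_new = gradient_step(X, O, C, D, W, eta=1e-3)
```

In practice the step would be repeated until the deviation between the output and the desired value falls below a chosen threshold, as the training stopping condition describes.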
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (6)
1. The method for constructing the disease prediction model based on semen routine inspection data is characterized by comprising the following steps:
acquiring semen biochemical examination data, immunological examination data and vital sign information of a sample crowd to form a first sample set;
data cleaning and standardization are carried out on the first sample set according to a disease knowledge base corresponding to semen routine inspection data, so that a second sample set is formed: removing data which do not accord with biological rules and contradictory data from semen biochemical inspection data according to a disease knowledge base, normalizing the semen biochemical inspection data, and mapping the semen biochemical inspection data onto [0,1 ]; rejecting the data of which the immunological check data do not accord with the immunological rule and the data contradicting each other according to a disease knowledge base, normalizing the immunological check data, and mapping the immunological check data to [0,1 ]; according to the disease knowledge base, carrying out semantic similarity calculation on the detected vital sign information of the living body to obtain a corresponding characteristic value of the detected vital sign information of the living body, and eliminating data with low correlation with semen related diseases;
dividing the second sample set into a training set and a verification set, and then taking the training set as the input of a radial basis function neural network;
and training the radial basis function neural network until the deviation between the output value and the true value is lower than a threshold value, and obtaining a disease prediction model.
2. The method of claim 1, wherein the second sample set comprises normalized semen biochemical test data and immunological test data, and characteristic values of vital sign information of a living body test.
3. A system for predicting a disease model based on semen routine examination data is characterized by comprising an acquisition module, a storage module, a calculation module and a prediction model,
the acquisition module is used for acquiring semen biochemical examination data, immunological examination data and vital sign information of living body detection of a person to be detected;
the storage module is used for storing a disease knowledge base corresponding to the semen routine inspection data;
the calculation module is used for matching the semen biochemical examination data, the immunological examination data and the vital sign information of the living body detection of the testee with the disease knowledge base, and normalizing the semen biochemical examination data and the immunological examination data to obtain a feature vector of the testee;
the prediction model is used for predicting the disease probability of a person to be tested according to the feature vector, and comprises a model constructed by the disease prediction model construction method based on semen routine inspection data according to any one of claims 1-2.
4. The system of claim 3, wherein the computing module performs semantic similarity computation on the vital sign information of the living body detection according to the disease knowledge base to obtain a first feature vector.
5. The system of claim 4, wherein the computing module computes semantic similarity between the disease knowledge base and vital sign information of the living body test by euclidean distance to obtain a second feature vector; and obtaining the feature vector of the person to be tested according to the first feature vector and the second feature vector.
6. A system of disease prediction models based on semen routine inspection data according to claim 3, wherein the prediction models comprise trained radial basis function neural networks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010900071.2A CN112017771B (en) | 2020-08-31 | 2020-08-31 | Method and system for constructing disease prediction model based on semen routine inspection data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010900071.2A CN112017771B (en) | 2020-08-31 | 2020-08-31 | Method and system for constructing disease prediction model based on semen routine inspection data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112017771A (en) | 2020-12-01 |
CN112017771B (en) | 2024-02-27 |
Family
ID=73515297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010900071.2A Active CN112017771B (en) | 2020-08-31 | 2020-08-31 | Method and system for constructing disease prediction model based on semen routine inspection data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112017771B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112233750B (en) * | 2020-10-20 | 2024-02-02 | 吾征智能技术(北京)有限公司 | Information matching system based on hemoptysis characters and diseases |
CN112908484A (en) * | 2021-01-18 | 2021-06-04 | 吾征智能技术(北京)有限公司 | System, equipment and storage medium for analyzing diseases by cross-modal fusion |
CN113393934B (en) * | 2021-06-07 | 2022-07-12 | 义金(杭州)健康科技有限公司 | Health trend estimation method and prediction system based on vital sign big data |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1997005553A1 (en) * | 1995-07-25 | 1997-02-13 | Horus Therapeutics, Inc. | Computer assisted methods for diagnosing diseases |
WO1998024369A1 (en) * | 1996-12-02 | 1998-06-11 | The University Of Texas System | Spectroscopic detection of cervical pre-cancer using radial basis function networks |
US6090044A (en) * | 1997-12-10 | 2000-07-18 | Bishop; Jeffrey B. | System for diagnosing medical conditions using a neural network |
WO2005091203A2 (en) * | 2004-03-12 | 2005-09-29 | Aureon Laboratories, Inc. | Systems and methods for treating, diagnosing and predicting the occurrence of a medical condition |
CA2747184A1 (en) * | 2010-08-06 | 2012-02-06 | Miraculins Inc. | Biomarkers for the diagnosis of prostate cancer in a non-hypertensive population |
CN104008164A (en) * | 2014-05-29 | 2014-08-27 | 华东师范大学 | Generalized regression neural network based short-term diarrhea multi-step prediction method |
CA2894317A1 (en) * | 2015-06-15 | 2016-12-15 | Deep Genomics Incorporated | Systems and methods for classifying, prioritizing and interpreting genetic variants and therapies using a deep neural network |
KR20180046432A (en) * | 2016-10-27 | 2018-05-09 | 가톨릭대학교 산학협력단 | Method and Apparatus for Classification and Prediction of Pathology Stage using Decision Tree for Treatment of Prostate Cancer |
WO2018187952A1 (en) * | 2017-04-12 | 2018-10-18 | 邹霞 | Kernel discriminant analysis approximation method based on neural network |
CN109036553A (en) * | 2018-08-01 | 2018-12-18 | 北京理工大学 | A kind of disease forecasting method based on automatic extraction Medical Technologist's knowledge |
CN110459328A (en) * | 2019-07-05 | 2019-11-15 | 梁俊 | A kind of Clinical Decision Support Systems for assessing sudden cardiac arrest |
CN110880369A (en) * | 2019-10-08 | 2020-03-13 | 中国石油大学(华东) | Gas marker detection method based on radial basis function neural network and application |
KR102100699B1 (en) * | 2019-07-01 | 2020-04-16 | (주)제이엘케이인스펙션 | Apparatus and method for constructing unified lesion learning model and apparatus and method for diagnosing lesion using the unified lesion learning model |
CN111554401A (en) * | 2020-03-26 | 2020-08-18 | 肾泰网健康科技(南京)有限公司 | Method for constructing AI (artificial intelligence) chronic kidney disease screening model, and chronic kidney disease screening method and system |
Non-Patent Citations (3)
Title |
---|
Application of multilayer perceptron and radial basis function neural networks in differentiating between chronic obstructive pulmonary and congestive heart failure diseases; Mehrabi, S, et al.; EXPERT SYSTEMS WITH APPLICATIONS; Vol. 36 (No. 03); pp. 6956-6959 * |
Prostate cancer diagnosis method based on GMM-RBF neural network; 崔少泽, et al.; 管理科学 (No. 01); pp. 33-47 * |
Interpretation of systemic lupus erythematosus autoantibody profile data and disease model prediction; 彭玲, et al.; 检验医学与临床 (No. 05); pp. 635-638 * |
Also Published As
Publication number | Publication date |
---|---|
CN112017771A (en) | 2020-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112017771B (en) | Method and system for constructing disease prediction model based on semen routine inspection data | |
Yu et al. | Automatic classification of leukocytes using deep neural network | |
Güvenir et al. | Estimating the chance of success in IVF treatment using a ranking algorithm | |
CN109378066A (en) | A kind of control method and control device for realizing disease forecasting based on feature vector | |
CN108717867A (en) | Disease forecasting method for establishing model and device based on Gradient Iteration tree | |
CN107506579A (en) | Cerebral hemorrhage forecast model method for building up and system based on integrated study | |
Misir et al. | A reduced set of features for chronic kidney disease prediction | |
CN110232185A (en) | Towards financial industry software test knowledge based map semantic similarity calculation method | |
Assegie et al. | Exploring the performance of feature selection method using breast cancer dataset | |
Mostaar et al. | Use of artificial neural networks and PCA to predict results of infertility treatment in the ICSI method | |
Bhandari et al. | Comparative analysis of fuzzy expert systems for diabetic diagnosis | |
Shinde et al. | Analysis of WBC, RBC, platelets using deep learning | |
CN114496231A (en) | Constitution identification method, apparatus, equipment and storage medium based on knowledge graph | |
Oliver et al. | Extraction of SNOMED concepts from medical record texts. | |
Sharma et al. | Fuzzy logic: A tool to predict the Renal diseases | |
Lowongtrakool et al. | Noise filtering in unsupervised clustering using computation intelligence | |
Jabbar et al. | Risks of chronic kidney disease prediction using various data mining algorithms | |
Dardzinska et al. | Decision-making process in colon disease and Crohn’s disease treatment | |
Hossam et al. | A sub-optimum feature selection algorithm for effective breast cancer detection based on particle swarm optimization | |
CN112233742A (en) | Medical record document classification system, equipment and storage medium based on clustering | |
Heaton et al. | Repurposing trec-covid annotations to answer the key questions of cord-19 | |
Razzaq et al. | Stroke Prediction in Elderly Persons using Remote Health Monitoring | |
Junath et al. | Prognostic diagnosis for breast cancer patients using probabilistic bayesian classification | |
CN116110594B (en) | Knowledge evaluation method and system of medical knowledge graph based on associated literature | |
Tita et al. | Analyze the use of machine learning models in the Pima diabetes data set for early stage detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||