CN112017771B

CN112017771B - Method and system for constructing disease prediction model based on semen routine inspection data

Info

Publication number: CN112017771B
Application number: CN202010900071.2A
Authority: CN
Inventors: 杜乐; 杜登斌
Original assignee: Wuzheng Intelligent Technology Beijing Co ltd
Current assignee: Wuzheng Intelligent Technology Beijing Co ltd
Priority date: 2020-08-31
Filing date: 2020-08-31
Publication date: 2024-02-27
Anticipated expiration: 2040-08-31
Also published as: CN112017771A

Abstract

The invention relates to a method and a system for constructing a disease prediction model based on semen routine inspection data, wherein the method comprises the following steps: acquiring semen biochemical examination data, immunological examination data and vital sign information of a sample crowd to form a first sample set; performing data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data to form a second sample set; dividing the second sample set into a training set and a verification set, and then taking the training set as the input of a radial basis function neural network; and training the radial basis function neural network until the deviation between the output value and the true value is lower than a threshold value, and obtaining a disease prediction model. The invention builds a machine learning model by utilizing Radial Basis Functions (RBFs) based on a sample set built by multiple data sources so as to predict related diseases, can be used for basic doctors to learn and reference, is convenient for early self-check and prevention of patients, and has certain popularization and application values.

Description

Method and system for constructing disease prediction model based on semen routine inspection data

Technical Field

The invention relates to the technical fields of intelligent medical treatment and medical information, relates to a method and a system for constructing a disease prediction model, and particularly relates to a method and a system for constructing a disease prediction model based on semen routine inspection data.

Background

Semen consists of sperm and seminal plasma: sperm accounts for 10 percent and the rest is seminal plasma. Besides water, fructose, proteins and fats, it contains various enzymes and inorganic salts. Routine semen examination is a preliminary laboratory examination of the volume, properties and function of semen. It covers semen volume, color, viscosity, liquefaction time, sperm count, sperm motility, sperm morphology, semen cell examination and so on, and is mainly used for diagnosing male reproductive capacity and reproductive system diseases.

Immunological examination can determine whether autoimmunity is present, and chromosomal karyotyping can determine whether chromosomal abnormalities are present. Determination of serum FSH (follicle stimulating hormone), LH (luteinizing hormone), T (testosterone) and PRL (prolactin) is an important method for examining oligospermia and also helps to distinguish between primary and secondary testicular failure.

At present, an accurate diagnosis and treatment plan for semen-related diseases depends on doctors with abundant experience and strong professional ability and on multiple examinations. When medical resources are scarce, a subject or patient usually has to go through a period of examination and waiting before all examination results are available, so the timeliness of the examination data is uncertain. This can delay the optimal time for diagnosis and even cause misdiagnosis, bringing mental and economic loss to the patient.

On the other hand, basic-level medical institutions are short of medical equipment and the professional ability of their medical staff is limited, so the medical services they provide cannot meet the demands of the public.

Disclosure of Invention

In order to relieve medical resource tension and physical examination pressure of basic medical institutions, facilitate self-checking prevention of patients and study and reference of basic doctors, the invention provides a method for constructing a disease prediction model based on semen routine examination data, which comprises the following steps: acquiring semen biochemical examination data, immunological examination data and vital sign information of a sample crowd to form a first sample set; performing data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data to form a second sample set; dividing the second sample set into a training set and a verification set, and then taking the training set as the input of a radial basis function neural network; and training the radial basis function neural network until the deviation between the output value and the true value is lower than a threshold value, and obtaining a disease prediction model.

In some embodiments of the present invention, the data cleaning and standardization are performed on the first sample set according to the disease knowledge base corresponding to the semen routine inspection data, and the forming of the second sample set includes the following steps:

eliminating, according to the disease knowledge base, the data in the semen biochemical examination data that do not accord with biological rules and the mutually contradictory data, normalizing the semen biochemical examination data, and mapping the semen biochemical examination data onto [0,1];

eliminating, according to the disease knowledge base, the data in the immunological examination data that do not accord with immunological rules and the mutually contradictory data, normalizing the immunological examination data, and mapping the immunological examination data onto [0,1];

and carrying out semantic similarity calculation on the vital sign information according to the disease knowledge base to obtain the corresponding characteristic values of the vital sign information, and eliminating data with low correlation with semen-related diseases.

In the above embodiment, the second sample set includes normalized semen biochemical test data and immunological test data, and characteristic values of vital sign information of the living body test.
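The cleaning and normalization steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record fields (`ph`, `volume_ml`), the plausibility bounds in `PLAUSIBLE`, and the function names are assumptions standing in for the rules that a disease knowledge base would supply.

```python
# Hypothetical plausibility bounds; a real system would take these
# from the disease knowledge base rather than hard-code them.
PLAUSIBLE = {"ph": (0.0, 14.0), "volume_ml": (0.0, 20.0)}


def clean(records):
    """Drop records with any value outside its plausible range."""
    kept = []
    for rec in records:
        if all(lo <= rec[key] <= hi for key, (lo, hi) in PLAUSIBLE.items()):
            kept.append(rec)
    return kept


def min_max_normalize(records, keys):
    """Map each listed field onto [0, 1] using min-max scaling."""
    out = [dict(rec) for rec in records]
    for key in keys:
        values = [rec[key] for rec in records]
        lo, hi = min(values), max(values)
        span = hi - lo
        for rec in out:
            # A constant column carries no information; map it to 0.0.
            rec[key] = (rec[key] - lo) / span if span else 0.0
    return out


samples = [
    {"ph": 7.4, "volume_ml": 3.0},
    {"ph": 6.8, "volume_ml": 1.0},
    {"ph": 15.2, "volume_ml": 2.5},  # impossible pH, removed by clean()
    {"ph": 8.0, "volume_ml": 5.0},
]
second_set = min_max_normalize(clean(samples), ["ph", "volume_ml"])
```

Min-max scaling is only one way to map values onto [0,1]; the text does not say which normalization is used, so the choice here is illustrative.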

In another aspect of the invention, a system for predicting a disease model based on semen routine inspection data is provided, which comprises an acquisition module, a storage module, a matching module, a calculation module and a prediction model, wherein the acquisition module is used for acquiring semen biochemical inspection data, immunological inspection data and vital sign information of a person to be tested; the storage module is used for storing a disease knowledge base corresponding to the semen routine inspection data; the calculation module is used for matching the semen biochemical examination data, the immunological examination data and the vital sign information of the living body detection of the testee with the disease knowledge base, and normalizing the semen biochemical examination data and the immunological examination data to obtain a feature vector of the testee; the prediction model is used for predicting the illness probability of the testee according to the feature vector.

In some embodiments of the present invention, the calculating module performs semantic similarity calculation on the detected sign information of the living body according to the disease knowledge base, so as to obtain a first feature vector.

Further, the calculating module calculates semantic similarity between the disease knowledge base and sign information of living body detection through Euclidean distance to obtain a second feature vector; and obtaining the feature vector of the person to be tested according to the first feature vector and the second feature vector.

In some embodiments of the present invention, the prediction model includes a model constructed by the method for constructing a disease prediction model based on semen routine inspection data provided in the first aspect of the present invention.

Further, the predictive model includes a trained radial basis function neural network.

The beneficial effects of the invention are as follows:

1. Based on a data set constructed from multiple data sources, the invention cleans and normalizes the data set and then builds a machine learning model using radial basis functions (RBF); through this model, the probability that a subject suffers from semen-related diseases can be predicted rapidly. The method can be used by basic-level doctors for study and reference, facilitates early prediction and prevention for patients, and has certain value for popularization and application.

2. The invention adopts different screening and cleaning methods aiming at different attributes of various semen inspection data, improves the effectiveness and accuracy of the data, reduces the training error and training time of the model, and thus has better robustness.

Drawings

FIG. 1 is a basic flow chart of a method of constructing a disease prediction model based on semen routine inspection data in some embodiments of the invention;

FIG. 2 is a schematic structural diagram of a system for predicting a model of a disease based on semen routine inspection data in some embodiments of the invention.

Detailed Description

The principles and features of the present invention are described below with reference to the drawings; the examples are given only to illustrate the invention and are not to be construed as limiting its scope.

The invention provides a method for constructing a disease prediction model based on semen routine inspection data, which comprises the following steps: S101, acquiring semen biochemical examination data, immunological examination data and vital sign information of a sample population to form a first sample set; S102, carrying out data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data to form a second sample set; S103, dividing the second sample set into a training set and a verification set, and then taking the training set as the input of a radial basis function neural network; S104, training the radial basis function neural network until the deviation between the output value and the true value is lower than a threshold value, and obtaining a disease prediction model.
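The split into a training set and a verification set in step S103 can be sketched as below. The 80/20 ratio, the fixed seed, and the `(features, label)` sample layout are assumptions made for illustration; the text does not specify how the split is performed.

```python
import random


def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into two lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy second sample set: (normalized feature vector, disease label).
second_sample_set = [([i / 10.0], i % 2) for i in range(10)]
training_set, validation_set = split_samples(second_sample_set)
```

Shuffling before the cut avoids a split that merely follows the order in which the samples were collected.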

Specifically, the biochemical parameters of each index of the routine semen examination are described below. For example, under microscopic examination: 1) white blood cells (WBC) > 5/HPF are seen in genital tract inflammation (seminal vesiculitis, prostatitis), tuberculosis, tumors, etc.; 2) red blood cells (RBC) > 5/HPF are commonly found in seminal vesicle tuberculosis, prostate cancer and the like. As further examples: 1. pH below 7.0 is mostly seen in chronic infectious diseases, seminal vesicle hypofunction, congenital absence of the seminal vesicles, vas deferens obstruction and the like; 2. pH above 8.0 is mostly seen in acute infectious diseases of the accessory gonads or epididymis; 3. sperm motility rate: a motility rate below 35% is often a cause of male infertility and is mainly found in varicocele, non-specific infection of the reproductive system, pituitary dysfunction and the like.

The trait information of each category in the routine semen examination includes the color, properties, odor, volume and so on of the semen. For example: 1. abnormal semen color: yellow or brown purulent semen is commonly seen in seminal vesiculitis or prostatitis; bloody semen that is bright red, dark red or pink is mostly seen in seminal vesiculitis and, more rarely, in prostatic tuberculosis and seminal vesicle tumors; 2. abnormal semen volume: excessive semen volume is often seen in oligospermia and seminal vesiculitis, and also after an overly long period of abstinence; reduced semen volume is seen in oligospermia, testicular hypofunction, endocrine disturbance, seminal vesiculitis, prostatitis, genital system infection, etc.; absence of semen is commonly seen in azoospermia; 3. abnormal semen liquefaction is usually found in prostate infection or lesions, and in lesions of the seminal vesicles and bulbourethral glands.

The vital sign information of the subject is, for example, one or any combination of testicular distending pain, vas deferens pain, urgent urination, frequent urination, painful urination, high fever, chills, fatigue, waist soreness, spermatorrhea, premature ejaculation, thirst, emaciation, weakness, susceptibility to colds and the like. For example, if the semen is colorless, transparent and too thin and the subject has urgent, frequent and painful urination, high fever and chills, prostatitis may be present; if thin semen is accompanied by weakness, debility and waist soreness, oligospermia may be present; thin semen with distending pain in the testes, pain in the vas deferens and low back pain suggests possible blood stasis. Preferably, keywords are extracted from these words or phrases and irrelevant stop words are removed; the result is the characteristic values of the vital sign information of the subject.

In step S102 of some embodiments of the present invention, performing data cleaning and standardization on the first sample set according to a disease knowledge base corresponding to semen routine inspection data, to form a second sample set includes the following steps:

Specifically, according to "clinical diagnostics", a standard library of biochemical parameters for each index of the routine semen examination, a library of semen trait information and symptom information, and a knowledge base of the diseases that may correspond to them are established through normalization. For example, a routine semen examination generally involves collecting semen and determining whether the semen volume, sperm motility rate, sperm count, amount of abnormal sperm, semen liquefaction time, semen pH, total number of sperm, sperm motility time, sperm climbing, red blood cells, white blood cells and so on are abnormal, that is, whether each index is higher or lower than its normal range. Specifically: the normal ejaculate volume is 2-6 ml; the normal semen liquefaction time is self-liquefaction within 5-25 minutes at 37 ℃; the normal pH is 7.2 to 7.8; sperm motility rate (WHO standard): the lower reference limit for total motility (PR+NP) is 40% and the lower reference limit for progressively motile sperm (PR) is 32%; by the WHO grading standard, the motility rate of grade a, grade b and grade c sperm is greater than or equal to 60%; sperm motility (WHO standard): within 60 minutes after ejaculation, 50% or more of sperm show forward motion (grade a + grade b), or 25% or more show rapid forward motion (grade a); microscopy: 1) normal white blood cell (WBC) count < 5/HPF; 2) normal red blood cell (RBC) count < 5/HPF; 3) sperm density: normal sperm density is about 20-60 million per milliliter. The above "clinical diagnostics" is only one example of a disease knowledge base corresponding to semen routine examination data and is not to be taken as a limitation of the present invention.
For example, the disease knowledge base of the present invention also includes "immunology", "clinical genitalia", etc.
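A knowledge base of reference ranges like the one above could be represented and queried as in the following sketch. The dictionary keys, the `flag_indices` function, and the three-way labels are hypothetical names chosen for illustration; only the numeric ranges come from the paragraph above.

```python
# Reference ranges quoted in the paragraph above; None means unbounded.
REFERENCE_RANGES = {
    "volume_ml": (2.0, 6.0),
    "ph": (7.2, 7.8),
    "wbc_per_hpf": (None, 5.0),
    "rbc_per_hpf": (None, 5.0),
    "density_million_per_ml": (20.0, 60.0),
}


def flag_indices(result):
    """Label each measured index as 'low', 'normal' or 'high'."""
    flags = {}
    for name, value in result.items():
        low, high = REFERENCE_RANGES[name]
        if low is not None and value < low:
            flags[name] = "low"
        elif high is not None and value > high:
            flags[name] = "high"
        else:
            flags[name] = "normal"
    return flags


flags = flag_indices({"volume_ml": 1.5, "ph": 7.5, "wbc_per_hpf": 8.0})
```

The resulting flags are the kind of "higher or lower than normal" judgment that the matching step can then compare against disease entries.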

In another aspect of the present invention, a system for predicting a disease based on semen routine examination data is provided, which comprises an acquisition module 11, a storage module 12, a calculation module 13 and a prediction model 14, wherein the acquisition module 11 is used for acquiring semen biochemical examination data, immunological examination data and vital sign information of a living body detection of a person to be detected; the storage module 12 is used for storing a disease knowledge base corresponding to semen routine examination data; the computing module 13 is configured to match the semen biochemical inspection data, the immunological inspection data, and the vital sign information of the living body detection of the person to be tested with the disease knowledge base, and normalize the semen biochemical inspection data and the immunological inspection data to obtain a feature vector of the person to be tested; the prediction model 14 is used for predicting the disease probability of the testee according to the feature vector.

In some embodiments of the present invention, the calculating module 13 performs semantic similarity calculation on the detected sign information of the living body according to the disease knowledge base to obtain a first feature vector.

Further, the calculating module 13 calculates the semantic similarity between the disease knowledge base and the sign information of the living body detection through Euclidean distance to obtain a second feature vector; and obtains the feature vector of the person to be tested according to the first feature vector and the second feature vector. Specifically, the trait information (color, properties, odor, volume, etc. of the semen) of each category in the subject's routine semen examination is acquired, together with the subject's symptom and sign information, such as testicular distending pain, vas deferens pain, and urgent, frequent or painful urination; this involves text feature extraction and semantic similarity calculation. Here, the feature items are selected by TF-IDF, and a semen trait feature information vector set and a symptom feature information vector set are established.

The main idea of TF-IDF is: if a word or phrase appears with a high term frequency (TF) in one article and rarely appears in other articles, it is considered to have good category discrimination and to be suitable for classification. The term frequency (TF) is the frequency with which a term (keyword) appears in a text. This number is usually normalized (typically the term count divided by the total number of words in the article) to prevent a bias toward long documents. The formula is:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the term $t_i$ in the file $d_j$; namely:

$$TF = \frac{\text{number of occurrences of the term in the article}}{\text{total number of words in the article}}$$

The fewer the documents containing the term t, the larger the IDF, and the better the category discrimination of the term. The formula is:

$$idf_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $|D|$ is the total number of files in the corpus and $|\{j : t_i \in d_j\}|$ is the number of files containing the term $t_i$ (i.e., the number of files for which $n_{i,j} \neq 0$). If the term is not in the corpus, the denominator would be zero, so $1 + |\{j : t_i \in d_j\}|$ is typically used. Namely:

$$IDF = \log\frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

1 is added to the denominator to keep it from being 0;

A high term frequency within a particular document, together with a low document frequency of that term across the whole document collection, yields a high TF-IDF weight. Thus, TF-IDF tends to filter out common words and preserve important words. The formula is:

TF-IDF = TF × IDF;
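The TF-IDF weighting described above, with the smoothed denominator that adds 1 to the document count, can be sketched in a few lines. The function name `tf_idf`, the natural logarithm, and the toy symptom documents are assumptions made for illustration; the documents are assumed to be already tokenized with stop words removed.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # TF = count / words in document; IDF uses the +1 denominator.
            term: (count / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return weights


docs = [
    ["testicular", "pain", "fever"],
    ["frequent", "urination", "pain"],
    ["waist", "soreness", "pain"],
    ["chills", "fever", "weakness"],
]
weights = tf_idf(docs)
```

In this toy corpus "pain" occurs in three of four documents and so receives a weight of zero, while the rarer "testicular" is weighted highest in the first document. With this smoothing, a term present in every document would get a slightly negative weight.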

Meanwhile, the semantic similarity between the feature information vector set to be identified and the feature information vector set in the database is calculated using cosine similarity.

Given two vectors in n-dimensional space, $A = (a_1, a_2, a_3, \ldots, a_n)$ and $B = (b_1, b_2, b_3, \ldots, b_n)$, their cosine similarity is:

$$\cos\theta = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$

Here, one of the vectors A and B can be understood as the feature vector of the subject in the previous embodiment; the other is the corresponding feature vector in the prediction model against which it is matched.
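Matching a subject's feature vector against stored vectors by cosine similarity can be sketched as follows. The function names, the disease labels, and the example vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here rather than something the text specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def best_match(query, knowledge_base):
    """Return the label whose stored vector is most similar to the query."""
    return max(
        knowledge_base,
        key=lambda label: cosine_similarity(query, knowledge_base[label]),
    )


# Hypothetical stored feature vectors, for illustration only.
kb = {"prostatitis": [1.0, 1.0, 0.0], "oligospermia": [0.0, 0.0, 1.0]}
match = best_match([1.0, 0.5, 0.0], kb)
```

Because cosine similarity depends only on direction, two symptom vectors with the same proportions but different magnitudes are treated as identical.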

Specifically, the RBF network of the present invention maps data non-linearly into a high-dimensional linear space through radial basis functions, and then performs fitting or regression with a linear model in that high-dimensional space. The network comprises three layers: the first layer is the input layer, with N nodes (i.e., features or data); the second layer is the hidden layer, with M nodes in total, each of which is an activation function that maps the data of the input layer non-linearly into the high-dimensional space; the third layer is the output layer, which outputs only one value. Here, the output of the RBF neural network is a predicted value of the synthax integral, and the possible pathological changes underlying the subject's semen abnormality are estimated based on the network output.

The specific steps are as follows. The network takes an input vector X (the vector corresponding to the second sample set), a corresponding target output vector Y (the vector corresponding to the disorder or disease), and the width vector $D_j$ of the radial basis function. For training on the l-th input sample (l = 1, 2, ..., N), the expression and calculation method of each parameter are as follows:

1) Determining the parameters.

(1) Determining the input vector X:

$X = [x_1, x_2, \ldots, x_n]^T$, where n is the number of input layer units;

(2) Determining the output vector Y and the desired output vector O:

$Y = [y_1, y_2, \ldots, y_q]^T$, where q is the number of output layer units;

$O = [o_1, o_2, \ldots, o_q]^T$;

(3) Initializing the connection weights from the hidden layer to the output layer:

$W_k = [w_{k1}, w_{k2}, \ldots, w_{kp}]^T$, $(k = 1, 2, \ldots, q)$;

where p is the number of hidden layer units and q is the number of output layer units.

Following the method used to initialize the centers, the weights from the hidden layer to the output layer are initialized as follows:

where $min_k$ is the minimum of all desired outputs of the k-th output neuron in the training set, and $max_k$ is the maximum of all desired outputs of the k-th output neuron in the training set.

(4) Initializing the center parameters of the hidden layer neurons, $C_j = [c_{j1}, c_{j2}, \ldots, c_{jn}]^T$. Different hidden layer neurons should have centers with different values, and the width corresponding to each center can be adjusted, so that different input features are reflected as fully as possible by different hidden layer neurons. In practical applications, an input always lies within a certain range of values. Without loss of generality, the initial values of the center components of the hidden layer neurons are spaced at equal intervals from small to large, so that weaker input information produces a stronger response near the smaller centers. The size of the interval can be adjusted through the number of hidden layer neurons. The advantage is that a reasonable number of hidden layer neurons can be found by trial and error, the centers are initialized as reasonably as possible, and different input features are reflected more clearly at different centers, which exhibits the characteristics of the Gaussian kernel.

Based on the above, the initial values of the center parameters of the RBF neural network are:

where p is the total number of hidden layer neurons, j = 1, 2, ..., p; $min_i$ is the minimum value of all input information of the i-th feature in the training set, and $max_i$ is the maximum value of all input information of the i-th feature in the training set.
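One equally spaced scheme consistent with this description places the p centers of each feature at the midpoints of p equal sub-intervals of [min_i, max_i], that is, c_ji = min_i + (2j - 1)(max_i - min_i) / (2p). This exact expression is an assumption made for illustration, not a formula taken from the patent, and `init_centers` is a hypothetical name.

```python
def init_centers(inputs, p):
    """Return p center vectors, equally spaced per feature.

    Assumed scheme: center j of feature i sits at the midpoint of the
    j-th of p equal sub-intervals of [min_i, max_i].
    """
    n_features = len(inputs[0])
    mins = [min(row[i] for row in inputs) for i in range(n_features)]
    maxs = [max(row[i] for row in inputs) for i in range(n_features)]
    centers = []
    for j in range(1, p + 1):
        centers.append([
            mins[i] + (maxs[i] - mins[i]) * (2 * j - 1) / (2 * p)
            for i in range(n_features)
        ])
    return centers


X = [[0.0, 0.2], [0.5, 0.6], [1.0, 1.0]]
centers = init_centers(X, p=4)
```

Increasing p shrinks the interval between neighboring centers, which matches the remark that the spacing is adjusted through the number of hidden layer neurons.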

(5) Initializing the width vector $D_j = [d_{j1}, d_{j2}, \ldots, d_{jn}]^T$. The width vector determines the range over which a neuron acts on the input information: the smaller the width, the narrower the shape of the corresponding hidden layer neuron's action function, and the smaller the response of that neuron to information near the centers of other neurons. The calculation method is:

where $d_f$ is the width adjustment coefficient, whose value is less than 1, so that each hidden layer neuron can more easily sense local information, which improves the local response ability of the RBF neural network.
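A width initialization in this spirit can be sketched by scaling the root-mean-square distance between each feature's training values and the corresponding center by the coefficient d_f. Both the RMS form and the default `d_f=0.5` are assumptions made for illustration; the text only states that d_f is less than 1.

```python
import math


def init_widths(inputs, centers, d_f=0.5):
    """Assumed rule: d_ji = d_f * RMS distance of feature i from c_ji."""
    n_samples = len(inputs)
    widths = []
    for center in centers:
        widths.append([
            d_f * math.sqrt(
                sum((row[i] - c_i) ** 2 for row in inputs) / n_samples
            )
            for i, c_i in enumerate(center)
        ])
    return widths


X = [[0.0], [1.0]]
widths = init_widths(X, centers=[[0.25], [0.75]], d_f=0.5)
```

A smaller d_f narrows every basis function, so each hidden neuron responds mainly to inputs close to its own center.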

2) Calculating the output value $z_j$ of the j-th neuron of the hidden layer.

$C_j$ is the center vector of the j-th hidden layer neuron, composed of the center components of that neuron corresponding to all inputs of the input layer, $C_j = [c_{j1}, c_{j2}, \ldots, c_{jn}]^T$; $D_j$ is the width vector of the j-th hidden layer neuron, corresponding to $C_j$, $D_j = [d_{j1}, d_{j2}, \ldots, d_{jn}]^T$. The larger $D_j$ is, the larger the range of influence of the hidden layer on the input vector and the smoother the transition between neurons; $\|\cdot\|$ denotes the norm.
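A commonly used Gaussian form that is consistent with the symbols defined above, given here as an assumption rather than as the patent's exact expression, divides the distance to the center component-wise by the width:

```latex
z_j = \exp\left(-\left\|\frac{X - C_j}{D_j}\right\|^2\right)
    = \exp\left(-\sum_{i=1}^{n}\left(\frac{x_i - c_{ji}}{d_{ji}}\right)^2\right),
\qquad j = 1, 2, \ldots, p
```

Under this form $z_j$ equals 1 when the input coincides with the center and decays toward 0 as the input moves away, at a rate set by the width components $d_{ji}$.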

3) Calculating the output of the output layer neurons.

$Y = [y_1, y_2, \ldots, y_q]^T$, where $w_{kj}$ is the adjustment weight between the k-th neuron of the output layer and the j-th neuron of the hidden layer.
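Steps 2) and 3) together form the forward pass of the network. The sketch below assumes a Gaussian basis function with component-wise widths and a purely linear output layer, y_k = sum over j of w_kj * z_j; these forms and the function names are illustrative assumptions rather than the patent's stated formulas.

```python
import math


def hidden_outputs(x, centers, widths):
    """Gaussian response z_j of every hidden neuron to input x."""
    return [
        math.exp(-sum(((xi - c) / d) ** 2 for xi, c, d in zip(x, cj, dj)))
        for cj, dj in zip(centers, widths)
    ]


def network_outputs(x, centers, widths, weights):
    """Linear output layer: y_k = sum_j w_kj * z_j for each output k."""
    z = hidden_outputs(x, centers, widths)
    return [sum(w * zj for w, zj in zip(wk, z)) for wk in weights]


centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [[1.0, 1.0], [1.0, 1.0]]
weights = [[0.5, 2.0]]  # one output neuron, two hidden neurons
y = network_outputs([0.0, 0.0], centers, widths, weights)
```

The input here sits exactly on the first center, so the first hidden neuron responds fully while the second contributes only a small, distance-damped amount.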

4) Iterative calculation of the weight parameters.

The weight parameters of the RBF neural network are trained by gradient descent. The centers, widths and adjustment weights are adaptively adjusted to their optimal values through learning, and the iterative calculation is as follows:

$w_{kj}(t)$ is the adjustment weight between the k-th output neuron and the j-th hidden layer neuron at the t-th iteration; $c_{ji}(t)$ is the center component of the j-th hidden layer neuron for the i-th input neuron at the t-th iteration; $d_{ji}(t)$ is the width corresponding to the center $c_{ji}(t)$; and η is the learning factor.

E is the evaluation function of the RBF neural network, where $O_{lk}$ is the desired output value of the k-th output neuron for the l-th input sample, and $y_{lk}$ is the network output value of the k-th output neuron for the l-th input sample.
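The training procedure can be sketched as plain batch gradient descent on the weights, centers and widths, stopping once the error falls below a threshold. Several details are assumptions made for illustration: the error is taken as E = 1/2 times the sum over samples and outputs of (O_lk - y_lk)^2, the basis function is a Gaussian with component-wise widths, no momentum term is used, and the names, learning factor and toy data are hypothetical.

```python
import math


def forward(x, C, D, W):
    """Return hidden outputs z and network outputs y for one sample."""
    z = [math.exp(-sum(((xi - c) / d) ** 2 for xi, c, d in zip(x, cj, dj)))
         for cj, dj in zip(C, D)]
    y = [sum(w * zj for w, zj in zip(wk, z)) for wk in W]
    return z, y


def total_error(X, O, C, D, W):
    """Assumed evaluation function: E = 1/2 * sum of (o - y)^2."""
    return 0.5 * sum(
        (o - y) ** 2
        for x, target in zip(X, O)
        for o, y in zip(target, forward(x, C, D, W)[1])
    )


def train(X, O, C, D, W, eta=0.05, threshold=1e-3, max_iter=2000):
    """Batch gradient descent on W, C and D (all updated in place)."""
    for _ in range(max_iter):
        if total_error(X, O, C, D, W) < threshold:
            break
        gW = [[0.0] * len(C) for _ in W]
        gC = [[0.0] * len(cj) for cj in C]
        gD = [[0.0] * len(dj) for dj in D]
        for x, target in zip(X, O):
            z, y = forward(x, C, D, W)
            err = [o - yk for o, yk in zip(target, y)]
            for j, zj in enumerate(z):
                # Error signal reaching hidden neuron j through all outputs.
                back = sum(err[k] * W[k][j] for k in range(len(W)))
                for k in range(len(W)):
                    gW[k][j] -= err[k] * zj
                for i, xi in enumerate(x):
                    diff = xi - C[j][i]
                    gC[j][i] -= back * zj * 2 * diff / D[j][i] ** 2
                    gD[j][i] -= back * zj * 2 * diff ** 2 / D[j][i] ** 3
        for k in range(len(W)):
            for j in range(len(C)):
                W[k][j] -= eta * gW[k][j]
        for j in range(len(C)):
            for i in range(len(C[j])):
                C[j][i] -= eta * gC[j][i]
                D[j][i] -= eta * gD[j][i]
    return total_error(X, O, C, D, W)


# Toy one-feature problem: a single bump centered at x = 0.5.
X = [[0.0], [0.5], [1.0]]
O = [[0.0], [1.0], [0.0]]
C = [[0.25], [0.75]]
D = [[0.5], [0.5]]
W = [[0.1, 0.1]]
initial = total_error(X, O, C, D, W)
final = train(X, O, C, D, W, max_iter=500)
```

The stopping test mirrors the criterion of training until the deviation between the output value and the true value is lower than a threshold; in practice the error on the verification set would also be monitored to guard against overfitting.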

The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. The method for constructing the disease prediction model based on semen routine inspection data is characterized by comprising the following steps:

acquiring semen biochemical examination data, immunological examination data and vital sign information of a sample crowd to form a first sample set;

data cleaning and standardization are carried out on the first sample set according to a disease knowledge base corresponding to semen routine inspection data, so that a second sample set is formed: removing data which do not accord with biological rules and contradictory data from semen biochemical inspection data according to a disease knowledge base, normalizing the semen biochemical inspection data, and mapping the semen biochemical inspection data onto [0,1]; rejecting the data of which the immunological check data do not accord with the immunological rule and the data contradicting each other according to a disease knowledge base, normalizing the immunological check data, and mapping the immunological check data to [0,1]; according to the disease knowledge base, carrying out semantic similarity calculation on the detected vital sign information of the living body to obtain a corresponding characteristic value of the detected vital sign information of the living body, and eliminating data with low correlation with semen related diseases;

dividing the second sample set into a training set and a verification set, and then taking the training set as the input of a radial basis function neural network;

and training the radial basis function neural network until the deviation between the output value and the true value is lower than a threshold value, and obtaining a disease prediction model.

2. The method of claim 1, wherein the second sample set comprises normalized semen biochemical test data and immunological test data, and characteristic values of vital sign information of a living body test.

3. A system for predicting a disease model based on semen routine examination data is characterized by comprising an acquisition module, a storage module, a calculation module and a prediction model,

the acquisition module is used for acquiring semen biochemical examination data, immunological examination data and vital sign information of living body detection of a person to be detected;

the storage module is used for storing a disease knowledge base corresponding to the semen routine inspection data;

the calculation module is used for matching the semen biochemical examination data, the immunological examination data and the vital sign information of the living body detection of the testee with the disease knowledge base, and normalizing the semen biochemical examination data and the immunological examination data to obtain a feature vector of the testee;

the prediction model is used for predicting the disease probability of a person to be tested according to the feature vector, and comprises a model constructed by the disease prediction model construction method based on semen routine inspection data according to any one of claims 1-2.

4. The system of claim 3, wherein the computing module performs semantic similarity computation on the vital sign information of the living body detection according to the disease knowledge base to obtain a first feature vector.

5. The system of claim 4, wherein the computing module computes semantic similarity between the disease knowledge base and vital sign information of the living body test by euclidean distance to obtain a second feature vector; and obtaining the feature vector of the person to be tested according to the first feature vector and the second feature vector.

6. A system of disease prediction models based on semen routine inspection data according to claim 3, wherein the prediction models comprise trained radial basis function neural networks.