CN109243618B

CN109243618B - Medical model construction method, disease label construction method and intelligent device

Info

Publication number: CN109243618B
Application number: CN201811062782.6A
Authority: CN
Inventors: 陈志刚; 王万新; 苏丽娟; 孙继超
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-09-12
Filing date: 2018-09-12
Publication date: 2020-06-16
Anticipated expiration: 2038-09-12
Also published as: CN109243618A

Abstract

The embodiment of the invention discloses a construction method of a medical model, a construction method of a disease label and intelligent equipment, wherein the construction method of the medical model comprises the following steps: determining a supervision tag for a target user, and acquiring internet data associated with a user identification code of the target user; determining a training text set of a target user according to the internet data; determining medical keywords from the training text set, and optimizing a first initial model based on the medical keywords and the supervision labels to obtain a first model; meanwhile, acquiring relevant words included in the training text set, and optimizing a second initial model based on the relevant words and the supervision labels to obtain a second model; and finally, constructing a disease label model according to the first model and the second model. By adopting the embodiment of the invention, the disease label can be constructed for the Internet user.

Description

Medical model construction method, disease label construction method and intelligent device

Technical Field

The invention relates to the technical field of image processing, in particular to a medical model construction method, a disease label construction method and intelligent equipment.

Background

In today's information age, with the rapid development of electronic and computer technology, machine learning has become a research hotspot in the field of artificial intelligence. A common form of machine learning is supervised learning, which learns a model or function from a given training data set and uses that model to predict a result when new data arrives. In other words, supervised learning tells the machine when its output is wrong during learning, so that the algorithm can reduce its errors. Training machine models for classification and prediction by supervised learning has become a focus of machine learning research.

Disclosure of Invention

The embodiment of the invention provides a medical model construction method, a disease label construction method and a disease label construction device, which can be used for constructing a disease label for an internet user.

In one aspect, an embodiment of the present invention provides a method for constructing a medical model, including:

determining a supervision tag for a target user, and acquiring internet data associated with a user identification code of the target user;

determining a training text set of the target user according to the internet data;

determining medical keywords from the training text set, and optimizing a first initial model based on the medical keywords and the supervision labels to obtain a first model;

acquiring relevant words included in the training text set, and optimizing a second initial model based on the relevant words and the supervision labels to obtain a second model;

and constructing a disease label model according to the obtained first model and the second model.

On the other hand, the embodiment of the invention also provides a disease label construction method, which comprises the following steps:

acquiring internet data of a user to be detected;

determining medical characteristic information from the internet data of the user to be detected, and inputting the medical characteristic information into a first model in a disease label model for identification to obtain a first identification result;

determining relevant word characteristic information from the internet data of the user to be detected, and inputting the relevant word characteristic information into a second model in the disease label model for recognition to obtain a second recognition result;

and processing the first identification result and the second identification result to obtain the disease label of the user to be detected.

In another aspect, an embodiment of the present invention further provides a device for constructing a medical model, including an obtaining unit and a processing unit:

the acquisition unit is used for acquiring internet data associated with the user identification code of the target user;

the processing unit is configured to:

determining a supervision tag for a target user;

In another aspect, an embodiment of the present invention further provides a disease label constructing apparatus, including an obtaining unit and a processing unit:

the acquisition unit is used for acquiring the internet data of the user to be detected;

the processing unit is configured to:

determining medical characteristic information from the internet data of the user to be detected;

inputting the medical characteristic information into a first model in a disease label model for recognition to obtain a first recognition result;

determining relevant word characteristic information from the internet data of the user to be detected;

inputting the relevant word feature information into a second model in the disease label model for recognition to obtain a second recognition result;

In another aspect, an embodiment of the present invention provides an intelligent device, including: a processor and a memory for storing a computer program comprising first program instructions, the processor being configured to invoke the first program instructions to perform the above-described construction method of a medical model; alternatively, the computer program comprises second program instructions, and the processor is configured to invoke the second program instructions to perform the above-described disease label construction method.

Correspondingly, the embodiment of the invention also provides a computer storage medium, wherein the computer storage medium stores first computer program instructions, and the first computer program instructions are used for executing the construction method of the medical model when being executed by a processor; or the computer storage medium has stored therein second computer program instructions for executing the above-mentioned disease tag construction method when executed by a processor.

According to the embodiment of the invention, after the supervision tag is determined for the target user, the Internet data corresponding to the user identification code of the target user is obtained as the training text set, the medical keywords and the associated words are determined from the training text set to respectively carry out optimization training on the first initial model and the second initial model so as to obtain the first model and the second model, and finally the disease tag model is constructed according to the first model and the second model, so that the disease tag model can be ensured to have higher accuracy and wider coverage, and the accuracy of the disease tag model for carrying out disease estimation on a new user based on the Internet data is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1a is an architecture diagram of a disease tag construction system provided by an embodiment of the present invention;

FIG. 1b is a schematic flow chart of a disease tag construction provided by an embodiment of the present invention;

FIG. 1c is a schematic flow chart of a method for constructing a medical model according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for constructing a medical model according to an embodiment of the present invention;

FIG. 3 is a flow chart of a method for optimizing a model according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart diagram of another model-based optimization method provided by the embodiment of the invention;

FIG. 5 is a schematic flow chart of a disease tag construction method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an apparatus for constructing a medical model according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a disease tag constructing apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an intelligent device according to an embodiment of the present invention.

Detailed Description

When big data is used to analyze user behavior and set corresponding labels for users, models for classifying and predicting user labels can be built by machine learning. After labels are set for a user based on these models, content can be recommended or targeted services provided to the user based on the labels. For example, goods of interest are recommended based on the user's commodity consumption label, and articles, treatment methods, treatment institutions and the like related to the disease identified by the user's disease label are recommended based on that disease label. In one embodiment, to set a user's disease label, a disease label model can be trained on data from a large number of sick users and healthy users, and the internet data of the user to be detected is then analyzed with the disease label model to obtain that user's disease label.

Fig. 1a is an architecture diagram of disease label construction according to an embodiment of the present invention. As the diagram shows, setting a disease label for a user includes two parts: the first part is the construction of the disease label model, and the second part is the construction of the disease label for the user. When the disease label model is constructed, a large number of patients can be selected from a medical database, such as that of an offline medical institution or an online registration website, and each patient is labeled with his or her specific disease to obtain a disease labeling result (i.e., a supervision label). At the same time, a large number of healthy users are selected and given healthy labels to obtain healthy label results (also supervision labels).

The sick users and healthy users serve as training users. Their internet data is obtained and, user by user, input as training data into an initial disease label model to obtain the data analysis result of the initial disease label model. A disease label is determined for the corresponding training user from the data analysis result, and the initial disease label model is optimized according to the disease label determined for the training user and the label added for that training user, yielding the disease label model.

When a user disease label is set, internet data of a user to be detected is obtained, the internet data is input into the disease label model to be recognized, a recognition result is obtained, and the disease label of the user to be detected can be obtained by analyzing the recognition result.

In one embodiment, to set a user's commodity consumption label, commodity categories can be preset and specific articles classified into the corresponding categories, and a commodity consumption label model is then obtained by training on the commodity consumption data of a large number of internet users. After the commodity consumption label model is obtained, it can be used to analyze the internet data of a given user, determine which categories of commodities the user is interested in, and set a commodity consumption label for the user.

The model for classification and prediction constructed by using machine learning in the embodiment of the invention can be applied to various fields, and the construction method of the medical model in the embodiment of the invention is described below by taking the construction of the disease label model as an example.

The machine learning method can capture a wider range of user characteristics, so the population covered by the disease label model is wider. In an embodiment of the invention, the disease label model is constructed from a medical knowledge model and an internet data analysis model (alternatively referred to as a machine learning model). The medical knowledge model uses information such as disease aliases, medicines and treatment modes as training features; although its coverage is not high, it ensures the accuracy with which the model identifies diseases. The internet data analysis model analyzes and identifies various kinds of internet data, such as a user's internet reading data, information attention data, information release data and keyword search results, and can broadly cover the identification of various user characteristics. By combining the two models to generate the disease label model, the embodiment of the invention can meet the requirements of both identification accuracy and user coverage.

Referring to fig. 1b, a schematic flow chart of constructing a disease label model according to an embodiment of the present invention is shown. In the embodiment of the invention, a large number of sick users are selected as positive sample data, a large number of healthy users are selected as negative sample data, a first initial model (also called a medical knowledge model) and a second initial model (also called a machine learning model) are respectively trained to obtain the first model and the second model, then a disease label model is constructed according to the first model and the second model, and the obtained disease label model is verified. Therefore, when the disease label model is used for identifying the diseases of the internet data of the user to be detected, the accuracy of the identification result can be ensured.

Referring to fig. 1c, a schematic flow chart of a method for constructing a medical model according to an embodiment of the present invention is shown. In this method, a supervision tag of a target user is determined first; for example, the supervision tag of the target user is a tumor disease tag, or the supervision tag of the target user is a health tag (i.e., the user does not have any disease). The internet data of the target user is then selected as sample data. If the supervision tag of the target user is a tag of a certain disease, the internet data of the target user is positive sample data; if the supervision tag of the target user is the health tag, the internet data of the target user is negative sample data. Using a large amount of positive and negative sample data from such users makes the training of the disease label model more accurate.

After sample data of a target user is obtained, a training text set of the target user is determined according to the sample data, then medical characteristic information and associated word characteristic information are respectively obtained from the training text set, the medical characteristic information and a supervision tag are input into a first initial model, the first initial model is optimized to obtain a first model, the associated word characteristic information and the supervision tag are input into a second initial model, and the second initial model is optimized to obtain a second model. And finally, constructing a disease label model according to the first model and the second model. The disease label model can be used for identifying internet data of the user to be detected and classifying and predicting diseases of the user to be detected. In one embodiment, the training text set of the target user is mainly documents generated by data such as articles read by the target user on the internet, published articles, comments published on a medical website, questions posed, and the like. From these documents, the medical keywords and the related words related to the medicine can be obtained, and further the medical characteristic information and the related word characteristic information can be obtained.
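The way the two trained models sit inside one disease label model can be sketched as follows. This is a minimal illustration only: the `Recognizer` interface, the `HEALTHY` label name, and in particular the rule for combining the two results (prefer the first model when it reports a disease, otherwise fall back to the second) are assumptions made for the example and are not specified in the text above.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# Hypothetical interface: a recognizer maps feature information
# (word -> feature value) to a disease label, or None when undecided.
Recognizer = Callable[[Mapping[str, float]], Optional[str]]

HEALTHY = "healthy"  # hypothetical name for the health label


@dataclass
class DiseaseLabelModel:
    first_model: Recognizer   # trained on medical class feature information
    second_model: Recognizer  # trained on associated word feature information

    def predict(
        self,
        medical_features: Mapping[str, float],
        associated_features: Mapping[str, float],
    ) -> str:
        first = self.first_model(medical_features)
        second = self.second_model(associated_features)
        # Assumed combination rule: trust the first model when it reports
        # a disease, otherwise fall back to the second model's result.
        if first is not None and first != HEALTHY:
            return first
        if second is not None:
            return second
        return HEALTHY
```

Any two callables with this shape, for example wrappers around separately trained classifiers, could be plugged in as `first_model` and `second_model`.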

In the method for constructing a medical model shown in fig. 1c, after the disease label model is constructed according to the first model and the second model, the disease label model may be verified. The verification of the disease label model may be implemented as follows: acquiring a target verification user, and determining a verification label for the target verification user; acquiring a verification text set corresponding to the target verification user; determining medical characteristic information from the verification text set, and inputting the medical characteristic information into the first model for recognition to obtain a first recognition result; determining relevant word feature information from the verification text set, and inputting the relevant word feature information into the second model for recognition to obtain a second recognition result; determining a verification disease label of the target verification user according to the first recognition result and the second recognition result; and finally, judging whether the verification is passed according to the verification disease label and the verification label of the target verification user.

In one embodiment, a verification user set may be selected first, for example by selecting a number of users with a disease and an equal number of healthy users; the target verification user is any user in the verification user set. The process of verifying the disease label model using the verification text set of the target verification user may be: acquiring internet data of the target verification user according to a network identification code of the target verification user, such as a mobile phone number or a mobile phone identification code, and determining a verification text set from the internet data; extracting medical characteristic information and associated word characteristic information from the verification text set, and inputting them into the first model and the second model of the disease label model respectively to obtain a first recognition result and a second recognition result; and finally, obtaining a disease label of the target verification user according to the first recognition result and the second recognition result. Whether the disease label of the target verification user matches the verification label of the target verification user is then judged: if they match, the verification of the disease label model with the target verification user passes; if not, it fails. After the disease label model has been verified with the target verification user, it is verified with the other verification users in the verification user set.

Once the disease label model has been verified with all verification users in the verification user set, the verification pass rate, i.e., the accuracy of the disease label model, is counted. If the accuracy is greater than a preset accuracy threshold, the disease label model passes verification; that is, training of the disease label model is complete and it can be used to construct disease labels for users. If the accuracy is not greater than the preset accuracy threshold, the disease label model fails verification and needs to be retrained.
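The pass-rate check described above can be sketched as follows. The function name `verify_model`, the representation of a verification user as a `(verification_text_set, verification_label)` pair, and the `predict` callable are hypothetical; only the "strictly greater than the threshold" rule comes from the description.

```python
from typing import Callable, Iterable, List, Tuple


def verify_model(
    predict: Callable[[List[str]], str],
    verification_users: Iterable[Tuple[List[str], str]],
    accuracy_threshold: float,
) -> Tuple[float, bool]:
    """Return (accuracy, passed) over all verification users."""
    users = list(verification_users)
    if not users:
        raise ValueError("verification user set is empty")
    # A user passes when the predicted disease label matches the
    # verification label determined for that user in advance.
    passed_count = sum(1 for text_set, label in users if predict(text_set) == label)
    accuracy = passed_count / len(users)
    # The model passes only if accuracy is strictly greater than the threshold.
    return accuracy, accuracy > accuracy_threshold
```

If the returned flag is false, the model would be retrained and verified again.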

Referring to fig. 2 again, the flow diagram of the method for constructing a medical model according to the embodiment of the present invention is shown, and the method for constructing a medical model according to the embodiment of the present invention may be applied to the fields of health insurance underwriting, health insurance recommendation, intelligent medical services, and the like, and may be specifically implemented by an intelligent device, for example, a server capable of collecting network data. In other embodiments, the construction method of the medical model according to the embodiments of the present invention may also be applied to other application scenarios that require disease classification and prediction for a user.

The construction method of the medical model shown in fig. 2 builds the disease label model by supervised learning. As shown in fig. 2, in S201 the smart device determines a supervision tag for a target user and acquires internet data associated with a user identification code of the target user. In one embodiment, the target user may be a sick user or a healthy user. Determining the supervision tag for the target user may be understood as determining whether the target user is a sick user or a healthy user. If the target user is a sick user, a specific disease label is added for the target user, marking the disease category the target user suffers from; if the target user is a healthy user, a healthy label is added for the target user.

The disease label classification system is relatively fixed and does not change with the application scenario. In one embodiment, the classification of disease labels can be determined in consultation with medical experts, based on factors such as the prevalence of diseases among residents, the severity and burden of the diseases, and life stage. In one embodiment, disease labels fall into three broad categories: children's disease labels, adult chronic disease labels, and health labels for pregnant women. Each broad category is subdivided into a number of specific diseases; the detailed classification is shown in table 1, and the classification items in table 1 can be used as specific disease labels, such as an influenza label, a lung cancer label, and the like.

TABLE 1 disease tag Classification

In one embodiment, before determining a supervision label for a target user in S201, a user set used for medical model construction is selected first, then a plurality of users in the user set are respectively used as target users, sample data of the target users are used as training data, and a disease label model is optimally trained by using the construction method of a medical model according to the embodiment of the present invention.

To give the disease label model higher accuracy, the constructed user set can contain both sick users and healthy users. The sick users in the user set may be patients recorded as having a target disease in the medical databases of certain hospitals or in other databases, and the healthy users in the user set may be a subset of the users not recorded as having any disease. The target disease belongs to one of the disease label categories in table 1. For example, sick users may be identified by crawling and analyzing the access records of a registration website and extracting the user identifiers, such as the International Mobile Equipment Identity (IMEI) of the user's mobile phone, associated with a registration record or registration intention for the department corresponding to each disease label in table 1; the sick users can then be determined from these identifiers. For example, assuming that 1000 mobile phone IMEIs registered with an oncology department are extracted from a registration website, the 1000 corresponding patients with tumor diseases can be found based on those IMEIs.
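As a rough sketch of how registration records could be turned into supervision labels, the following groups records by user identifier and maps each department to a disease label. The record layout `(user_identifier, department)`, the function name, and the department-to-label mapping are all hypothetical; the real records and the contents of table 1 are not reproduced here.

```python
from typing import Dict, Iterable, Mapping, Set, Tuple


def label_sick_users(
    registrations: Iterable[Tuple[str, str]],
    department_to_label: Mapping[str, str],
) -> Dict[str, Set[str]]:
    """Map each user identifier to the disease labels implied by its registrations."""
    labels: Dict[str, Set[str]] = {}
    for user_id, department in registrations:
        label = department_to_label.get(department)
        if label is None:
            continue  # department does not correspond to any disease label
        labels.setdefault(user_id, set()).add(label)
    return labels
```

Repeated registrations by the same identifier collapse into a single label entry, so each identifier yields one training user.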

After the target user is acquired and the supervision tag is determined for the target user, internet data associated with the user identification code of the target user is acquired in S201. It can be understood that the acquired internet data is sample data, if the target user is a sick user, the acquired internet data is positive sample data, and if the target user is a healthy user, the acquired internet data is negative sample data. In one embodiment, in order to ensure the accuracy of constructing the disease label model, enough internet data associated with the sick user needs to be selected as the positive sample data, and the disease of the sick user should cover part or all of the disease labels in table 1, or even more disease labels. Meanwhile, enough internet data associated with healthy users also needs to be selected as negative sample data, and the number of the positive sample data is the same as that of the negative sample data under normal conditions. In general, in order to ensure the accuracy of the disease label model, a large number of sick users and a large number of healthy users need to be selected when selecting a user set.
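The selection of equal numbers of positive and negative samples can be sketched as below. The function name, the `HEALTHY_LABEL` constant, the tuple layout, and the use of seeded random sampling are assumptions for illustration; the text only requires that the two sample counts normally be the same.

```python
import random
from typing import List, Mapping, Sequence, Tuple

HEALTHY_LABEL = "healthy"  # hypothetical name for the health label


def build_sample_set(
    sick_users: Mapping[str, str],    # user identifier -> disease label
    healthy_user_ids: Sequence[str],
    seed: int = 0,
) -> List[Tuple[str, str, bool]]:
    """Return (user_id, supervision_label, is_positive) with balanced classes."""
    rng = random.Random(seed)
    n = min(len(sick_users), len(healthy_user_ids))
    # Draw the same number of sick and healthy users.
    positives = rng.sample(sorted(sick_users.items()), n)
    negatives = rng.sample(sorted(healthy_user_ids), n)
    samples = [(user_id, label, True) for user_id, label in positives]
    samples += [(user_id, HEALTHY_LABEL, False) for user_id in negatives]
    return samples
```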

After the supervision label of the target user is determined and the sample data of the target user is acquired, in S202, a training text set of the target user needs to be determined according to the internet data. The training text set refers to a set of all text contents included in internet data of a target user, and the training text set can include text contents of keyword search, keyword attention, article reading, information attention, state publication and the like of the target user. It is understood that the internet data of the target user may include text content such as basic information of the target user, keyword attention, article reading, and other non-text information. Since the non-text information does not affect the model training, in order to improve the model training efficiency, a training text set of the target user for model training may be determined from the internet data, where the training text set includes text content and does not include the non-text information of the target user, and the non-text information may be, for example, content of an image, a nickname, a location, and the like.
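A minimal sketch of step S202 follows, assuming the internet data of a user arrives as a dictionary of named fields. The field names in `TEXT_FIELDS` are hypothetical stand-ins for the text content listed above; non-text fields such as image, nickname, and location are simply never read.

```python
from typing import Any, List, Mapping

# Hypothetical field names for the text content kept for training.
TEXT_FIELDS = (
    "keyword_search",
    "keyword_attention",
    "article_reading",
    "information_attention",
    "status_publication",
)


def build_training_text_set(internet_data: Mapping[str, Any]) -> List[str]:
    """Collect the text content of one user's internet data."""
    texts: List[str] = []
    for field in TEXT_FIELDS:
        for item in internet_data.get(field, []):
            # Keep only non-empty strings.
            if isinstance(item, str) and item.strip():
                texts.append(item.strip())
    return texts
```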

For example, table 2 shows the acquired internet data of two target users, and in the step S202, text contents such as keyword attention, reading attention, information attention and the like may be selected to form two target user training text sets.

TABLE 2 user's Internet data

In S201 and S202, the supervision label and the training text set are determined; the intelligent device then optimizes the first initial model and the second initial model based on the supervision label and the training text set to obtain the first model and the second model. To optimize the first initial model, in S203 medical class keywords are determined from the training text set, and the first initial model is optimized based on the medical class keywords and the supervision label to obtain the first model. Medical class keywords are keywords directly related to the diseases covered by the disease labels in table 1, such as professional names (e.g., cancer), drugs, or treatment methods (e.g., chemotherapy, radiotherapy). In one embodiment, the feature words of the diseases covered by the various disease labels can be determined by communicating with a professional doctor and/or by medical ethnicity, and a medical class feature word set is then established. The training text set is then searched for words belonging to the medical class feature word set, and the words found are taken as medical class keywords.

In one embodiment, optimizing the first initial model based on the medical class keywords and the supervision label in S203 may be implemented as follows: generating medical class feature information based on the medical class keywords, and optimizing the first initial model according to the medical class feature information and the supervision label. Fig. 3 is a flowchart of a method for optimizing the first initial model according to an embodiment of the present invention. As shown in fig. 3, optimizing the first initial model may include: S301, determining a medical class feature word set according to the disease labels; S302, extracting the medical class keywords included in the training text set based on the medical class feature word set, and generating medical class feature information corresponding to the medical class keywords; S303, optimizing the first initial model according to the medical class feature information and the supervision label. The medical class feature information comprises the medical class keywords and their corresponding keyword feature values, where a keyword feature value represents the importance of a medical class keyword in the training text set. In one embodiment, the keyword feature value corresponding to a medical class keyword may be the number of times the keyword appears in the training text set. For example, if the medical class keyword is "breast cancer" and a text segment of the target user is "cancer effect of breast cancer intervention therapy", the keyword feature value corresponding to the medical class keyword is 1. In one embodiment, the process of building and optimizing the first initial model may be as described in detail below.

In one embodiment, determining the medical class feature word set according to the disease labels in S301 may be implemented as follows: keywords such as professional names, drugs or treatment methods associated with the various diseases identified by the disease labels are obtained through communication with medical experts; a medical class feature word set is then established from the various diseases and their associated keywords, so that it comprises the medical class feature words related to the diseases identified by the disease labels. For example, the keywords in the medical class feature word set associated with the tumor disease label, obtained through communication with oncology specialists, are: tumor, cancer, leukemia, chemotherapy, radiation therapy, resection, metastasis, targeting, early stage, late stage, survival, non-small cell, apatinib, sorafenib, capecitabine, temozolomide, and the like. Based on these keywords in the medical class feature word set, medical class keywords can be determined from texts such as "cancer effect of breast cancer intervention therapy".
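The feature word lookup and occurrence-count feature values can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name is hypothetical, and plain substring counting with `str.count` is used in place of any word segmentation that a real system for Chinese text might apply.

```python
from typing import Dict, Iterable, List


def medical_feature_info(
    training_text_set: List[str],
    feature_words: Iterable[str],
) -> Dict[str, int]:
    """Map each medical class keyword found in the texts to its occurrence count."""
    features: Dict[str, int] = {}
    for word in feature_words:
        # str.count counts non-overlapping substring occurrences.
        count = sum(text.count(word) for text in training_text_set)
        if count > 0:
            features[word] = count  # keyword feature value
    return features


texts = ["cancer effect of breast cancer intervention therapy"]
print(medical_feature_info(texts, ["breast cancer", "chemotherapy", "leukemia"]))
```

With this simple counting, a shorter feature word such as "cancer" is also counted inside "breast cancer" in the English example, so its count for the text above would be 2.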

In an embodiment, extracting the medical class keywords included in the training text set based on the medical class feature word set and generating the corresponding medical class feature information in S302 may be implemented as follows: extracting medical class keywords from the training text set according to the feature extraction regular expression corresponding to each disease identified by the disease labels in the medical class feature word set; taking the number of occurrences of each extracted medical class keyword in the training text set as the keyword feature value corresponding to that keyword; and generating the medical class feature information based on the medical class keywords and their keyword feature values. A regular expression is a "rule string" composed of predefined specific characters and combinations of those characters that expresses filtering logic for strings; it is a text pattern describing one or more strings to be matched when searching text.

The medical class feature word set comprises the various diseases identified by the three disease label classifications mentioned above, together with keywords such as professional names, drugs, or treatment methods related to those diseases. A set of feature extraction regular expressions is preset for each disease, so the medical class feature word set may comprise multiple sets of feature extraction regular expressions. All text content in the training text set is matched against each feature extraction regular expression in the medical class feature word set, and the matched medical class feature words are the medical class keywords of the training text set.

For example, the various diseases identified by the tumor disease label and the feature extraction regular expressions corresponding to those diseases may be as shown in Table 3. In each feature extraction regular expression shown in Table 3, "|" indicates an OR operation, "." matches any character except carriage return and line feed, "*" matches the preceding sub-expression any number of times, and "()" matches and captures the content in parentheses. Assuming that a text in the training text set is "cancer effect of breast cancer interventional therapy", this text is matched against each feature extraction regular expression in Table 3; the matched medical class feature words are "cancer" and "breast cancer", so "cancer" and "breast cancer" are taken as medical class keywords.

TABLE 3 Feature extraction regular expressions for various diseases under the tumor label

Tumor label	Feature extraction regular expression
Stomach cancer	(stomach|pylorus|cardia).|(carcinoma|tumor|(malignancy)|sarcoma)
Liver cancer	(biliary|liver|common bile duct|gallbladder).|(carcinoma|tumor|(malignancy.|tumor)|sarcoma)
Lung cancer	(lung|trachea)|(carcinoma|tumor|(malignancy)|sarcoma)
Breast cancer	(breast|mammary gland|nipple|paramammary areola).|(carcinoma|tumor|(malignancy.|tumor)|sarcoma)
Leukemia (leukemia)	Leukemia bone marrow transplantation leukemia
Uterine cancer	Uterus (carcinoma|tumor|(malignancy)|sarcoma)
Esophageal cancer	(esophagus|esophagus)|(carcinoma|tumor|(malignancy)|sarcoma)
Cervical cancer	Cervix uteri (carcinoma|tumor|(malignant tumor)|sarcoma)
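The keyword extraction and counting of S302 can be illustrated with a minimal sketch. The patterns below are simplified English stand-ins chosen only for illustration and are not the expressions of Table 3; the names FEATURE_PATTERNS and extract_medical_keywords are likewise hypothetical.

```python
import re
from collections import Counter

# Hypothetical, simplified stand-ins for per-disease feature extraction
# regular expressions (not the actual expressions of Table 3).
FEATURE_PATTERNS = {
    "breast cancer": re.compile(r"(breast|mammary gland|nipple)\s*(cancer|tumor|sarcoma)"),
    "lung cancer": re.compile(r"(lung|trachea)\s*(cancer|tumor|sarcoma)"),
    "general oncology": re.compile(r"cancer|tumor|chemotherapy|metastasis"),
}


def extract_medical_keywords(texts):
    """Return {medical keyword: number of occurrences} over all texts."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for pattern in FEATURE_PATTERNS.values():
            for match in pattern.finditer(lowered):
                # The matched string is the keyword; its count is the
                # keyword feature value.
                counts[match.group(0)] += 1
    return dict(counts)


training_texts = ["Cancer effect of breast cancer interventional therapy"]
medical_feature_info = extract_medical_keywords(training_texts)
print(medical_feature_info)
```

In this sketch each pattern is applied independently, so the word "cancer" inside "breast cancer" is also counted by the general pattern; whether such overlaps should be counted is a design choice the description leaves open.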

In one embodiment, if the supervision label includes a label for identifying a target disease, optimizing the first initial model according to the medical class feature information and the supervision label in S303 may be implemented as follows: the medical class feature information is used as an input parameter of the first initial model, and a disease identification result output by the first initial model is obtained; if the disease indicated by the disease identification result does not match the target disease, the first initial model is optimized. Here, the target disease is the disease indicated by the supervision label determined for the target user. For example, if the target user is user A and the supervision label of user A indicates that the user has lung cancer, the target disease is lung cancer.

If the medical class feature information is input into the first initial model and the obtained disease identification result does not match the target disease, the first initial model cannot yet accurately construct a disease label for the target user from that user's medical class feature information; in other words, it has not learned the medical class feature information of the target user. The parameters of the first initial model are then adjusted, and the medical class feature information is input into the adjusted first initial model for identification again, until the disease indicated by the output identification result matches the target disease, which indicates that optimization of the first initial model on the training text set of this target user is complete. The optimized first initial model is then further optimized using the training text set of the next target user in the user set, until it has been optimized on the training text sets of all target users in the user set, yielding the first model.

In another embodiment, if the supervision labels include health labels, the implementation manner of S303 optimizing the first initial model according to the medical class feature information and the supervision labels may be: inputting the medical characteristic information into a first initial model as an input parameter, and acquiring a disease identification result output by the first initial model; if the disease recognition result output by the first initial model does not match the supervision label, that is, the disease recognition result output by the first initial model cannot identify that the target user is a healthy user who does not have any disease, for example, the disease recognition result indicates that the user has diabetes, but the supervision label includes a healthy label, and the actual situation indicates that the target user is healthy, the first initial model is optimized; if the disease recognition result output by the first initial model is matched with the supervision label, namely the disease recognition result output by the first initial model identifies that the target user does not have any disease, the training of the first initial model by using the training text set of the target user is successful.

For the optimization of the second initial model, in S204 the related words included in the training text set are obtained, and the second initial model is optimized based on the related words and the supervision label to obtain the second model. In one embodiment, the related words may include the medical class feature words corresponding to each disease identified by the disease labels in Table 1, as well as terms related to each disease and general medical words, such as hospital, medical treatment, traditional Chinese medicine, health preservation, health, symptom, differentiated adenocarcinoma, survival time, and the like. Optionally, in S204, the second initial model may be optimized according to related word feature information and the supervision label.

Referring to fig. 4, a flowchart of a method for optimizing a second initial model is provided according to an embodiment of the present invention. The method of optimizing the second initial model shown in fig. 4 may comprise: S401, obtaining a sample word feature set of the training text set; S402, screening the sample word feature set to obtain a related word feature set; and S403, optimizing the second initial model according to the related word feature set and the supervision label. In one embodiment, the related word feature set includes related word feature information, the related word feature information includes related words and their related word feature values, and the related word feature values are determined according to the word frequencies of the related words in the corresponding target texts.

In an embodiment, the process of establishing and optimizing the second initial model and the process of establishing and optimizing the first initial model may be implemented based on an XGBoost (extreme Gradient Boosting) algorithm, and the process of establishing and optimizing the first initial model and the second initial model based on the XGBoost algorithm is described in detail below.

The XGBoost algorithm is an improved Boosting algorithm (Boosting being a type of ensemble classifier) built on GBDT (Gradient Boosting Decision Tree); the decision trees used in XGBoost are regression trees. The basic principle of the XGBoost algorithm is to iteratively combine multiple weak classifiers into one strong classifier, with each iteration reducing the residual left by the previous one.
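The residual-reduction idea can be sketched in plain Python. This is not XGBoost itself: it is a minimal gradient boosting loop with one-split regression stumps and squared-error loss, omitting the regularization and second-order terms that XGBoost adds. The toy features (counts of two keywords per user), the labels, and all function names are assumptions made for illustration.

```python
def fit_stump(X, residuals):
    """Find the one-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no valid split found")
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def boost(X, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a weak learner to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, X)]
        history.append(sum((t - p) ** 2 for t, p in zip(y, predictions)))
    return (base, learning_rate, stumps), history


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Toy data: [count of "lung cancer", count of "chemotherapy"] per user;
# label 1 = supervision label indicates lung cancer, 0 = healthy.
X = [[3, 1], [2, 0], [0, 0], [0, 1], [4, 2], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
model, history = boost(X, y)
print(history[0], history[-1])
```

The squared error recorded in history shrinks from round to round, which is the "each iteration reduces the residual of the previous one" behavior described above.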

In one embodiment, the basic principle of optimizing a first initial model based on the XGBoost algorithm to obtain the first model is to construct the first initial model according to each medical class feature information extracted from a training text set of a target user, where the first initial model can be understood as a weak classifier, and then obtain a disease identification result through the identification processing of the medical class feature information by the first initial model; if the target disease identified by the disease identification result does not match the supervision label of the target user, the current weak classifier needs to be iterated. The specific iterative process can be understood as: according to the difference (also called residual) between the disease identification result and the supervision label, adjusting the first initial model parameters, for example, the model parameters may refer to the weight values of the feature information of each medical class, and then establishing a new first initial model based on the adjusted model parameters in the gradient direction capable of reducing the residual, which is also equivalent to the new first initial model after optimizing the first initial model. After each iteration is finished, whether the target disease identified by the disease identification result of the optimized first initial model to the medical characteristic information of the target user is matched with the supervision label or not can be judged, and if the target disease is not matched with the supervision label, the iteration process is repeatedly executed; and if so, finishing the optimization of the first initial model to obtain the first model.

In one embodiment, the basic principle of optimizing the second initial model based on the XGBoost algorithm to obtain the second model is to construct the second initial model according to feature information of each associated word extracted from a training text set of a target user, where the second initial model can be understood as a weak classifier, and then obtain a disease recognition result through recognition processing of the feature information of the associated word by the second initial model; if the target disease identified by the disease identification result does not match the supervision label of the target user, the current weak classifier needs to be iterated. The specific iterative process can be understood as: according to the difference (also called residual) between the disease identification result and the supervision label, adjusting the second initial model parameters, for example, the model parameters may refer to the weight value of each associated word feature information, and then establishing a new second initial model based on the adjusted model parameters in the gradient direction capable of reducing the residual, which is also equivalent to a new second initial model after optimizing the second initial model. After each iteration is finished, whether the target disease identified by the disease identification result of the associated word feature information of the target user by the optimized second initial model is matched with the supervision label or not can be judged, and if the target disease is not matched with the supervision label, the iteration process is repeatedly executed; and if so, finishing the optimization of the second initial model to obtain a second model.

In one embodiment, S401 may be implemented as follows: word segmentation is performed on each text in the training text set, and stop words and function words are removed from the word segmentation results to obtain the sample word feature set of the training text set. As one possible implementation, S401 may segment each text in the training text set using the jieba Chinese word segmentation method, and the segmentation mode may be the full mode. For example, if a text in the training text set is "these foods help prevent lung cancer", the word segmentation result is ['these', 'foods', 'help', 'prevent', 'lung cancer']. After the word segmentation result of the text is obtained, meaningless stop words and function words are removed from it, giving the word feature set ['foods', 'help', 'prevent', 'lung cancer'] for this text. As another possible implementation, S401 may also process each text in the training text set using word segmentation combined with one-hot encoding, that is, a bag-of-words model.
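The segmentation and stop-word removal of S401 can be sketched as follows. Because jieba is a third-party library, this sketch substitutes a small dictionary-based longest-match segmenter over English tokens as a stand-in; the phrase dictionary, the stop-word list, and the function names are all hypothetical.

```python
# Hypothetical stop-word list and multi-word phrase dictionary.
STOP_WORDS = {"these", "the", "a", "of", "is"}
PHRASES = {("lung", "cancer"), ("breast", "cancer")}
MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASES)


def segment(text):
    """Greedy longest-match segmentation against the phrase dictionary."""
    tokens = text.lower().split()
    words, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; fall back to a single token.
        for size in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if size == 1 or candidate in PHRASES:
                words.append(" ".join(candidate))
                i += size
                break
    return words


def sample_words(text):
    """Segment a text and drop stop words to get its sample words."""
    return [word for word in segment(text) if word not in STOP_WORDS]


text = "These foods help prevent lung cancer"
print(segment(text))
print(sample_words(text))
```

With a real Chinese corpus, the segment function would be replaced by a call into the chosen segmentation library; the stop-word filtering step stays the same.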

In one embodiment, in S401, after word segmentation is performed on each text in the training text set to obtain the sample word feature set, the feature value of each sample word in the sample word feature set is further obtained. The sample word feature value reflects the importance level of the sample word in the training text set. Optionally, the intelligent device may obtain each sample word feature value through the TF-IDF (term frequency-inverse document frequency) algorithm. TF-IDF is a statistical method for evaluating the importance of a word to a piece of text, and it can express each sample word in the sample word feature set as a number. In one embodiment, the feature value of each sample word in the sample word feature set may be obtained as follows: determining the target text to which a target sample word in the sample word feature set belongs, the target text being a text in the training text set; calculating the word frequency of the target sample word in the target text; calculating the inverse document frequency of the target sample word in the training text set; and obtaining the sample word feature value of the target sample word from its word frequency and inverse document frequency.

In one embodiment, calculating the word frequency of the target sample word in the target text comprises: counting the sample words (including the target sample word) in the target text and the number of occurrences of each, so as to obtain the total number of occurrences of all sample words in the target text; and calculating the word frequency of the target sample word as the ratio of its number of occurrences in the target text to that total. For example, suppose the target sample word is "lung cancer" and the target text is "these foods help prevent lung cancer". The word feature set obtained after word segmentation of the target text is ['foods', 'help', 'prevent', 'lung cancer']. Each of the four sample words appears once, so the total number of sample word occurrences in the target text is 4; "lung cancer" appears once, so its word frequency is 1/4.

In one embodiment, calculating the inverse document frequency of the target sample word in the training text set comprises: determining the number of texts in the training text set that contain the target sample word, and calculating the inverse document frequency according to the formula log(number of all texts / (number of texts containing the target sample word + 1)). The number of all texts is the number of texts in the training text set, the number of texts containing the target sample word is the number of those texts in which the target sample word appears, and 1 is added to ensure that the denominator is not 0. For example, if the training text set obtained from the internet data of a target user includes 1000 texts, and "lung cancer" appears in 20 of them, the inverse document frequency is log(1000/21).

And after the word frequency and the inverse document frequency of a certain target sample word in the target text are obtained, multiplying the word frequency of the target sample word appearing in the target text by the inverse document frequency in the training text set to obtain a sample word characteristic value of the target sample word. It is to be understood that the sample word feature values of other sample words in the sample word feature set may be obtained by the same method as the method for calculating the sample word feature value of the target sample word.
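The word frequency, inverse document frequency, and their product can be sketched with the numbers from the examples above. The base of the logarithm is not stated in the description; the natural logarithm is assumed here, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(word, text_words):
    """Occurrences of `word` divided by all sample-word occurrences in the text."""
    counts = Counter(text_words)
    return counts[word] / sum(counts.values())


def inverse_document_frequency(num_texts, num_texts_with_word):
    """log(N / (n + 1)); the +1 keeps the denominator non-zero.

    Natural logarithm is an assumption; the description does not fix the base.
    """
    return math.log(num_texts / (num_texts_with_word + 1))


def sample_word_feature_value(word, text_words, num_texts, num_texts_with_word):
    """Sample word feature value = word frequency * inverse document frequency."""
    return (term_frequency(word, text_words)
            * inverse_document_frequency(num_texts, num_texts_with_word))


target_text_words = ["foods", "help", "prevent", "lung cancer"]
tf = term_frequency("lung cancer", target_text_words)   # 1/4
idf = inverse_document_frequency(1000, 20)              # log(1000/21)
value = sample_word_feature_value("lung cancer", target_text_words, 1000, 20)
print(tf, idf, value)
```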

In one embodiment, the S402 screening the sample word feature set may include a primary screening and a secondary screening. The preliminary screening may refer to deleting sample words which are meaningless to the training text set in the sample word feature set, where the meaningless sample words may be sample words which appear too many times or appear too few times in the training text set; the secondary screening may refer to selecting a preset number of sample words with higher importance levels from a set of screened sample words obtained after the primary screening. In one embodiment, the step S402 of screening the sample word feature set to obtain a related word feature set includes: performing initial screening on the sample word feature set to obtain a screened sample word set; sequencing all sample words in the screened sample word set according to the importance levels in the training set, and selecting the top N sample words as associated words according to a sequencing result, wherein N is a positive integer greater than 1; and generating a related word feature set according to the obtained related words.

In one embodiment, the initial screening of the sample word feature set includes deleting first-type sample words and/or second-type sample words. The first type comprises sample words whose word frequency in the training text set is greater than a first word frequency threshold, or whose number of occurrences in the training text set is greater than a first count threshold. The second type comprises sample words whose word frequency in the training text set is smaller than a second word frequency threshold, or whose number of occurrences in the training text set is smaller than a second count threshold. In other words, the intelligent device may preset sample word screening conditions and delete the sample words in the sample word feature set that do not meet them. The first word frequency threshold is greater than the second word frequency threshold, and the first count threshold is greater than the second count threshold; for example, the first word frequency threshold may be 60% and the second word frequency threshold may be 1%.

In one embodiment, the basis for deleting the first-type and/or second-type sample words may be the word frequency of the sample word in the training text set. The word frequency of a sample word in the training text set is the ratio of the number of times that sample word appears in the training text set to the total number of times all sample words appear in the training text set. For example, suppose the sample word is "application", and the sample words in the training text set and their numbers of occurrences are (application 10, requirement 2, threshold 3, scheme 4). The total number of sample word occurrences is 10 + 2 + 3 + 4 = 19, and "application" appears 10 times, so the word frequency of this sample word is 10/19.

In one embodiment, the basis for deleting the first-type and/or second-type sample words may also be the number of times the sample words appear in the training text set; sample words that appear too often or too rarely can be regarded as meaningless for the analysis of the training text set. For example, words such as "yes" and "excellent" serve as verbs or modifiers in a text and contribute little to its analysis, yet they appear in almost every text and therefore occur very many times in the training text set. Such frequently occurring words are deleted to improve the efficiency of model training and model recognition. Similarly, sample words that occur only rarely in the training text set contribute little to its analysis and can also be deleted.

In one embodiment, in S402, the sorting of each sample word in the filtered sample word set according to the importance level in the training set, selecting the first N sample words as related words according to the sorting result, and generating a related word feature set according to the obtained related words may be understood as sorting according to sample word feature values of each sample word in the filtered sample word set from large to small, then selecting the first N sample words as related words, using sample word feature values of the first N sample words as related word feature values corresponding to the related words, and finally generating a related word feature set according to the related words and the related word feature values.
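The two screening stages of S402 can be sketched as a frequency filter followed by a top-N selection. The 60% and 1% thresholds come from the example above; the sample words, their counts and feature values, and the function name are made up for illustration.

```python
def screen_sample_words(feature_values, occurrence_counts, top_n,
                        high_threshold=0.60, low_threshold=0.01):
    """Initial screening by word frequency, then keep the top-N feature values."""
    total = sum(occurrence_counts.values())
    kept = {}
    for word, value in feature_values.items():
        frequency = occurrence_counts[word] / total
        # Initial screening: drop words that are too frequent or too rare.
        if frequency > high_threshold or frequency < low_threshold:
            continue
        kept[word] = value
    # Secondary screening: sort by feature value, largest first, keep top N.
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:top_n])


# Hypothetical occurrence counts over the whole training text set.
occurrence_counts = {"is": 700, "lung cancer": 120, "chemotherapy": 100,
                     "hospital": 75, "zebra": 5}
# Hypothetical sample word feature values (e.g. from TF-IDF).
feature_values = {"is": 0.01, "lung cancer": 0.9, "chemotherapy": 0.7,
                  "hospital": 0.3, "zebra": 0.8}

related_word_features = screen_sample_words(feature_values, occurrence_counts, top_n=2)
print(related_word_features)
```

Here "is" is removed as too frequent and "zebra" as too rare, even though "zebra" has a high feature value, which is why the frequency filter runs before the ranking.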

As can be seen from the above description, after word segmentation is performed on each text in the training text set in S401 to obtain the sample word feature set, the feature value of each sample word can be obtained; that is, each sample word in the sample word feature set corresponds to one sample word feature value, which represents the importance level of the sample word in the training text set. When optimizing the second initial model, if the supervision label includes a disease label for identifying a target disease, optimizing the second initial model according to the related word feature set and the supervision label in S403 includes: taking the related word feature information included in the related word feature set as an input parameter of the second initial model, and obtaining the disease identification result output by the second initial model; and, if the disease indicated by the disease identification result does not match the target disease, optimizing the second initial model. If the disease indicated by the disease identification result matches the target disease, the second initial model can accurately identify the disease label of the target user from that user's training text set; that is, the second initial model has been successfully optimized on the training text set of this target user. The second initial model is then further optimized using the training text sets of the other target users in the user set, until it has been successfully optimized on the training text sets of all target users in the user set, yielding the second model.

For example, suppose the supervision label of target user A is leukemia. The related word feature set is obtained by performing word segmentation, screening, and the other steps on the training text set of target user A, and the related word feature information included in that set is input into the second initial model as an input parameter. If the disease indicated by the output result is leukemia, the second initial model can correctly identify the data of user A, and no optimization for user A is needed. If the disease indicated by the output result is not leukemia and therefore does not match the supervision label, the second initial model cannot correctly identify the data of user A or determine the correct disease label for user A, and its parameters need to be adjusted to optimize it. In one embodiment, adjusting the parameters of the second initial model may refer to adjusting the weight value of each piece of related word feature information; in other embodiments, it may also refer to adjusting the maximum depth of each tree in the second initial model, and so on.

In another embodiment, if the supervision label includes a health label, optimizing the second initial model based on the related word feature set and the supervision label includes: taking the related word feature information included in the related word feature set as an input parameter of the second initial model, and obtaining the disease identification result output by the second initial model; if the disease identification result is inconsistent with what the supervision label indicates, for example, the disease identification result indicates that the target user has a certain disease while the supervision label of the target user is a health label, the second initial model needs to be optimized. That is, when the target user is a healthy user, the second initial model is trained using the same procedure as when the target user is a diseased user.

In the method for constructing a medical model shown in fig. 2, after the first model and the second model are obtained through S203 and S204, a disease label model is constructed according to the first model and the second model in S205. In one embodiment, after the disease label model is constructed, it can be verified. If the verification passes, the disease label model has been constructed successfully; if the verification fails, the construction has failed, and steps S201-S205 can be executed again to optimize the disease label model.

In one embodiment, the disease label model may be verified as follows: selecting diseased users from the patients registered at the departments of offline hospitals, selecting the same number of healthy users, and taking the diseased users and the healthy users together as the users to be verified; acquiring the internet data of each user to be verified; and calling the disease label model to identify the internet data of each user to be verified, obtaining a disease label for each of them. The disease label of each user to be verified is then matched against that user's supervision label, and a successful match indicates a successful identification. If the identification success rate exceeds a preset value, the verification succeeds. Otherwise, the verification fails, and steps S201-S205 need to be executed again to optimize the disease label model.

For example, suppose 10000 users registered for offline oncology treatment and 10000 healthy users are selected as the users to be verified; if the recognition success rate reaches 73%, the training of the disease label model is considered successful.
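The verification step amounts to comparing predicted disease labels with supervision labels and checking the success rate against a preset value. The sketch below uses four made-up users and a hypothetical preset value of 0.7; the description does not state the actual preset value, and the function names are illustrative.

```python
def recognition_success_rate(predicted_labels, supervision_labels):
    """Fraction of users whose predicted disease label matches the supervision label."""
    matches = sum(1 for user, label in supervision_labels.items()
                  if predicted_labels.get(user) == label)
    return matches / len(supervision_labels)


def verification_passed(predicted_labels, supervision_labels, preset_value):
    """Verification succeeds when the success rate exceeds the preset value."""
    return recognition_success_rate(predicted_labels, supervision_labels) > preset_value


# Made-up users to be verified: two diseased users and two healthy users.
supervision_labels = {"u1": "lung cancer", "u2": "leukemia",
                      "u3": "healthy", "u4": "healthy"}
# Made-up output of the disease label model for the same users.
predicted_labels = {"u1": "lung cancer", "u2": "lung cancer",
                    "u3": "healthy", "u4": "healthy"}

rate = recognition_success_rate(predicted_labels, supervision_labels)
passed = verification_passed(predicted_labels, supervision_labels, preset_value=0.7)
print(rate, passed)
```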

In summary, in the embodiment of the present invention, after the supervision label is determined for the target user, the internet data corresponding to the user identification code of the target user is obtained as sample data. Medical class feature information is searched from the sample data, and the first sub-model is optimized based on the found medical class feature information and the supervision label of the target user. Meanwhile, word segmentation is performed on the text corresponding to the sample data to obtain related words, and the second sub-model is optimized based on the related words and the supervision label of the target user. Because the first sub-model and the second sub-model are each optimized with the internet data and the supervision label of the target user, the disease label model obtained from them can have higher accuracy and wider coverage, which improves the accuracy of the disease label model when classifying and predicting diseases from new internet data.

Fig. 5 is a schematic flow chart of a disease label constructing method according to an embodiment of the present invention. The disease label construction method shown in fig. 5 is to identify internet data of a user to be detected based on the disease label model obtained by the medical model construction method shown in fig. 2, so as to obtain a disease label of the user to be detected. The label construction method shown in fig. 5 can be applied to many industries, such as the intelligent medical industry, and after the disease label of the user is constructed, an appropriate treatment scheme can be recommended for the user.

In the disease label construction method shown in fig. 5, first, internet data of a user to be detected is acquired in S501. In one embodiment, the manner of acquiring the internet data of the user to be detected is as follows: determining the network identification code of the user to be detected, and acquiring the internet data associated with the network identification code of the user to be detected. Optionally, the internet data may include any one or more of internet reading data, information attention data, information distribution data, and keyword search results. The network identification code may be an account number registered by a certain user on the internet, for example, an account number of the user at a certain medical website, or may be an identifier such as an IMEI or a telephone number of the user.

After the internet data of the user to be detected is acquired, in S502, medical characteristic information is determined from the internet data of the user to be detected, and the medical characteristic information is input into a first model in the disease label model for identification, so as to obtain a first identification result. In one embodiment, the medical-class feature information includes medical-class keywords and corresponding keyword feature values, and the medical-class keywords and the corresponding keyword feature values are input into the first model, which identifies the medical-class keywords and the corresponding keyword feature values. The specific implementation of the first model may refer to the description of relevant contents in the above embodiments.

In one embodiment, when the medical class feature information is input into the first model in the disease label model for recognition, a first initial recognition result can be obtained first, and the first recognition result is then determined based on the first initial recognition result. Optionally, the first initial recognition result may indicate the diseases the user may have and the probability of each, and the first recognition result of the user to be detected may be determined based on these probabilities. In one embodiment, the first recognition result may be determined by taking the disease with the highest probability in the first initial recognition result. For example, if the first initial recognition result is (lung cancer 60%, cold 20%, leukemia 5%, ...), lung cancer, which has the highest probability, is taken as the first recognition result of the user to be detected. In another embodiment, a first probability threshold is preset, and the diseases whose probability in the first initial recognition result is greater than the first probability threshold are determined as the first recognition result. For example, if the first initial recognition result is (lung cancer 60%, cold 52%, leukemia 5%, ...) and the first probability threshold is 50%, lung cancer and cold, whose probabilities exceed 50%, are taken as the first recognition result.
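Both ways of deriving a recognition result from an initial recognition result can be sketched in a few lines, using the probabilities from the examples above. Representing the initial recognition result as a dictionary from disease to probability is an assumption, as are the function names.

```python
def most_probable_disease(initial_result):
    """Return the disease with the highest probability."""
    return max(initial_result, key=initial_result.get)


def diseases_above_threshold(initial_result, threshold):
    """Return every disease whose probability is greater than the threshold."""
    return [disease for disease, probability in initial_result.items()
            if probability > threshold]


first_example = {"lung cancer": 0.60, "cold": 0.20, "leukemia": 0.05}
second_example = {"lung cancer": 0.60, "cold": 0.52, "leukemia": 0.05}

print(most_probable_disease(first_example))
print(diseases_above_threshold(second_example, threshold=0.50))
```

The same two functions apply unchanged to the initial recognition result of the second model, with its own probability threshold.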

In S503, relevant word feature information is determined from the internet data of the user to be detected, and the relevant word feature information is input into a second model in the disease tag model for recognition, so as to obtain a second recognition result. In one embodiment, the related word feature information includes related words and related word feature values corresponding to the related words, and the related word feature information is input into the second model for recognition, that is, the second model is used for predicting the probability that the user to be detected may suffer from some diseases. The second model can be implemented as described in the foregoing embodiments.

In one embodiment, the related word feature information is input into the second model of the disease label model for recognition to obtain a second initial recognition result, and the second recognition result is then obtained according to the probabilities of the diseases in the second initial recognition result. The second initial recognition result includes the diseases the user to be detected may have and the probability of each disease. In one embodiment, the disease with the highest probability in the second initial recognition result is determined as the second recognition result. For example, if the second initial recognition result is (lung cancer 50%, cold 18%, leukemia 2%), lung cancer is taken as the second recognition result of the user to be detected. In yet another embodiment, a second probability threshold is preset, and the diseases whose probability in the second initial recognition result is greater than the second probability threshold are determined as the second recognition result. For example, assuming that the second initial recognition result is (lung cancer 10%, cold 55%, leukemia 60%) and that the second probability threshold is 40%, leukemia and cold, whose probabilities exceed 40%, are determined as the second recognition result. Optionally, the first probability threshold and the second probability threshold may be the same or different, and may be chosen according to the recognition accuracy of the first model and the second model: the thresholds may be set larger if the accuracy of the models is higher, and smaller if the accuracy is not sufficiently high.

According to the disease label construction method provided by the embodiment of the invention, the data of the user to be detected needs to be identified and predicted through the first model and the second model respectively to obtain the first identification result and the second identification result, and finally the first identification result and the second identification result are processed in S504 to obtain the disease label of the user to be detected, so that the accuracy of the disease label prediction of the user is ensured. As described above, if the diseases indicated by the first recognition result and the second recognition result are both lung cancer, it can be considered that the user to be detected has lung cancer, and the disease label of lung cancer is directly set for the user to be detected.

In one embodiment, S504 may be implemented as follows: a weighted average is computed over the probability of each disease in the first recognition result and the probability of the corresponding disease in the second recognition result, and the resulting diseases and their probabilities form the disease label of the user to be detected. In one embodiment, the user to be detected has a single disease label; in other embodiments, the disease label of the user to be detected may be a disease label set, that is, the user may have multiple disease labels. If the disease label is a disease label set, all disease labels in the set may be used as the disease labels of the user to be detected; alternatively, the diseases whose probability is greater than a disease probability threshold may be selected from the set as the disease labels of the user to be detected.

For example, if the first recognition result obtained through S501-S503 is lung cancer and the second recognition result is lung cancer, the disease label of the user to be detected is lung cancer. If the first identification result obtained through S501-S503 is (lung cancer 60%, cold 40%), and the second identification result is (lung cancer 40%, cold 70%), processing each disease in the first identification result and the corresponding disease in the second identification result to obtain a disease label set (lung cancer 50%, cold 55%) of the user to be detected, in this case, both the lung cancer and the cold can be used as disease labels of the user to be detected; or setting the disease probability threshold value to be 50%, and selecting cold from the tag set of the user to be detected as the disease tag of the user to be detected.
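The weighted-average processing of S504 can be sketched as follows. The function names `fuse` and `select_labels` and the weight parameter `w_first` are hypothetical; the embodiments do not fix particular weights, and equal weights are assumed here because they reproduce the figures in the example above.

```python
from typing import Dict, List


def fuse(first: Dict[str, float], second: Dict[str, float],
         w_first: float = 0.5) -> Dict[str, float]:
    """Weighted average of per-disease probabilities from the two models.

    A disease missing from one result is treated as probability 0 there
    (an assumption; the embodiments do not cover this case).
    """
    diseases = sorted(set(first) | set(second))
    return {d: w_first * first.get(d, 0.0) + (1.0 - w_first) * second.get(d, 0.0)
            for d in diseases}


def select_labels(label_set: Dict[str, float], threshold: float) -> List[str]:
    """Keep the diseases whose fused probability is greater than the threshold."""
    return [d for d, p in label_set.items() if p > threshold]


# Probabilities in percent, taken from the example above.
first_result = {"lung cancer": 60.0, "cold": 40.0}
second_result = {"lung cancer": 40.0, "cold": 70.0}

label_set = fuse(first_result, second_result)
print(label_set)
print(select_labels(label_set, 50.0))
```

Because the comparison is strict, a fused probability exactly equal to the threshold (lung cancer at 50% here) is not selected, which matches the example's choice of cold alone.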

The first model is trained on relatively specialized medical keywords, such as the names of diseases related to the disease identified by the disease label, and can therefore predict the condition of the user to be detected accurately. However, the first model also has a limitation when it recognizes the internet data of the user to be detected: because the medical keywords used to train the first model are highly specialized, if the feature information extracted from the internet data includes, besides the medical-class feature information, other associated word feature information related to medicine, the first model cannot recognize those associated word features and may even treat them as non-disease features.

Therefore, the relevant word features are identified through the second model in the embodiment of the invention. The second model is obtained by training a large amount of relevant word feature information, namely, compared with the first model, the second model can identify more disease-related features, including medical key word features and relevant word features related to medical categories. Therefore, the first model can identify the more professional medical key word characteristics to ensure the accuracy of the disease identification result, and the second model can identify the more professional medical key word characteristics and the non-professional medical related associated word characteristics to reduce the error rate of the disease identification result. Therefore, the first model and the second model are used for identifying the internet data of the user to be detected at the same time, and then the identification results of the two times are processed to obtain the disease label of the user to be detected, so that the accuracy of the disease label constructed for the user to be detected is fully ensured. Optionally, the processing of the first recognition result and the second recognition result may include summing and averaging, or may be in another processing manner, which is not limited in the embodiment of the present invention.

For example, assume that recognizing the internet data of the user to be detected with the first model yields the first recognition result (lung cancer 60%, cold 40%), that recognizing the same data with the second model yields the second recognition result (lung cancer 42%, cold 70%), and that the preset disease probability threshold is 50%. Averaging the probability of each disease in the first recognition result with that of the corresponding disease in the second recognition result gives the disease labels (lung cancer 51%, cold 55%) of the user to be detected; screening these for probabilities greater than 50% gives the disease label set (lung cancer 51%, cold 55%) of the user to be detected.

Analyzing the above example shows that if only the first model were used to construct the disease label of the user to be detected, and the disease recognition results were screened against the 50% disease probability threshold, the disease label of the user to be detected would not include the cold label; if only the second model were used and its results were screened against the same threshold, the disease label would not include the lung cancer label. In both cases the disease label constructed for the user to be detected would be inaccurate, for the following reasons. For the cold label, although the user to be detected has a cold, the user may have searched the internet for few medical keywords directly related to colds and for more associated words related to colds, so the probability of a cold recognized by the first model is lower than that recognized by the second model. For the lung cancer label, the user to be detected searched the internet for more medical keywords directly related to lung cancer, so the probability of lung cancer recognized by the first model is higher, whereas the probability recognized by the second model is lower because the second model may be influenced by other features in the internet data of the user to be detected.
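The comparison drawn in this analysis can be checked numerically. The sketch below uses the probabilities of the example (first model: lung cancer 60%, cold 40%; second model: lung cancer 42%, cold 70%) with a 50% threshold; the helper name `labels` is an assumption made for illustration.

```python
from typing import Dict, Set

THRESHOLD = 50.0  # disease probability threshold, in percent

first = {"lung cancer": 60.0, "cold": 40.0}   # first recognition result
second = {"lung cancer": 42.0, "cold": 70.0}  # second recognition result


def labels(probs: Dict[str, float], threshold: float = THRESHOLD) -> Set[str]:
    """Diseases whose probability is greater than the threshold."""
    return {d for d, p in probs.items() if p > threshold}


# Simple (unweighted) average of the two recognition results.
averaged = {d: (first[d] + second[d]) / 2 for d in first}

first_only = labels(first)     # misses the cold label
second_only = labels(second)   # misses the lung cancer label
combined = labels(averaged)    # keeps both labels

print(first_only, second_only, combined)
```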

In the embodiment of the invention, after the supervision label is determined for the target user, the internet data corresponding to the user identification of the target user is obtained as the training text set, the medical keywords and some associated words are determined from the training text set to respectively carry out optimization training on the first initial model and the second initial model so as to obtain the first model and the second model, and finally the disease label model is constructed according to the first model and the second model, so that the disease label model can be ensured to have higher accuracy and wider coverage, and the accuracy of the disease label model for carrying out disease estimation on a new user based on the internet data is improved.

Based on the description of the above method embodiments, an embodiment of the present invention further provides a medical model construction apparatus, a schematic block diagram of which is shown in fig. 6. As shown in fig. 6, the medical model construction apparatus in the embodiment of the present invention includes an obtaining unit 601 and a processing unit 602; the apparatus may be disposed in an intelligent device that needs to construct a model.

In one embodiment, the obtaining unit 601 is configured to: acquire internet data associated with a user identification code of a target user; the processing unit 602 is configured to: determine a supervision tag for the target user; determine a training text set of the target user according to the internet data; determine medical keywords from the training text set, and optimize a first initial model based on the medical keywords and the supervision tag to obtain a first model; acquire relevant words included in the training text set, and optimize a second initial model based on the relevant words and the supervision tag to obtain a second model; and construct a disease label model according to the obtained first model and second model.

In one embodiment, the supervision tag comprises a disease label for identifying a target disease, and the processing unit 602 determines medical-class keywords from the training text set and optimizes the first initial model based on the medical-class keywords and the supervision tag by: determining a medical-class feature word set according to the disease label, wherein the medical-class feature word set comprises medical-class feature words related to the target disease identified by the disease label; extracting the medical-class keywords included in the training text set based on the medical-class feature word set, and generating medical-class feature information corresponding to the medical-class keywords; and optimizing the first initial model according to the medical-class feature information and the supervision tag; wherein the medical-class feature information comprises the medical-class keywords and their corresponding keyword feature values, and the keyword feature values represent the importance levels of the medical-class keywords in the training text set.
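The extraction of medical-class keywords and their keyword feature values can be sketched as follows. Two assumptions are made for illustration: the training texts are already segmented into word lists, and the importance level of a keyword is taken to be its share of all medical-class keyword occurrences in the training text set. The function name `medical_features` and the sample feature word set are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def medical_features(texts: Iterable[List[str]],
                     feature_words: Set[str]) -> Dict[str, float]:
    """Map each medical-class keyword found in the texts to a feature value.

    The feature value is the keyword's relative frequency among all
    medical-class keyword occurrences (an assumed importance measure).
    """
    counts = Counter(tok for text in texts for tok in text if tok in feature_words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Hypothetical feature word set for a target disease, and two segmented texts.
feature_words = {"chemotherapy", "nodule"}
training_texts = [
    ["cough", "chemotherapy", "weather"],
    ["chemotherapy", "lung", "nodule"],
]

print(medical_features(training_texts, feature_words))
```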

In one embodiment, said optimizing the first initial model based on said medical class characteristic information and said supervised labels comprises: taking the medical characteristic information as an input parameter of a first initial model, and acquiring a disease identification result output by the first initial model; and if the disease indicated by the disease identification result output by the first initial model does not match the target disease, optimizing the first initial model.
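The rule "optimize the model when the output disease does not match the target disease" can be illustrated with a mistake-driven linear classifier. The embodiment does not prescribe this model type or update rule; the perceptron-style update and the class name `MistakeDrivenModel` below are assumptions chosen only to make the control flow concrete.

```python
from collections import defaultdict
from typing import Dict, List

Features = Dict[str, float]  # keyword -> keyword feature value


class MistakeDrivenModel:
    """Hypothetical linear model updated only when its prediction is wrong."""

    def __init__(self, diseases: List[str]) -> None:
        self.diseases = list(diseases)
        # One weight vector per disease, all weights starting at zero.
        self.weights = {d: defaultdict(float) for d in self.diseases}

    def predict(self, features: Features) -> str:
        """Return the disease with the highest linear score."""
        def score(disease: str) -> float:
            return sum(self.weights[disease][w] * v for w, v in features.items())
        return max(self.diseases, key=score)

    def optimize(self, features: Features, target: str) -> bool:
        """Update the weights if the prediction differs from the target disease.

        Returns True when an update was made, False when the output matched.
        """
        predicted = self.predict(features)
        if predicted == target:
            return False
        for word, value in features.items():
            self.weights[target][word] += value     # pull toward the target
            self.weights[predicted][word] -= value  # push away from the error
        return True


model = MistakeDrivenModel(["cold", "lung cancer"])
sample = {"chemotherapy": 1.0}
print(model.optimize(sample, "lung cancer"))  # mismatch, so the model is updated
print(model.predict(sample))
```

The same loop applies to the second initial model, with associated word feature information as the input parameter.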

In one embodiment, the processing unit 602 acquires the relevant words included in the training text set and optimizes the second initial model based on the relevant words and the supervision tag by: acquiring a sample word feature set of the training text set; screening the sample word feature set to obtain a relevant word feature set; and optimizing the second initial model according to the relevant word feature set and the supervision tag; wherein the relevant word feature set comprises relevant word feature information, and the relevant word feature information comprises relevant words and their corresponding relevant word feature values, each feature value being determined according to the word frequency of the relevant word in its corresponding target text.

In one embodiment, the screening the sample word feature set to obtain a related word feature set includes: performing initial screening on the sample word feature set to obtain a screened sample word set; sequencing all sample words in the screened sample word set according to the importance levels in the training set, and selecting the top N sample words as associated words according to a sequencing result, wherein N is a positive integer greater than 1; and generating a related word feature set according to the obtained related words.

In one embodiment, the initial screening of the sample word feature set includes: deleting the first type of sample words and/or the second type of sample words; the first type of sample word comprises: sample words whose word frequency in the training text set is greater than a first word frequency threshold, or whose number of occurrences in the training text set is greater than a first count threshold; the second type of sample word comprises: sample words whose word frequency in the training text set is less than a second word frequency threshold, or whose number of occurrences in the training text set is less than a second count threshold.
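The screening, ranking and feature generation steps above can be sketched as follows. The sketch assumes that texts are already segmented into word lists, uses raw occurrence counts for the screening thresholds, and ranks by count as a stand-in for the importance level, which the embodiment does not define further. The names `select_associated_words`, `associated_word_features`, `max_count` and `min_count` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def select_associated_words(texts: Iterable[List[str]], n: int,
                            max_count: int, min_count: int) -> List[str]:
    """Initial screening followed by top-N selection.

    Words occurring more than max_count times (first type) or fewer than
    min_count times (second type) are deleted; the rest are ranked by count.
    """
    counts = Counter(tok for text in texts for tok in text)
    kept = {w: c for w, c in counts.items() if min_count <= c <= max_count}
    ranked = sorted(kept, key=lambda w: (-kept[w], w))  # ties broken alphabetically
    return ranked[:n]


def associated_word_features(text: List[str],
                             associated_words: Set[str]) -> Dict[str, float]:
    """Feature value of each associated word: its word frequency in one text."""
    counts = Counter(tok for tok in text if tok in associated_words)
    return {w: c / len(text) for w, c in counts.items()}


corpus = [
    ["the", "cough", "fever"],
    ["the", "cough", "hospital"],
    ["the", "fever", "cough"],
    ["the", "rare"],
]
words = select_associated_words(corpus, n=2, max_count=3, min_count=2)
print(words)
print(associated_word_features(["the", "cough", "fever", "cough"], set(words)))
```

Deleting the most frequent words removes function words such as "the", while deleting the rarest words removes noise that appears too seldom to be informative.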

In one embodiment, the supervision tag comprises a disease label for identifying a target disease, and optimizing the second initial model according to the relevant word feature set and the supervision tag comprises: taking the relevant word feature information included in the relevant word feature set as an input parameter of the second initial model, and obtaining a disease identification result output by the second initial model; and if the disease indicated by the disease identification result output by the second initial model does not match the target disease, optimizing the second initial model.

In one embodiment, the processing unit 602 is further configured to: obtaining, from a medical system, a user identification code of a patient user who is recorded as having a target disease, the patient user being the target user, the supervision tag being a target disease tag.

In the embodiment of the present invention, after the processing unit 602 selects a target user, a supervision tag may be set for the target user; the obtaining unit 601 obtains the internet data corresponding to the user identification code of the target user as a training text set; the processing unit 602 determines medical keywords and associated words from the training text set and uses them to optimize the first initial model and the second initial model respectively, so as to obtain the first model and the second model; and finally the processing unit 602 constructs the disease label model according to the first model and the second model. The disease label model can thereby achieve higher accuracy and wider coverage, which improves the accuracy with which it estimates diseases for a new user based on internet data.

Fig. 7 is a schematic structural diagram of a disease label constructing apparatus according to an embodiment of the present invention. The disease label constructing apparatus shown in fig. 7 may include an obtaining unit 701 and a processing unit 702.

In one embodiment, the obtaining unit 701 is configured to obtain internet data of a user to be detected; the processing unit 702 is configured to: determining medical characteristic information from the internet data of the user to be detected; inputting the medical characteristic information into a first model in a disease label model for recognition to obtain a first recognition result; determining relevant word characteristic information from the internet data of the user to be detected; inputting the relevant word feature information into a second model in the disease label model for recognition to obtain a second recognition result; and processing the first identification result and the second identification result to obtain the disease label of the user to be detected.
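The division of work performed by the processing unit 702 can be sketched as a small pipeline. Everything named here is hypothetical: the class `DiseaseLabelPipeline`, the callable types, and the stub extractors and models used in the usage lines stand in for the trained first and second models, and an unweighted average is assumed for the final processing step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]
Probabilities = Dict[str, float]


@dataclass
class DiseaseLabelPipeline:
    """Wires feature extraction, the two models and the final processing."""
    extract_medical: Callable[[List[str]], Features]
    extract_associated: Callable[[List[str]], Features]
    first_model: Callable[[Features], Probabilities]
    second_model: Callable[[Features], Probabilities]

    def build_label(self, internet_tokens: List[str]) -> Probabilities:
        # First recognition result from medical-class feature information.
        first = self.first_model(self.extract_medical(internet_tokens))
        # Second recognition result from associated word feature information.
        second = self.second_model(self.extract_associated(internet_tokens))
        # Processing step: unweighted average per disease (an assumption).
        diseases = sorted(set(first) | set(second))
        return {d: (first.get(d, 0.0) + second.get(d, 0.0)) / 2 for d in diseases}


# Stub components for demonstration only; real models would be trained.
pipeline = DiseaseLabelPipeline(
    extract_medical=lambda toks: {t: 1.0 for t in toks if t == "chemotherapy"},
    extract_associated=lambda toks: {t: 1.0 for t in toks if t in {"cough", "fever"}},
    first_model=lambda f: {"lung cancer": 60.0 if f else 10.0, "cold": 40.0},
    second_model=lambda f: {"lung cancer": 42.0, "cold": 70.0 if f else 20.0},
)
print(pipeline.build_label(["chemotherapy", "cough", "weather"]))
```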

In an embodiment, the obtaining unit 701 is implemented as: determining a network identification of a user to be detected, and acquiring internet data associated with the network identification of the user to be detected, wherein the internet data comprises any one or more of internet reading data, information attention data, information release data and keyword search results.

In the embodiment of the invention, after the obtaining unit 701 obtains internet data of a user to be detected, the processing unit 702 determines medical characteristic information and associated word characteristic information from the internet data, inputs the medical characteristic information and the associated word characteristic information into the first model and the second model respectively to obtain a first recognition result and a second recognition result, and finally processes the first recognition result and the second recognition result to obtain a disease label of the user to be detected, so that the accuracy of the disease label of the user to be detected can be ensured.

Please refer to fig. 8, which is a schematic structural diagram of an intelligent device according to an embodiment of the present invention. The intelligent device shown in fig. 8 includes one or more processors 801 and one or more memories 802, the processors 801 and the memories 802 being connected by a bus 803, the memories 802 being configured to store a computer program comprising first program instructions or second program instructions, and the processors 801 being configured to execute the first program instructions or the second program instructions stored in the memories 802.

The memory 802 may include volatile memory (volatile memory), such as random-access memory (RAM); the memory 802 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 802 may also comprise a combination of the above-described types of memory.

The processor 801 may be a central processing unit (CPU). The processor 801 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), a generic array logic (GAL), or the like. The processor 801 may also be a combination of the above structures.

In an embodiment of the invention, the memory 802 is configured to store a computer program comprising first program instructions, and the processor 801 is configured to execute the first program instructions stored in the memory 802 to implement the steps of the corresponding method in the above-described embodiment of the method for constructing a medical model.

In one embodiment, the processor 801 is configured to invoke the program instructions for: determining a supervision tag for a target user, and acquiring internet data associated with a user identification code of the target user; determining a training text set of the target user according to the internet data; determining medical keywords from the training text set, and optimizing a first initial model based on the medical keywords and the supervision labels to obtain a first model; acquiring relevant words included in the training text set, and optimizing a second initial model based on the relevant words and the supervision labels to obtain a second model; and constructing a disease label model according to the obtained first model and the second model.

In one embodiment, the supervision tag comprises a disease label for identifying a target disease, and the processor 801 determines medical-class keywords from the training text set and optimizes the first initial model based on the medical-class keywords and the supervision tag by: determining a medical-class feature word set according to the disease label, wherein the medical-class feature word set comprises medical-class feature words related to the target disease identified by the disease label; extracting the medical-class keywords included in the training text set based on the medical-class feature word set, and generating medical-class feature information corresponding to the medical-class keywords; and optimizing the first initial model according to the medical-class feature information and the supervision tag; wherein the medical-class feature information comprises the medical-class keywords and their corresponding keyword feature values, and the keyword feature values represent the importance levels of the medical-class keywords in the training text set.

In one embodiment, the processor 801 optimizes the first initial model based on the medical class feature information and the supervision tag by: taking the medical class feature information as an input parameter of the first initial model, and acquiring a disease identification result output by the first initial model; and if the disease indicated by the disease identification result output by the first initial model does not match the target disease, optimizing the first initial model.

In one embodiment, the processor 801 acquires the relevant words included in the training text set and optimizes the second initial model based on the relevant words and the supervision tag by: acquiring a sample word feature set of the training text set; screening the sample word feature set to obtain a relevant word feature set; and optimizing the second initial model according to the relevant word feature set and the supervision tag; wherein the relevant word feature set comprises relevant word feature information, and the relevant word feature information comprises relevant words and their corresponding relevant word feature values, each feature value being determined according to the word frequency of the relevant word in its corresponding target text.

In one embodiment, the implementation manner of the processor 801 for filtering the sample word feature set to obtain the related word feature set is as follows: performing initial screening on the sample word feature set to obtain a screened sample word set; sequencing all sample words in the screened sample word set according to the importance levels in the training set, and selecting the top N sample words as associated words according to a sequencing result, wherein N is a positive integer greater than 1; and generating a related word feature set according to the obtained related words.

In one embodiment, the processor 801 performs the initial screening of the sample word feature set by: deleting the first type of sample words and/or the second type of sample words; the first type of sample word comprises: sample words whose word frequency in the training text set is greater than a first word frequency threshold, or whose number of occurrences in the training text set is greater than a first count threshold; the second type of sample word comprises: sample words whose word frequency in the training text set is less than a second word frequency threshold, or whose number of occurrences in the training text set is less than a second count threshold.

In one embodiment, the supervision tag comprises a disease label for identifying a target disease, and the processor 801 optimizes the second initial model based on the relevant word feature set and the supervision tag by: taking the relevant word feature information included in the relevant word feature set as an input parameter of the second initial model, and obtaining a disease identification result output by the second initial model; and if the disease indicated by the disease identification result output by the second initial model does not match the target disease, optimizing the second initial model.

In one embodiment, the processor 801 is further configured to: obtaining, from a medical system, a user identification code of a patient user who is recorded as having a target disease, the patient user being the target user, the supervision tag being a target disease tag.

In the intelligent device shown in fig. 8, the memory 802 is configured to store a computer program, the computer program includes second program instructions, and the processor 801 is configured to execute the second program instructions stored in the memory 802, so as to implement the steps of the corresponding method in the above embodiment of the disease label building method.

In one embodiment, the processor 801 is configured to invoke the program instructions for: acquiring internet data of a user to be detected; determining medical characteristic information from the internet data of the user to be detected, and inputting the medical characteristic information into a first model of a disease label model for identification to obtain a first identification result; determining relevant word characteristic information from the internet data of the user to be detected, and inputting the relevant word characteristic information into a second model of the disease label model for recognition to obtain a second recognition result; and processing the first identification result and the second identification result to obtain the disease label of the user to be detected.

In an embodiment, the implementation manner of the processor 801 for acquiring the internet data of the user to be detected is as follows: determining a network identification of a user to be detected, and acquiring internet data associated with the network identification of the user to be detected, wherein the internet data comprises any one or more of internet reading data, information attention data, information release data and keyword search results.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure is intended to be illustrative of only some embodiments of the invention, and is not intended to limit the scope of the invention.

Claims

1. A method of constructing a medical model, comprising:

determining a supervision tag for a target user, and acquiring internet data associated with a user identification code of the target user, wherein the internet data comprises any one or more of article data, information data and search keywords browsed by the target user;

determining medical keywords from the training text set, and optimizing a first initial model based on the medical keywords and the supervision labels to obtain a first model, wherein the medical keywords comprise professional names directly related to preset diseases;

acquiring associated words included in the training text set, and optimizing a second initial model based on the associated words and the supervision labels to obtain a second model, wherein the associated words include professional names, terms and medical general words directly related to the preset diseases;

and constructing a disease label model according to the obtained first model and the second model, wherein the disease label model comprises the first model and the second model.

2. The method of claim 1, wherein the supervision tag comprises a disease label for identifying a target disease, and the determining medical keywords from the training text set and optimizing a first initial model based on the medical keywords and the supervision labels comprises:

determining a medical class feature word set according to the disease label, wherein the medical class feature word set comprises medical class feature words related to the target disease identified by the disease label;

extracting medical class keywords included in the training text set based on the medical class feature word set, and generating medical class feature information corresponding to the medical class keywords;

optimizing a first initial model according to the medical class characteristic information and the supervision label;

wherein the medical class feature information comprises: medical keywords and corresponding keyword feature values, wherein the keyword feature values are used for representing the importance levels of the medical keywords in the training text set.

3. The method of claim 2, wherein said optimizing a first initial model based on said medical class feature information and said supervised tags comprises:

taking the medical characteristic information as an input parameter of a first initial model, and acquiring a disease identification result output by the first initial model;

and if the disease indicated by the disease identification result output by the first initial model does not match the target disease, optimizing the first initial model.

4. The method of claim 1, wherein the acquiring relevant words included in the training text set and optimizing a second initial model based on the relevant words and the supervision labels comprises:

acquiring a sample word feature set of the training text set;

screening the sample word feature set to obtain a relevant word feature set;

optimizing a second initial model according to the relevant word feature set and the supervision label;

wherein the relevant word feature set comprises relevant word feature information, and the relevant word feature information comprises the relevant words and their corresponding relevant word feature values, the relevant word feature values being determined according to the word frequency of the relevant words in the corresponding target texts.

5. The method of claim 4, wherein the screening the sample word feature set to obtain a relevant word feature set comprises:

performing initial screening on the sample word feature set to obtain a screened sample word set;

sorting all sample words in the screened sample word set according to their importance levels in the training text set, and selecting the top N sample words as relevant words according to the sorting result, wherein N is a positive integer greater than 1;

and generating the relevant word feature set according to the obtained relevant words.
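
The ranking and top-N selection of claim 5 can be sketched as follows. This is a minimal illustration, assuming that an importance score per sample word is already available; the function name `top_n_words`, the alphabetical tie-break, and the example scores are hypothetical and do not come from the patent.

```python
def top_n_words(importance, n):
    """Select the n most important sample words.

    importance: dict mapping sample word -> importance level in the training text set.
    n:          number of words to keep; the claim requires an integer greater than 1.
    Ties are broken alphabetically (an assumption, for deterministic output).
    """
    if n <= 1:
        raise ValueError("N must be a positive integer greater than 1")
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Illustrative scores only.
importance = {"insulin": 0.9, "glucose": 0.7, "weather": 0.1, "diet": 0.7}
selected = top_n_words(importance, 3)
# Build the feature set from the selected words and their scores.
feature_set = {word: importance[word] for word in selected}
```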

6. The method of claim 5, wherein the initial screening of the sample word feature set comprises:

deleting first-type sample words and/or second-type sample words;

wherein the first-type sample words comprise: sample words whose word frequency in the training text set is greater than a first word frequency threshold, or whose number of occurrences in the training text set is greater than a first count threshold;

and the second-type sample words comprise: sample words whose word frequency in the training text set is less than a second word frequency threshold, or whose number of occurrences in the training text set is less than a second count threshold.
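
The initial screening of claim 6 amounts to dropping words that are either too common or too rare. The sketch below is one possible reading, assuming that word frequency means a word's share of all tokens and that the second criterion is a raw occurrence count; the function name `initial_screen`, the parameter names, and the threshold values are hypothetical.

```python
from collections import Counter


def initial_screen(texts, max_tf, min_tf, max_count, min_count):
    """Remove over-frequent and over-rare sample words.

    texts: list of tokenized texts (each a list of words).
    Returns a dict mapping each surviving word -> its occurrence count.
    """
    counts = Counter(token for text in texts for token in text)
    total = sum(counts.values())
    kept = {}
    for word, n in counts.items():
        tf = n / total  # word frequency as a share of all tokens (assumption)
        too_common = tf > max_tf or n > max_count   # first-type sample word
        too_rare = tf < min_tf or n < min_count     # second-type sample word
        if not (too_common or too_rare):
            kept[word] = n
    return kept


# Illustrative data and thresholds only.
texts = [["the", "insulin", "the", "glucose"], ["the", "insulin", "rash"]]
screened = initial_screen(texts, max_tf=0.4, min_tf=0.2, max_count=2, min_count=2)
```

Here "the" is removed as too common and "glucose" and "rash" as too rare, leaving only "insulin".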

7. The method of claim 4, wherein the supervision labels comprise a disease label for identifying a target disease, and the optimizing a second initial model according to the relevant word feature set and the supervision labels comprises:

taking the relevant word feature information included in the relevant word feature set as an input parameter of a second initial model, and obtaining a disease identification result output by the second initial model;

and if the disease indicated by the disease identification result output by the second initial model does not match the target disease, optimizing the second initial model.

8. The method of claim 1, wherein prior to determining a supervision label for a target user, the method further comprises:

obtaining, from a medical system, a user identification code of a patient user who is recorded as having a target disease, the patient user being the target user, and the supervision label being a label of the target disease.

9. A medical model construction apparatus, comprising an acquisition unit and a processing unit, wherein:

the acquisition unit is used for acquiring Internet data associated with a user identification code of a target user, wherein the Internet data comprises any one or more of article data, information data and search keywords browsed by the target user;

the processing unit is configured to:

determining a supervision label for a target user;

10. A disease label construction device, comprising an acquisition unit and a processing unit, wherein:

the acquisition unit is used for acquiring the Internet data of a user to be detected, wherein the Internet data comprises any one or more of article data, information data and search keywords browsed by the user to be detected;

the processing unit is configured to:

determining medical characteristic information from the internet data of the user to be detected, wherein the medical characteristic information comprises medical keywords and corresponding keyword characteristic values, the keyword characteristic values are used for representing the importance levels of the medical keywords in the internet data, and the medical keywords comprise professional names directly related to preset diseases;

inputting the medical characteristic information into a first model in a disease label model for identification to obtain a first identification result, wherein the first model is obtained by training based on a training text set and medical keywords included in the training text set;

determining relevant word feature information from the internet data of the user to be detected, wherein the relevant word feature information comprises relevant words and their corresponding relevant word feature values, the relevant word feature values are determined according to the word frequencies of the relevant words in the corresponding target texts, and the relevant words comprise professional names, terms, and general medical words directly related to the preset diseases;

inputting the relevant word feature information into a second model in the disease label model for identification to obtain a second identification result, wherein the second model is obtained by training based on the training text set and relevant words included in the training text set;

processing the first identification result and the second identification result to obtain a disease label of the user to be detected;

wherein the processing the first identification result and the second identification result to obtain the disease label of the user to be detected comprises: performing a weighted average operation on the probability of each disease indicated by the first identification result and the probability of the corresponding disease indicated by the second identification result, the result of the operation being used as the disease label of the user to be detected.
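
The weighted average recited at the end of claim 10 can be written as p(d) = (w1 * p1(d) + w2 * p2(d)) / (w1 + w2) for each disease d. The sketch below is only an illustration: the claim does not state the weights, so `w1` and `w2`, the function name `combine_results`, and the example probabilities are all assumptions.

```python
def combine_results(first, second, w1=0.5, w2=0.5):
    """Weighted average of per-disease probabilities from the two models.

    first, second: dicts mapping disease name -> probability, as produced by
                   the first and second models respectively.
    Only diseases present in both results are combined (an assumption).
    """
    total = w1 + w2
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    shared = first.keys() & second.keys()
    return {d: (w1 * first[d] + w2 * second[d]) / total for d in shared}


# Illustrative probabilities and weights only.
first = {"diabetes": 0.8, "hypertension": 0.2}
second = {"diabetes": 0.6, "hypertension": 0.4}
label = combine_results(first, second, w1=0.75, w2=0.25)
```

With these example inputs the combined probability is about 0.75 for "diabetes" and about 0.25 for "hypertension".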

11. An intelligent device, comprising a processor and a memory for storing a computer program comprising first program instructions, the processor being configured to invoke the first program instructions to perform the method of constructing a medical model according to any one of claims 1 to 8.

12. A computer storage medium, wherein the computer storage medium stores first computer program instructions which, when executed by a processor, perform the method of constructing a medical model according to any one of claims 1 to 8.