CN116598004A

CN116598004A - Prevalence prediction method, prevalence prediction device, computer device, and storage medium

Info

Publication number: CN116598004A
Application number: CN202310869443.3A
Authority: CN
Inventors: 张敏; 李佳玉; 刘奕群; 马少平; 苏航; 张抒扬; 金晔; 张磊
Original assignee: Tsinghua University; Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Current assignee: Tsinghua University; Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Priority date: 2023-07-17
Filing date: 2023-07-17
Publication date: 2023-08-15
Anticipated expiration: 2043-07-17
Also published as: CN116598004B

Abstract

The application relates to a prevalence prediction method, a prevalence prediction device, computer equipment and a storage medium. The method comprises the following steps: extracting a session data set from a historical database of a search engine according to a keyword table; inputting the session data set into a session classification model to perform session classification, so as to obtain a classification result; screening the session data set according to the classification result to obtain a target session data set; and inputting the target session data set into a prediction model to perform prevalence prediction, so as to obtain a prediction result. The method can improve the accuracy of prevalence prediction for rare diseases.

Description

Prevalence prediction method, prevalence prediction device, computer device, and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and apparatus for predicting prevalence, a computer device, and a storage medium.

Background

Rare diseases are a collective term for a group of diseases with extremely low incidence. According to a review study, countries and organizations on average define rare diseases as those with an incidence of fewer than 40 to 50 per 100,000 people. Because rare diseases are difficult to diagnose in the early stages of onset, each rare disease patient takes an average of 6 to 8 years to be diagnosed accurately. The large affected population is difficult to locate, and research on rare diseases is difficult to carry out effectively. Early screening and identification of rare diseases is therefore of great significance to their diagnosis and research.

In the internet era, search engines have become an important way for people to obtain information in daily life, and the disease-related queries that users submit to a search engine can reflect, to some extent, disease conditions and development trends. An early screening and identification method for rare diseases can therefore use search engine query data and predict the population distribution of rare diseases directly from that data.

However, the above method of predicting the prevalence of rare diseases has a problem of low accuracy.

Disclosure of Invention

Based on this, it is necessary to provide a prediction method, apparatus, computer device, and storage medium capable of improving the accuracy of predicting the prevalence of rare diseases, in view of the above-described technical problems.

In a first aspect, the present application provides a method of predicting prevalence. The method comprises the following steps:

extracting a session data set from a historical database of a search engine according to the keyword table;

inputting the session data set into a session classification model to perform session classification, so as to obtain a classification result; the session classification model is used for classifying the intention of the session;

screening the session data set according to the classification result to obtain a target session data set;

and inputting the target session data set into a prediction model to perform prevalence prediction, so as to obtain a prediction result.

In one embodiment, the session classification model includes a long short-term memory sub-model and a multi-layer perceptron sub-model, and inputting the session data set into the session classification model to perform session classification, so as to obtain a classification result, includes:

extracting features of the session data set to obtain session features and query features;

inputting the query features into the long short-term memory sub-model for vector conversion to obtain vector features corresponding to the query features;

and inputting the vector features corresponding to the query features and the session features into the multi-layer perceptron sub-model for classification, so as to obtain a classification result.

In one embodiment, screening the session data set according to the classification result to obtain a target session data set includes:

determining a session belonging to the rare disease-related type and a session belonging to the hot news-related type in the session data set according to the classification result;

the target session data set is constructed from rare disease-related type sessions and hot news-related type sessions.

In one embodiment, extracting a session dataset from a historical database of a search engine according to a keyword table includes:

extracting rare disease-related queries that match the first-level keywords from a historical database of a search engine according to the first-level keywords in the keyword table;

a session dataset is extracted from the historical database according to the rare disease-related queries that match the first level keywords.

In one embodiment, extracting a session dataset from a historical database from rare disease-related queries that match a first level keyword comprises:

extracting other queries adjacent to the rare disease related query from the historical database according to the rare disease related query matched with the first-level keyword;

the session data set is extracted from the historical database according to the rare disease-related query and the other queries adjacent to it.

In one embodiment, the method further comprises:

acquiring an initial keyword table;

and carrying out keyword expansion on the initial keyword table to obtain the keyword table.

In one embodiment, carrying out keyword expansion on the initial keyword table to obtain the keyword table includes:

processing each keyword in the initial keyword table to obtain new keywords; the processing is applied to at least one of the symbol text, letter text, numeric text, abbreviated text and preset characters of each keyword;

and adding the new keywords to the initial keyword table to obtain the keyword table.

In a second aspect, the application further provides a device for predicting the prevalence rate. The device comprises:

the extraction module is used for extracting a session data set from a historical database of the search engine according to the keyword table;

the classification module is used for inputting the session data set into the session classification model to perform session classification to obtain a classification result; the session classification model is used for classifying the intention of the session;

the screening module is used for screening the session data set according to the classification result to obtain a target session data set;

and the prediction module is used for inputting the target session data set into the prediction model to perform prevalence prediction to obtain a prediction result.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:

In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:

According to the prevalence prediction method, device, computer equipment and storage medium described above, a session data set is first extracted from the historical database of a search engine according to the keyword table; the session data set is then input into the session classification model for session classification to obtain a classification result; the session data set is screened according to the classification result to obtain the target session data set; and the target session data set is input into the prediction model for prevalence prediction to obtain the prediction result. Because each session in the session data set is classified by its search intention, the classified session data set falls into three main types of session data: rare disease-related, hot news-related and other intentions. The session data set can then be screened according to the classification result to remove noise caused by queries whose actual intention is unrelated to rare diseases, which further improves the accuracy of rare disease prevalence prediction.

Drawings

FIG. 1 is a diagram of an application environment for a method of predicting prevalence in one embodiment;

FIG. 2 is a flow chart of a method for predicting prevalence in one embodiment;

FIG. 3 is a schematic flow chart of step S202 in the embodiment of FIG. 2;

FIG. 4 is a schematic flow chart of step S203 in the embodiment of FIG. 2;

FIG. 5 is a schematic flow chart of step S201 in the embodiment of FIG. 2;

FIG. 6 is a schematic flow chart of step S502 in the embodiment of FIG. 5;

FIG. 7 is a flow chart of a method for predicting prevalence in another embodiment;

FIG. 8 is a schematic flow chart of step S206 in the embodiment of FIG. 7;

FIG. 9 is a flow chart of a method for predicting prevalence in another embodiment;

FIG. 10 is a block diagram of a device for predicting prevalence in one embodiment;

FIG. 11 is a block diagram of a device for predicting prevalence in one embodiment;

FIG. 12 is a block diagram of a device for predicting prevalence in one embodiment;

FIG. 13 is a block diagram of a device for predicting prevalence in one embodiment;

FIG. 14 is a block diagram of a device for predicting prevalence in one embodiment;

FIG. 15 is a block diagram of a device for predicting prevalence in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

In the internet age, search engines have become an important way for people to obtain information in daily life, and users' disease-related search queries can reflect disease conditions and development trends to a certain extent. Using search engine query data in the early screening and identification of rare diseases therefore makes a preliminary prediction of the population distribution of rare diseases possible. However, the above method of predicting the prevalence of rare diseases has a problem of low accuracy. The present application aims to solve this problem.

Having described the background of the prevalence prediction method provided by the embodiments of the present application, the implementation environment is briefly described below. The prevalence prediction method provided by the embodiments of the application can be applied to the computer device shown in FIG. 1. The computer device comprises a processor, a memory, and a computer program stored in the memory, the processor and the memory being connected through a system bus; when executing the computer program, the processor can perform the steps of the method embodiments described below. Optionally, the computer device may further comprise an input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage medium. The input/output interface of the computer device is used for communicating with an external terminal through a network connection. Optionally, the computer device may be a server, a personal computer, a personal digital assistant, another terminal device such as a tablet computer or a mobile phone, or a cloud or remote server; the embodiments of the present application do not limit the specific form of the computer device.

After the application scenario of the method for predicting the prevalence provided by the embodiment of the present application is described, the method for predicting the prevalence described in the present application is described in detail below.

In one embodiment, as shown in fig. 2, a method for predicting prevalence is provided, and the method is applied to the computer device in fig. 1, for example, and includes the following steps:

S201, extracting a session data set from a historical database of a search engine according to the keyword table.

The keyword table is composed of three levels of keywords extracted for each rare disease according to three specificity levels. The keywords of a rare disease are divided into three levels in descending order of specificity. First-level keywords: the keywords specific to each rare disease, provided by medical specialists, including the Chinese and English names of the disease, its specific variant gene names, specific etiology, specific clinical manifestations, specific auxiliary examinations and specific treatment methods. Second-level keywords: the non-specific etiology, clinical manifestations, auxiliary examinations and treatment methods of each rare disease disclosed in the Chinese Diagnosis and Treatment Guidelines for Rare Diseases (2019). Third-level keywords: the general medical vocabulary disclosed in the medical vocabulary.
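The three-level keyword table can be pictured as a simple per-disease record. The following is a minimal sketch only: the class name `KeywordTable`, its field names, the substring-matching rule and the sample keywords are all illustrative assumptions and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class KeywordTable:
    """Keywords for one rare disease, grouped by specificity level (hypothetical layout)."""
    disease: str
    level1: Set[str] = field(default_factory=set)  # disease-specific terms
    level2: Set[str] = field(default_factory=set)  # non-specific etiology, symptoms, tests
    level3: Set[str] = field(default_factory=set)  # general medical vocabulary

    def level_of(self, query: str) -> Optional[int]:
        """Return the most specific level (1 to 3) whose keyword occurs in the query."""
        text = query.lower()
        for level, words in ((1, self.level1), (2, self.level2), (3, self.level3)):
            if any(word.lower() in text for word in words):
                return level
        return None


# Sample entries are placeholders chosen for illustration only.
table = KeywordTable(
    disease="multiple sclerosis",
    level1={"multiple sclerosis", "oligoclonal"},
    level2={"numbness", "blurred vision"},
    level3={"hospital", "neurology"},
)
```

Checking the levels in order from 1 to 3 means a query that contains both a specific and a general term is attributed to the more specific level.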

The historical databases of the search engine include, but are not limited to, the Baidu database, the Google database, the Sogou database, the Quark database, and the like.

The session data set may include session data generated during the searches of a plurality of users. When the computer device identifies, in the historical database of a search engine, that a user has entered first-level, second-level or third-level keywords related to rare diseases, the session data of that user may include, but is not limited to, the ranking position and web address of each document the user clicked during the query, the time of the user's search, and the region in which the user's IP address is located.

In the embodiment of the application, when the prevalence of a certain rare disease in a certain region during a certain time period needs to be predicted, the computer device can obtain, according to the keyword information in the keyword table, the rare disease-related queries of users that match that keyword information, together with the other queries each user issued before or after the rare disease-related query; the rare disease-related query and the other queries jointly form a search session. At the same time, the document information clicked by the user in each query of the search session, the user information and the like are extracted from the historical database, and together this information forms the search session data set.

S202, inputting the session data set into a session classification model to perform session classification, and obtaining a classification result.

The session classification model is used for classifying the intention of a session, where the intention of a session essentially represents the user's intention in searching for the keywords. The intention of a session may include, but is not limited to, rare disease-related, hot news-related and other intentions.

In the embodiment of the application, after the session data set is acquired, the computer device can input the acquired session data set into the session classification model, and the session classification model classifies each piece of session data in the session data set to obtain its classification result. Optionally, the computer device may first perform feature extraction on the acquired session data set to obtain the features of each piece of session data, and then analyze those features to classify each piece of session data in the session data set and obtain a classification result. For example, if the content of a piece of session data in the session data set includes multiple sclerosis, immune response onset, multiple acute onset, oligoclonal, specific treatment, https://www.so.com/s, 15:31, xx province xx city, then when the session is input into the session classification model, the model may classify the session as rare disease-related according to its content.

S203, screening the session data set according to the classification result to obtain a target session data set.

The target session data set may include, but is not limited to, rare disease-related type sessions and hot news-related type sessions, among others.

In the embodiment of the application, after the classification result of the session data set is obtained, the computer device can screen all session data in the session data set according to the classification result and remove the session data with other intentions to obtain the target session data set. Optionally, the rare disease-related sessions and the hot news-related sessions in the session data set may be selected to form the target session data set.
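The screening step amounts to keeping the sessions whose predicted intent is one of the two retained types. A minimal sketch follows, assuming the classification result is available as one label per session; the label strings and the function name `screen_sessions` are hypothetical.

```python
# Intent categories named in the text; the string values are assumptions.
RARE_DISEASE = "rare_disease"
HOT_NEWS = "hot_news"
OTHER = "other"

KEPT_INTENTS = {RARE_DISEASE, HOT_NEWS}


def screen_sessions(sessions, labels):
    """Keep (session, label) pairs whose label is rare-disease or hot-news related."""
    if len(sessions) != len(labels):
        raise ValueError("sessions and labels must have the same length")
    return [(s, y) for s, y in zip(sessions, labels) if y in KEPT_INTENTS]
```

Keeping the label next to each retained session lets a later step count the two retained types separately.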

S204, inputting the target session data set into a prediction model to perform prevalence prediction, and obtaining a prediction result.

The prediction model is used for predicting the prevalence of a rare disease in a certain region during a certain time period. For example, the prediction model may be a disease-space weighted linear regression model, which may be represented by the following formula (1):

（1）；

where $\hat{y}_{d,l,t}$ denotes the predicted prevalence of rare disease $d$ in region $l$ during time period $t$; the model has three parameters that belong to disease $d$ and three parameters that belong to region $l$; $x^{\mathrm{rare}}_{d,l,t}$ denotes the number of rare disease-related query sessions, and $x^{\mathrm{hot}}_{d,l,t}$ denotes the number of hot news query sessions.
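The text names three parameters per disease, three per region, and two inputs, but does not state here how they combine. The sketch below assumes one possible reading: the disease and region parameters are added to form two slopes and an intercept of an ordinary linear model. That additive form, the class `Params` and the function `predict_prevalence` are assumptions for illustration, not the application's formula.

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Three scalars: slope on rare-disease sessions, slope on hot-news sessions, intercept."""
    w_rare: float
    w_hot: float
    bias: float


def predict_prevalence(disease: Params, region: Params, n_rare: float, n_hot: float) -> float:
    """Linear prediction under the ASSUMED additive combination of parameters."""
    w_rare = disease.w_rare + region.w_rare
    w_hot = disease.w_hot + region.w_hot
    bias = disease.bias + region.bias
    return w_rare * n_rare + w_hot * n_hot + bias
```

Under this reading, a negative slope on the hot-news count would let the model discount query surges driven by news coverage.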

In the embodiment of the application, the computer device builds the rare disease prevalence prediction model as a disease-space weighted linear regression model, and inputs the obtained target session data set and the actual prevalence of a plurality of rare diseases into the model for training to obtain the trained rare disease prevalence prediction model. It should be noted that the actual prevalence of a plurality of rare diseases can be obtained from the rare disease clinical database of a medical institution.

Illustratively, the computer device uses the two-year target session data set of 2016-2017 as the training set, the 2018 target session data set as the validation set, and the 2019 target session data set as the test set, with adaptive moment estimation (Adam) as the optimizer and the root-mean-square error (RMSE) as the training loss function (as shown in formula (2)). Early stopping is used to determine the round at which model training terminates, with the root-mean-square error on the validation set as the stopping criterion: if the root-mean-square error has not decreased for 20 training rounds, training is stopped and the trained prediction model is obtained.

$$RMSE = \sqrt{\frac{1}{D \cdot L \cdot T} \sum_{i=1}^{D} \sum_{j=1}^{L} \sum_{k=1}^{T} \left( y_{d_i,l_j,t_k} - \hat{y}_{d_i,l_j,t_k} \right)^2} \quad (2)$$

where RMSE denotes the root-mean-square error; D, L and T are respectively the number of diseases, regions and time periods in the data set; $d_i$, $l_j$ and $t_k$ respectively denote the i-th disease, the j-th region and the k-th time period; $y_{d_i,l_j,t_k}$ denotes the actual prevalence of the rare disease; and $\hat{y}_{d_i,l_j,t_k}$ denotes the predicted prevalence of the rare disease.
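The loss and the stopping rule described above can be sketched in a few lines. This is a minimal illustration, assuming prevalence values are stored in dictionaries keyed by (disease, region, period) and that the validation loss of each round is already available as a list; the function names are hypothetical, and the optimizer itself is not shown.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error over all (disease, region, period) keys of `actual`."""
    keys = list(actual)
    if not keys:
        raise ValueError("no observations")
    total = sum((actual[k] - predicted[k]) ** 2 for k in keys)
    return math.sqrt(total / len(keys))


def early_stopping_round(val_losses, patience=20):
    """Return the index of the round at which training would stop.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive rounds. If that never happens, the index
    of the last round is returned.
    """
    best = math.inf
    stale = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return len(val_losses) - 1
```

When every combination of disease, region and period is present, the number of keys equals D·L·T, so `rmse` matches the normalisation of the root-mean-square error over the full data set.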

Further, after the trained prediction model is obtained, the target session data set for a certain time period, a certain region and a certain rare disease is input into the trained prediction model for prevalence prediction, yielding the predicted prevalence of that rare disease in that region during that time period.

According to the prevalence prediction method described above, a session data set is first extracted from the historical database of a search engine according to the keyword table; the session data set is then input into the session classification model for session classification to obtain a classification result; the session data set is screened according to the classification result to obtain the target session data set; and the target session data set is input into the prediction model for prevalence prediction to obtain the prediction result. Because each session in the session data set is classified by its search intention, the classified session data set falls into three main types of session data: rare disease-related, hot news-related and other intentions. The session data set can then be screened according to the classification result to remove noise caused by queries whose actual intention is unrelated to rare diseases, which further improves the accuracy of rare disease prevalence prediction.

In one embodiment, on the basis of the embodiment shown in fig. 2, the session classification model includes a long short-term memory sub-model and a multi-layer perceptron sub-model, and the process of inputting the session data set into the session classification model to perform session classification and obtain the classification result is described. As shown in fig. 3, step S202, "inputting the session data set into the session classification model to perform session classification and obtain the classification result", may include the following steps:

S301, extracting features of the session data set to obtain session features and query features.

Session features may include, but are not limited to, the total number of queries in a session, the number of primary queries, the number of disease queries, the number of medical queries, the total number of clicks by the user in the session, the deepest ranking position of the clicked documents, the average ranking position, and so on. Query features may include the query category, the query length, the number of clicks in the query, the frequency of occurrence of the query terms, and so forth. Illustratively, the session features of a single search session in the session data set are represented as a vector s, and the features of all queries in the session are represented as a matrix Q.

In the embodiment of the application, after the session data set is obtained, the computer equipment can perform feature extraction on the obtained session data set, and extract session features and query features of the session data set; optionally, the computer device may perform feature extraction on the obtained session data set through the feature extraction model, so as to extract session features and query features of the session data set; or, the computer device may perform feature extraction on the obtained session data set through the feature extraction network, so as to extract session features and query features of the session data set.
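A few of the session and query features listed above can be computed directly from raw query records. The sketch below is illustrative only: the record layout (a `text` string and a list of `clicked_ranks` per query) and the choice of four session features and two query features are assumptions, and the remaining features named in the text are omitted.

```python
def session_features(session):
    """Build a small session feature vector s from a list of query records."""
    ranks = [rank for query in session for rank in query["clicked_ranks"]]
    return [
        len(session),                                # total number of queries
        len(ranks),                                  # total number of clicks
        max(ranks) if ranks else 0,                  # deepest clicked ranking position
        sum(ranks) / len(ranks) if ranks else 0.0,   # average clicked ranking position
    ]


def query_features(session):
    """Build the query feature matrix Q: one row per query, in session order."""
    return [[len(q["text"]), len(q["clicked_ranks"])] for q in session]


# Placeholder session with two queries.
example_session = [
    {"text": "multiple sclerosis", "clicked_ranks": [1, 3]},
    {"text": "oligoclonal", "clicked_ranks": [2]},
]
s = session_features(example_session)
Q = query_features(example_session)
```

Keeping the rows of Q in session order preserves the query sequence that a sequence model needs.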

S302, inputting the query features into the long short-term memory sub-model for vector conversion to obtain vector features corresponding to the query features.

The long short-term memory (LSTM) sub-model can extract the sequence information of the query features and vector-convert that sequence information to obtain the vector features corresponding to the query features.

In the embodiment of the application, after the query features of the session data set are obtained, the computer device can input the query features into the long short-term memory sub-model, which extracts the sequence information of the query features and performs vector conversion on it to obtain the vector features corresponding to the query features. Illustratively, the query features Q are input into the long short-term memory sub-model, which extracts the sequence information of Q and converts it into the corresponding vector features d. The long short-term memory sub-model can be expressed by the following formula (3):

$$d = \mathrm{LSTM}(Q) \quad (3)$$

where d denotes the vector features corresponding to the query features, Q denotes the query features, and $\mathrm{LSTM}(\cdot)$ denotes vector conversion of the query features Q by the long short-term memory sub-model.
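To make the vector conversion concrete, the following is a minimal standard-library sketch of a single-layer LSTM that reads the rows of Q in order and returns a fixed-length vector. Several points are assumptions rather than statements of the application: the weights are random and untrained, the final hidden state is taken as d, and the class name `TinyLSTM` is hypothetical. A practical system would use a trained model from a deep learning library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


class TinyLSTM:
    """Single-layer LSTM encoder with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        n = input_size + hidden_size
        self.hidden_size = hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def encode(self, Q):
        """Run the rows of Q through the LSTM and return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in Q:
            z = list(x) + h  # current input joined with the previous hidden state
            i = [sigmoid(v) for v in affine(self.W["i"], z, self.b["i"])]
            f = [sigmoid(v) for v in affine(self.W["f"], z, self.b["f"])]
            o = [sigmoid(v) for v in affine(self.W["o"], z, self.b["o"])]
            g = [math.tanh(v) for v in affine(self.W["c"], z, self.b["c"])]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h
```

Because the output length depends only on `hidden_size`, sessions with different numbers of queries map to vectors of the same size, which is what allows d to be spliced with the session features afterwards.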

S303, inputting vector features and session features corresponding to the query features into a multi-layer perceptron sub-model for classification, and obtaining a classification result.

The multi-layer perceptron (Multilayer Perceptron, MLP) sub-model can classify the user intent of the sessions in the session data set.

In the embodiment of the application, after the vector features corresponding to the query features are obtained, the computer device can splice the vector features corresponding to the query features with the session features to obtain a spliced feature vector, and then input the spliced feature vector into the multi-layer perceptron sub-model for classification to obtain the classification result. It should be noted that the classification result may be any one of rare disease-related session data, hot news-related session data, and session data with other intentions. Illustratively, after the vector features d corresponding to the query features are obtained, the computer device may first splice the session features s with d to obtain the spliced feature vector f, and then input f into the multi-layer perceptron sub-model to obtain the classification result. The multi-layer perceptron sub-model may be represented by the following formula (4):

$$h_j^{(l+1)} = \sigma\left( \sum_{i} W_{ij}^{(l)} h_i^{(l)} + b_j^{(l)} \right) \quad (4)$$

where $h_i^{(l)}$ and $h_j^{(l+1)}$ respectively denote the i-th value of layer l and the j-th value of layer l+1; $W_{ij}^{(l)}$ and $b_j^{(l)}$ are respectively the weight and bias parameters of the fully connected layer; and $\sigma$ is a nonlinear activation function. Note that layer 0, i.e. $h^{(0)}$, is the input feature vector, and the last layer, i.e. l = L, gives $h_j^{(L)}$, the predicted value of the classification label for each of the C user intents (j = 0, ..., C-1).
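The splice-then-classify step can be illustrated with a small forward pass through fully connected layers. The sketch assumes ReLU for the hidden layers and softmax for the output layer; the text only requires a nonlinear activation, so both choices, the function names and the hand-picked weights are assumptions.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(f, layers):
    """Apply fully connected layers to the spliced feature vector f.

    `layers` is a list of (W, b) pairs, where W[j][i] is the weight from
    input i to output j. Hidden layers use ReLU and the last layer softmax.
    """
    h = list(f)
    for index, (W, b) in enumerate(layers):
        z = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
        h = softmax(z) if index == len(layers) - 1 else relu(z)
    return h


s = [2.0, 3.0]    # session features (placeholder values)
d = [0.5, -0.5]   # vector features from the sequence encoder (placeholder values)
f = s + d         # spliced feature vector

layers = [
    ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0]),    # 4 inputs -> 2 hidden
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),        # 2 hidden -> 3 intents
]
probs = mlp_forward(f, layers)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The index of the largest output is taken as the predicted intent class, with C = 3 outputs for the three intent types named in the text.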

According to the session data set classification method provided by the embodiment of the application, the session data set is classified according to its features and can then be screened according to the classification result; this provides a basis for removing noise caused by queries whose actual intention is unrelated to rare diseases and for improving the accuracy of rare disease prevalence prediction.

In one embodiment, the process of screening the session data set according to the classification result to obtain the target session data set may be described on the basis of the embodiment shown in fig. 3, and as shown in fig. 4, the step of S203 "screening the session data set according to the classification result to obtain the target session data set" may include the following steps:

S401, determining the sessions belonging to the rare disease-related type and the sessions belonging to the hot news-related type in the session data set according to the classification result.

In the embodiment of the application, after the classification result of the session data set is determined, the computer device can select, according to the classification result, the sessions in the session data set that belong to the rare disease-related type and the sessions that belong to the hot news-related type.

S402, constructing a target session data set according to the rare disease-related type session and the hot news-related type session.

In the embodiment of the application, after the sessions belonging to the rare disease-related type and the sessions belonging to the hot news-related type are selected from the session data set, the computer device can construct the target session data set from them. Illustratively, according to the user intent of each session in the session data set, the number of rare disease-related sessions and the number of hot news-related sessions are counted for each time period, each region and each rare disease. For D rare diseases, L regions and T time periods, the number of rare disease-related sessions is denoted $x^{\mathrm{rare}}_{d_i,l_j,t_k}$ and the number of hot news-related sessions is denoted $x^{\mathrm{hot}}_{d_i,l_j,t_k}$, where D is the number of rare disease types, L is the number of regions, T is the number of time periods, and $d_i$, $l_j$ and $t_k$ respectively denote the i-th disease, the j-th region and the k-th time period; the two counts are respectively the number of rare disease-related sessions and the number of hot news-related sessions for a given rare disease in a given region during a given time period.

According to the session data set screening method provided by the embodiment of the application, the session data set is screened according to the classification result, so that noise caused by queries whose actual intention is unrelated to rare diseases is removed and the accuracy of rare disease prevalence prediction is improved. In addition, rare disease-related sessions and hot news-related sessions are identified separately, which further removes noise from the search session data set and identifies breaking news related to rare diseases; this prevents a surge in news-related queries from inflating the predicted prevalence of a rare disease and further improves the accuracy of rare disease prevalence prediction.
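The counting described above can be sketched with two counters keyed by (disease, region, period). The record fields `disease`, `region`, `period` and `intent`, the intent strings and the function name `count_sessions` are assumptions made for illustration.

```python
from collections import Counter


def count_sessions(target_sessions):
    """Count rare-disease and hot-news sessions per (disease, region, period) key."""
    rare = Counter()
    hot = Counter()
    for session in target_sessions:
        key = (session["disease"], session["region"], session["period"])
        if session["intent"] == "rare_disease":
            rare[key] += 1
        elif session["intent"] == "hot_news":
            hot[key] += 1
    return rare, hot
```

A `Counter` returns 0 for keys that were never seen, so a combination with no sessions of a given type needs no special handling downstream.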

In one embodiment, the process of extracting the session data set from the history database of the search engine according to the keyword table is described on the basis of the embodiment shown in fig. 4. As shown in fig. 5, step S201, "extracting the session data set from the history database of the search engine according to the keyword table", may include the following steps:

S501, extracting rare disease related queries matched with the keywords of the first level from a history database of a search engine according to the keywords of the first level in the keyword list.

Wherein the rare disease related query matching the first level keyword is a query record matching the first level keyword searched in a history database of the search engine.

In this embodiment of the application, when the prevalence of a certain rare disease in a certain region over a certain time period needs to be predicted, the computer device first matches the first-level keywords in the keyword table against the user query records in the history database of the search engine. Once a query matching a first-level keyword has been found, the computer device can go on to extract the candidate session data related to that keyword from the history database.

S502, extracting a session data set from a historical database according to the rare disease related query matched with the first-level keyword.

The session data set is contained in the history database of the search engine and includes the query text entered by the user, the ranking positions of the documents the user clicked, the web addresses of those documents, the time at which the search session took place, and the region in which the user's IP address is located. Illustratively, if a query containing a first-level keyword of a certain rare disease is matched in the historical data of the search engine, all query records issued by the same user within a certain period before and after that query are traversed in order in the history database. Each query record includes the query text entered by the user, the ranking positions of the clicked documents, the web addresses of the documents, the time of the search session, and the region in which the user's IP address is located.

In the embodiment of the application, after the rare disease related query matched with the first-level keyword is obtained, the computer device may extract the session dataset from the historical database according to the rare disease related query matched with the first-level keyword.

Illustratively, as shown in fig. 6, the step S502 "extract a session data set from a history database according to the rare disease related query matched by the first level keyword", includes:

S5021, extracting other queries adjacent to the rare disease related query from the historical database according to the rare disease related query matched with the first-level keywords.

Wherein the other queries adjacent to the rare disease-related query may be queries that the same user issued before or after the rare disease-related query matching the first-level keyword. The computer device finds them by continuing to search the history database of the search engine once that query has been matched.

In the embodiment of the application, after acquiring the rare disease related query matched with the first-level keyword, the computer equipment can continuously extract other queries adjacent to the rare disease related query from the historical database according to the acquired rare disease related query matched with the first-level keyword.

S5022, extracting a session data set from the historical database according to the rare disease related query and the other queries adjacent to the rare disease related query.

In this embodiment of the application, after acquiring the other queries adjacent to the rare disease-related query, the computer device may extract a first candidate session data set from the history database according to those adjacent queries. The first candidate session data set includes, for each adjacent query, the ranking positions of the documents the user clicked, the web addresses of the documents, the time of the search session, the region in which the user's IP address is located, and the like. Further, the computer device may extract a second candidate session data set from the history database according to the rare disease-related query itself, where the second candidate session data set includes the same kinds of information for the rare disease-related query. The computer device then performs a statistical operation on the first candidate session data set and the second candidate session data set to obtain the session data set.

According to the method for acquiring the session data set provided by this embodiment, both the queries matching the first-level keywords in the keyword table and the queries adjacent to them in the history database of the search engine are taken into account. As a result, the session data set retrieved from the history database is more comprehensive, the data to be classified is more complete, and the classification result is more accurate.
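As a rough illustration of steps S501, S5021 and S5022, the following Python sketch groups a query log by user, finds the queries that contain a first-level keyword, and collects the same user's queries within a time window around each match. The `QueryRecord` fields, the substring matching and the 30-minute window are assumptions; the application only speaks of "a certain period of time before and after the query".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class QueryRecord:
    """Hypothetical log record; a real log would also hold clicks and URLs."""
    user_id: str
    text: str
    time: datetime
    region: str


def extract_sessions(
    log: Iterable[QueryRecord],
    first_level_keywords: Sequence[str],
    window: timedelta = timedelta(minutes=30),  # assumed window size
) -> List[List[QueryRecord]]:
    """Return one session per keyword hit: the hit plus the user's adjacent queries."""
    by_user: Dict[str, List[QueryRecord]] = {}
    for rec in log:
        by_user.setdefault(rec.user_id, []).append(rec)

    sessions: List[List[QueryRecord]] = []
    for records in by_user.values():
        records.sort(key=lambda r: r.time)
        used = set()  # indices already assigned to a session
        for i, rec in enumerate(records):
            is_hit = any(k in rec.text for k in first_level_keywords)
            if i in used or not is_hit:
                continue
            # Adjacent queries: same user, within the window before or after the hit.
            members = [
                j for j, other in enumerate(records)
                if j not in used and abs(other.time - rec.time) <= window
            ]
            used.update(members)
            sessions.append([records[j] for j in members])
    return sessions


t0 = datetime(2021, 1, 1, 10, 0)
log = [
    QueryRecord("u1", "headache", t0, "region_1"),
    QueryRecord("u1", "disease_A symptoms", t0 + timedelta(minutes=10), "region_1"),
    QueryRecord("u1", "disease_A treatment", t0 + timedelta(minutes=20), "region_1"),
    QueryRecord("u1", "weather", t0 + timedelta(hours=2), "region_1"),
    QueryRecord("u2", "football", t0 + timedelta(minutes=5), "region_2"),
]
sessions = extract_sessions(log, ["disease_A"])
print([[r.text for r in s] for s in sessions])
```

In this sketch the unrelated "headache" query is pulled into the session because it is adjacent to a keyword hit. Adjacent queries of this kind give the later intent classification step more context to work with.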

In one embodiment, based on the embodiment shown in fig. 2, as shown in fig. 7, the method further includes:

S205, acquiring an initial keyword list.

Wherein the initial keyword table may be formed by extracting, for each rare disease, keywords at three levels that correspond to three levels of specificity to the rare disease. Illustratively, the names of 15 rare diseases are shown in Table 1, and, taking the first 3 of the 15 diseases as an example, the first-level, second-level and third-level initial keywords are shown in Table 2.

TABLE 1

TABLE 2

In the embodiment of the application, the computer equipment can construct an initial keyword list according to the first-level keywords provided by medical specialists, the second-level keywords provided in diagnosis and treatment guidelines and the third-level keywords provided in medical word lists, and when the initial keywords need to be processed, the computer equipment directly acquires the initial keyword list.

In this embodiment of the application, the initial keyword list acquired by the computer device is subsequently used for keyword expansion to obtain the keyword list.

S206, expanding keywords of the initial keyword list to obtain the keyword list.

In the embodiment of the application, after the initial keyword list is obtained, the computer equipment can perform keyword expansion on the initial keyword list to obtain the expanded keyword list. Optionally, the computer device may input the keywords in the initial keyword table into the keyword expansion software to perform keyword expansion, so as to obtain an expanded keyword table.

As shown in fig. 8, the step S206 "performing keyword expansion on the initial keyword table to obtain a keyword table" includes:

S2061, processing each keyword in the initial keyword list to obtain a new keyword.

The processing includes at least one of symbol text processing, letter text processing, number text processing, abbreviation text processing and preset word processing performed on each keyword. In example one, symbol text processing is performed on each keyword: if a keyword contains punctuation marks (such as brackets or quotation marks, entered under either an English or a Chinese input method), the punctuation marks are removed or replaced with spaces. In example two, letter text processing is performed on each keyword: if a keyword contains Greek letters, the Greek letters are replaced with their English names and their common Chinese transliterations (for example, "α" is replaced with "alpha" and "阿尔法"). In example three, number text processing is performed on each keyword: if a keyword contains a number written in any one of Roman numerals, Arabic numerals or Chinese numerals, the variants written in the other two forms are added to the keyword list. In example four, preset word processing is performed on each keyword: if a keyword contains either of the two variant forms of the word "symptom", that form is replaced with the other form. In example five, abbreviation text processing is performed on each keyword: if a keyword consists only of an English abbreviation, the abbreviation-only keyword is deleted, and keywords with medical-related words such as "illness" or "symptom" added are used as new keywords.

In the embodiment of the application, after the initial keyword list is obtained, the computer equipment can expand keywords in the initial keyword list according to the preset symbol text, the letter text, the number text, the abbreviation text and the preset word processing rule to obtain new keywords.

S2062, adding the new keywords into the initial keyword list to obtain the keyword list.

In the embodiment of the application, after the new keyword is obtained, the computer device may add the obtained new keyword to the initial keyword table to obtain the new keyword table. Illustratively, the initial keywords in table 2 are expanded according to the keyword expansion rules described above, resulting in a keyword table as shown in table 3.

According to the method for expanding the initial keyword list provided by this embodiment, the initial keyword list is expanded according to preset processing rules. Even if users omit or mistype characters when querying a rare disease through the search engine, the expanded keyword list can still match the user's intended input, so the prevalence prediction result is more accurate.
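The first three expansion rules (symbol, letter and number text processing) can be sketched in Python as follows. This is only an illustration under stated assumptions: the lookup tables `GREEK_NAMES` and `NUMERAL_FORMS` are deliberately tiny and hypothetical, the keyword strings are placeholders, and the preset word and abbreviation rules are left out.

```python
import re
from typing import Set

# Illustrative, deliberately incomplete lookup tables (assumptions, not from the source).
GREEK_NAMES = {"α": ("alpha", "阿尔法"), "β": ("beta", "贝塔")}
NUMERAL_FORMS = [("1", "I", "一"), ("2", "II", "二"), ("3", "III", "三")]
PUNCTUATION = re.compile(r'[()\[\]"“”‘’（）]')


def _standalone(form: str) -> re.Pattern:
    """Match a numeral only when it is not glued to other ASCII letters or digits."""
    return re.compile(r"(?<![A-Za-z0-9])" + re.escape(form) + r"(?![A-Za-z0-9])")


def expand_keyword(keyword: str) -> Set[str]:
    """Return the keyword together with the variants produced by three rules."""
    variants = {keyword}

    # Rule 1 (symbol text): remove punctuation, or replace it with a space.
    if PUNCTUATION.search(keyword):
        variants.add(PUNCTUATION.sub("", keyword).strip())
        variants.add(" ".join(PUNCTUATION.sub(" ", keyword).split()))

    # Rule 2 (letter text): replace Greek letters with English and Chinese names.
    for base in list(variants):
        for letter, names in GREEK_NAMES.items():
            if letter in base:
                variants.update(base.replace(letter, name) for name in names)

    # Rule 3 (number text): add the other two numeral forms.
    for base in list(variants):
        for forms in NUMERAL_FORMS:
            for form in forms:
                pattern = _standalone(form)
                if pattern.search(base):
                    variants.update(
                        pattern.sub(other, base) for other in forms if other != form
                    )
    return variants


numeric_variants = expand_keyword("disease (type 2)")
greek_variants = expand_keyword("α-X deficiency")
print(sorted(numeric_variants))
print(sorted(greek_variants))
```

The sketch returns a set so that duplicate variants collapse automatically. The result could then be merged into the initial keyword list as in step S2062.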

TABLE 3

In one embodiment, the accuracy of the prediction result is calculated from the predicted prevalence of a rare disease in a certain region over a certain time period and the actual prevalence of that rare disease in the same region over the same time period. It should be noted that the metrics for the accuracy of the prediction result may include the Root Mean Square Error (RMSE), shown in the following formula (5), and the Relative Error Rate (RER), shown in the following formula (6):

$$\mathrm{RMSE}=\sqrt{\frac{1}{D\,L\,T'}\sum_{i=1}^{D}\sum_{j=1}^{L}\sum_{k=1}^{T'}\left(y_{ijk}-\hat{y}_{ijk}\right)^{2}} \tag{5}$$

$$\mathrm{RER}=\frac{1}{D\,L\,T'}\sum_{i=1}^{D}\sum_{j=1}^{L}\sum_{k=1}^{T'}\frac{\left|y_{ijk}-\hat{y}_{ijk}\right|}{y_{ijk}} \tag{6}$$

wherein RMSE denotes the root mean square error; D, L and T' denote the number of diseases, regions and time periods (e.g. quarters) in the data set; i, j and k (with $1\le i\le D$, $1\le j\le L$, $1\le k\le T'$) index the i-th disease, the j-th region and the k-th time period; $y_{ijk}$ denotes the actual prevalence of the rare disease; $\hat{y}_{ijk}$ denotes the predicted prevalence of the rare disease; and RER denotes the relative error rate.
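The two accuracy metrics can be computed with a few lines of Python. The sketch below assumes that the RER is the mean of the absolute error divided by the actual prevalence, which is one common reading of "relative error rate". The dictionary layout keyed by (disease, region, period) and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (disease, region, period); illustrative layout


def rmse(actual: Dict[Key, float], predicted: Dict[Key, float]) -> float:
    """Root mean square error over all (disease, region, period) cells."""
    squared = [(actual[k] - predicted[k]) ** 2 for k in actual]
    return math.sqrt(sum(squared) / len(squared))


def relative_error_rate(actual: Dict[Key, float], predicted: Dict[Key, float]) -> float:
    """Mean absolute error relative to the actual value (assumed RER definition)."""
    ratios = [abs(actual[k] - predicted[k]) / actual[k] for k in actual]
    return sum(ratios) / len(ratios)


# Made-up prevalence values, not results from the application.
actual = {("d1", "r1", "q1"): 4.0, ("d1", "r1", "q2"): 2.0}
predicted = {("d1", "r1", "q1"): 3.0, ("d1", "r1", "q2"): 4.0}

print(rmse(actual, predicted))
print(relative_error_rate(actual, predicted))
```

With these toy values the squared errors are 1 and 4, so the RMSE is the square root of 2.5, and the relative errors are 0.25 and 1.0, so the RER is 0.625.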

TABLE 4

Two other methods for predicting the prevalence of rare diseases are described below for comparison. First, general linear regression based on sessions: this method applies general linear regression directly to the number of rare disease-related sessions obtained in the above embodiments. Second, weighted linear regression based on queries: this method uses the number of rare disease-related queries obtained in the above embodiments and applies the subsequent weighted linear regression directly, without first removing the noise of irrelevant queries through the user intent prediction model. The method of the present scheme is compared with these two methods, and the prediction accuracy and its confidence intervals are shown in Table 4. It should be noted that the method of the present scheme is referred to as session-based weighted linear regression in Table 4.

As can be seen from Table 4, the session-based weighted linear regression method greatly improves the accuracy of rare disease prevalence prediction; that is, the prevalence determined by the prediction method provided by this scheme is more accurate.
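To make the difference between general and weighted linear regression concrete, here is a minimal single-feature sketch in plain Python. It is not the prediction model of the application: how the weights are chosen is not stated in this section, so the weights below are placeholders, and the x and y values are made-up numbers standing in for session counts and known prevalence.

```python
from typing import Sequence, Tuple


def weighted_linear_fit(
    x: Sequence[float], y: Sequence[float], w: Sequence[float]
) -> Tuple[float, float]:
    """Fit y = intercept + slope * x by minimising sum(w * residual**2)."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: the last point is an outlier, e.g. a count inflated by noise.
counts = [1.0, 2.0, 3.0, 4.0]
prevalence = [3.0, 5.0, 7.0, 100.0]

# Equal weights reduce to ordinary (general) linear regression.
plain = weighted_linear_fit(counts, prevalence, [1.0, 1.0, 1.0, 1.0])
# Placeholder weights that switch the outlier off entirely.
weighted = weighted_linear_fit(counts, prevalence, [1.0, 1.0, 1.0, 0.0])
print(plain, weighted)
```

With equal weights the outlier pulls the fitted slope far from the trend of the other three points. With the outlier weighted to zero, the fit recovers the line through those three points exactly.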

Combining all of the above embodiments, as shown in fig. 9, a complete method for predicting prevalence is provided, which includes:

S10, acquiring an initial keyword list;

S11, processing each keyword in the initial keyword list to obtain a new keyword;

S12, adding the new keywords into the initial keyword list to obtain the keyword list;

S13, extracting rare disease related queries matched with the first-level keywords from a historical database of the search engine according to the first-level keywords in the keyword list;

S14, extracting other queries adjacent to the rare disease related query from the historical database according to the rare disease related query matched with the first-level keyword;

S15, extracting a session data set from the historical database according to the rare disease related query and the other queries adjacent to the rare disease related query;

S16, extracting features of the session data set to obtain session features and query features;

S17, inputting the query features into the long short-term memory sub-model for vector conversion to obtain vector features corresponding to the query features;

S18, inputting the vector features corresponding to the query features and the session features into the multi-layer perceptron sub-model for classification, and obtaining a classification result;

S19, determining the sessions belonging to the rare disease related type and the sessions belonging to the hot news related type in the session data set according to the classification result;

S20, constructing a target session data set according to the rare disease related type sessions and the hot news related type sessions;

S21, inputting the target session data set into a prediction model to perform prevalence prediction, and obtaining a prediction result.
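The flow of steps S10 to S21 can be summarised as a small orchestration function. The sketch below is only a structural illustration: every stage is passed in as a hypothetical callable (`expand`, `extract`, `classify`, `predict`), the label strings are assumptions, and the toy stand-ins in the usage part bear no relation to the actual LSTM, multi-layer perceptron or prediction model.

```python
from typing import Any, Callable, FrozenSet, Iterable, List, Sequence, Set, Tuple


def run_pipeline(
    initial_keywords: Sequence[str],
    history: Any,
    expand: Callable[[str], Iterable[str]],
    extract: Callable[[Any, Set[str]], List[Any]],
    classify: Callable[[Any], str],
    predict: Callable[[List[Tuple[Any, str]]], Any],
    kept_labels: FrozenSet[str] = frozenset({"rare_disease", "hot_news"}),
) -> Any:
    """Chain the stages; all callables and label names are placeholders."""
    # S10-S12: build the keyword list from the initial list plus expansions.
    keywords: Set[str] = set(initial_keywords)
    for kw in initial_keywords:
        keywords |= set(expand(kw))
    # S13-S15: extract the session data set from the history database.
    sessions = extract(history, keywords)
    # S16-S18: classify the intent of each session.
    labelled = [(s, classify(s)) for s in sessions]
    # S19-S20: keep only rare disease-related and hot news-related sessions.
    target = [(s, label) for s, label in labelled if label in kept_labels]
    # S21: hand the target session data set to the prediction model.
    return predict(target)


# Toy stand-ins, purely to show the data flow.
history = ["disease_A symptom", "disease_A news", "disease_A football", "weather"]
result = run_pipeline(
    initial_keywords=["disease_A"],
    history=history,
    expand=lambda kw: [kw.lower()],
    extract=lambda h, kws: [q for q in h if any(k in q for k in kws)],
    classify=lambda s: "hot_news" if "news" in s
    else ("rare_disease" if "symptom" in s else "other"),
    predict=lambda target: len(target),
)
print(result)
```

Passing the stages in as callables keeps the sketch independent of any particular model implementation, which mirrors the module split of the device described later in the specification.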

It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times. Nor are these sub-steps or stages necessarily performed one after another; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, an embodiment of the application further provides a prevalence prediction device for implementing the prevalence prediction method described above. The solution provided by the device is implemented in a manner similar to that described for the method. Therefore, for the specific limitations in the one or more embodiments of the prevalence prediction device provided below, reference may be made to the limitations of the prevalence prediction method above, which are not repeated here.

In one embodiment, as shown in fig. 10, there is provided a prediction apparatus of prevalence, comprising: extraction module 10, classification module 20, screening module 30, and prediction module 40, wherein:

The extraction module 10 is configured to extract a session data set from a history database of the search engine according to the keyword table.

The classification module 20 is configured to input the session data set into a session classification model for session classification, so as to obtain a classification result; the session classification model is used to classify the intent of the session.

The screening module 30 is configured to screen the session data set according to the classification result, so as to obtain a target session data set.

The prediction module 40 is configured to input the target session data set into a prediction model to perform prevalence prediction, so as to obtain a prediction result.

In one embodiment, the session classification model includes a long-short term memory sub-model and a multi-layer perceptron sub-model, as shown in FIG. 11, classification module 20 includes: a first extraction unit 200, a conversion unit 201, a classification unit 202, wherein:

the first extracting unit 200 is specifically configured to perform feature extraction on the session data set to obtain session features and query features;

the conversion unit 201 is specifically configured to input the query feature to the long-short-term memory submodel for vector conversion, so as to obtain a vector feature corresponding to the query feature;

The classification unit 202 is specifically configured to input the vector features and the session features corresponding to the query features into the multi-layer perceptron sub-model to classify, so as to obtain a classification result.

In one embodiment, as shown in fig. 12, the screening module 30 includes: a determining unit 300, a constructing unit 301, wherein:

a determining unit 300, specifically configured to determine, according to the classification result, a session belonging to the rare disease-related type and a session belonging to the hot news-related type in the session data set;

the construction unit 301 is specifically configured to construct the target session data set according to the rare disease-related type session and the hot news-related type session.

In one embodiment, as shown in fig. 13, the extraction module 10 includes: a second extraction unit 100, a third extraction unit 101, wherein:

a second extracting unit 100, configured to extract, from a history database of the search engine, a rare disease-related query matching the first level keyword according to the first level keyword in the keyword table;

a third extraction unit 101, specifically configured to extract a session data set from the history database according to the rare disease-related query matched by the first-level keyword.

In one embodiment, the third extraction unit 101 is specifically configured to extract, from the history database, other queries adjacent to the rare disease-related query according to the rare disease-related query matched by the first-level keyword, and to extract a session data set from the history database according to the rare disease-related query and the other queries adjacent to the rare disease-related query.

In one embodiment, as shown in fig. 14, the apparatus further includes:

an obtaining module 50, configured to obtain an initial keyword table;

the expansion module 60 is configured to perform keyword expansion on the initial keyword table to obtain a keyword table.

In one embodiment, as shown in fig. 15, the expansion module 60 includes a processing unit 600, an adding unit 601, where:

the processing unit 600 is specifically configured to process each keyword in the initial keyword table to obtain a new keyword; the processing comprises at least one of symbol text, letter text, digital text, abbreviation text and preset characters for each keyword;

the adding unit 601 is specifically configured to add a new keyword to the initial keyword table, thereby obtaining a keyword table.

The modules in the above prevalence prediction device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in FIG. 1. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing keyword data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of predicting prevalence.

It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely a block diagram of part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied. A particular computer device may include more or fewer components than shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:

In one embodiment, the processor when executing the computer program further performs the steps of:

extracting other queries adjacent to the rare disease-related query from the historical database according to the rare disease-related query matched by the first-level keyword;

extracting a session data set from the historical database according to the rare disease-related query and the other queries adjacent to the rare disease-related query.

acquiring an initial keyword list;

In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, the computer program when executed by the processor further performs the steps of:

acquiring an initial keyword list;

In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:

acquiring an initial keyword list;

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.

Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may carry out the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM may take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application. They are described in relative detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application is defined by the appended claims.

Claims

1. A method of predicting prevalence, the method comprising:

And inputting the target session data set into a prediction model to perform prevalence prediction to obtain a prediction result.

2. The method according to claim 1, wherein the session classification model includes a long-term and short-term memory sub-model and a multi-layer perceptron sub-model, the inputting the session data set into the session classification model for session classification, and obtaining a classification result includes:

inputting the query features into the long-short-term memory submodel for vector conversion to obtain vector features corresponding to the query features;

and inputting the vector features and the session features corresponding to the query features into the multi-layer perceptron sub-model for classification, and obtaining the classification result.

3. The method according to claim 1 or 2, wherein the screening the session data set according to the classification result to obtain a target session data set comprises:

determining a session belonging to a rare disease related type and a hot news related type in the session data set according to the classification result;

and constructing the target session data set according to the rare disease related type session and the hot news related type session.

4. The method according to claim 1 or 2, wherein the extracting the session dataset from the historical database of the search engine according to the keyword table comprises:

and extracting the session data set from the historical database according to the rare disease related query matched with the first-level keyword.

5. The method of claim 4, wherein the extracting the session dataset from the historical database according to the rare related query matching the first level keyword comprises:

extracting other queries adjacent to the rare disease-related query from the history database according to the rare disease-related query matched by the first-level keyword;

6. The method according to claim 1, wherein the method further comprises:

acquiring an initial keyword list;

7. The method of claim 6, wherein keyword expansion is performed on the initial keyword table to obtain the keyword table, and the method comprises:

processing each keyword in the initial keyword list to obtain a new keyword; the processing comprises at least one of symbol text, letter text, digital text, abbreviation text and preset characters for each keyword;

8. A device for predicting prevalence, the device comprising:

the classification module is used for inputting the session data set into a session classification model to perform session classification to obtain a classification result; the classification model is used for classifying the intention of the session;

and the prediction module is used for inputting the target session data set into a prediction model to perform prevalence prediction to obtain a prediction result.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.