CN111782821B - Medical hotspot prediction method and device based on FM model and computer equipment - Google Patents

Medical hotspot prediction method and device based on FM model and computer equipment

Info

Publication number
CN111782821B
Authority
CN
China
Prior art keywords
medical
prediction
medical entity
model
entity names
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010621766.7A
Other languages
Chinese (zh)
Other versions
CN111782821A (en)
Inventor
曹立宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010621766.7A (CN111782821B)
Priority to PCT/CN2020/118914 (WO2021139271A1)
Publication of CN111782821A
Application granted
Publication of CN111782821B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of artificial intelligence and discloses a method, a device and computer equipment for predicting medical hotspots based on an FM (Factorization Machine) model. The method applies the FM model to the prediction of medical research hotspots for the first time; the FM model is suited to sparse features and can mine combination relationships between features. In addition, compared with knowledge-graph prediction and SVD-based prediction, the FM-based method can incorporate additional structured features, which helps the model achieve better results. The method collects statistics on popular research relationships in the medical field and uses the FM model to predict research hotspots that may emerge in the future, so that the entities most likely to be studied in connection with a given disease can be predicted. On the one hand, this makes it easier for doctors to search current research content; on the other hand, it provides doctors with information on potential research hotspots. The application can also be applied in the blockchain field, for example by storing trained models in a blockchain network.

Description

Medical hotspot prediction method and device based on FM model and computer equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, and a computer device for predicting a medical hotspot based on an FM model.
Background
Medical research hotspot prediction uses text mining to construct knowledge graphs and models from large volumes of medical literature, captures the research relationships that already exist, and then infers relationships between biomedical concepts that may emerge in the future. Predictions of future medical research hotspots are a valuable resource for researchers exploring research topics.
However, the inventors found that the number of publications in the medical field is growing rapidly, and as the publication rate increases, researchers find it difficult to keep up with the work relevant to their own studies and therefore difficult to follow and discover novel research content.
Existing methods for predicting medical research hotspots include knowledge-graph-based methods, but these use only local information in the graph, so their predictions are not accurate enough. There are also SVD-based algorithms, but they are poorly interpretable and it is difficult to find suitable hyperparameters for them, which limits their practical application.
Disclosure of Invention
The main purpose of the application is to provide a medical hotspot prediction method, device and computer equipment based on an FM model, and aims to solve the technical problems of low prediction accuracy or poor interpretability in the prior art.
In order to achieve the above object, the present application proposes a method for predicting a medical hotspot based on an FM model, including:
acquiring two medical entity names to be predicted;
constructing prediction features suitable for a preset prediction model of medical research hotspots according to the two medical entity names and the feature format of the prediction model, wherein the prediction model is obtained by training based on an FM model, the prediction features are sparse vectors, the value at the position corresponding to each medical entity name in the sparse vectors is 1, and all other values are 0;
inputting the prediction features into the prediction model for calculation to obtain a prediction probability value, wherein the prediction probability value represents the correlation between the two medical entity names, and the larger the prediction probability value, the stronger the correlation between the two medical entity names;
judging whether the prediction probability value is larger than a preset threshold;
if yes, determining that the combination of the two medical entity names is a medical research hotspot.
Further, before the step of constructing the prediction features suitable for the prediction model according to the two medical entity names and the feature format of the preset prediction model of medical research hotspots, the method includes:
acquiring literature data recording medical knowledge;
searching the literature data for preset medical entity names, and extracting preset association relationships among the medical entity names found in the literature data;
constructing positive sample data according to the extracted medical entity names having association relationships and the feature format; constructing negative sample data, in the same format as the positive sample data, from medical entity names that have no association relationship, wherein the negative samples are constructed by random combination and sampling among entities;
and training a model based on the FM model with the positive sample data and the negative sample data to obtain the prediction model for outputting the prediction probability value.
Further, the step of acquiring literature data recording medical knowledge includes:
searching the Internet for a medical paper website;
if one is found, acquiring the establishment time and the number of visits of the medical paper website;
calculating the length of time between the establishment time and the current time;
judging whether the number of visits is larger than the visit-count threshold corresponding to that length of time;
if so, downloading the titles and abstracts of papers from the medical paper website as the literature data.
Further, the step of searching the literature data for preset medical entity names and extracting preset association relationships among the medical entity names found in the literature data includes:
searching the abstract of a paper for a preset abbreviation format, and extracting the abbreviation in the abbreviation format together with the full medical entity name that precedes it;
replacing the abbreviation throughout the paper with the full medical entity name;
searching the abstract, after abbreviation replacement, for the preset medical entity names, and extracting the medical entity names having preset association relationships.
Further, the step of searching the literature data for preset medical entity names and extracting preset association relationships among the medical entity names found in the literature data includes:
dividing the literature data into sentences;
extracting the medical entity names in each sentence;
if two medical entity names appear in the same sentence, determining that the two medical entity names in the sentence have an association relationship;
if more than two medical entity names appear in the same sentence, taking a first medical entity name of a preset type as the subject and pairing it with each of the other, second medical entity names to obtain multiple pairs of medical entity names having association relationships.
Further, the step of extracting the medical entity names in each sentence includes:
semantically encoding the text of each sentence with the pre-trained model BERT;
searching the resulting semantic codes for a first semantic code whose similarity to the semantic code of a preset medical entity name is larger than a preset similarity threshold and/or is the maximum;
and converting the name corresponding to the first semantic code into the corresponding preset medical entity name.
The application also provides a device for predicting research hotspots based on the FM model, including:
a first acquisition unit, configured to acquire two medical entity names to be predicted;
a construction unit, configured to construct prediction features suitable for a preset prediction model of medical research hotspots according to the two medical entity names and the feature format of the prediction model, wherein the prediction model is obtained by training based on an FM model, the prediction features are sparse vectors, the value at the position corresponding to each medical entity name in the sparse vectors is 1, and all other values are 0;
a calculation unit, configured to input the prediction features into the prediction model for calculation to obtain a prediction probability value, wherein the prediction probability value represents the correlation between the two medical entity names, and the larger the prediction probability value, the stronger the correlation between the two medical entity names;
a judging unit, configured to judge whether the prediction probability value is larger than a preset threshold;
and a determination unit, configured to determine that the combination of the two medical entity names is a medical research hotspot if the prediction probability value is larger than the preset threshold.
Further, the device for predicting research hotspots based on the FM model further includes:
a second acquisition unit, configured to acquire literature data recording medical knowledge;
a search and extraction unit, configured to search the literature data for preset medical entity names and extract preset association relationships among the medical entity names found in the literature data;
a sample generation unit, configured to construct positive sample data according to the extracted medical entity names having association relationships and the feature format, and to construct negative sample data, in the same format as the positive sample data, from medical entity names that have no association relationship, wherein the negative samples are constructed by random combination and sampling among entities;
and a training unit, configured to train a model based on the FM model with the positive sample data and the negative sample data to obtain the prediction model.
The present application also provides a computer device including a memory and a processor, the memory storing a computer program, and the processor implementing the steps of any of the methods described above when executing the computer program.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the methods described above.
With the method, device and computer equipment for predicting medical hotspots based on the FM model provided by the present application, the FM model is applied to the prediction of medical research hotspots for the first time; it is suited to sparse-vector features and can mine correlations between features. In addition, compared with knowledge-graph prediction and SVD-based prediction, the FM-based method can incorporate additional structured features, which helps the model achieve better results. The method collects statistics on popular research relationships in the medical field and uses the FM model to predict research hotspots that may emerge in the future, so that the entities most likely to be studied in connection with a given disease can be predicted. On the one hand, this makes it easier for doctors to search current research content; on the other hand, it provides doctors with information on potential research hotspots.
Drawings
FIG. 1 is a flowchart of a method for predicting medical hotspots based on an FM model according to an embodiment of the present application;
FIG. 2 is a block diagram schematically illustrating a structure of a prediction apparatus for medical hotspots based on an FM model according to an embodiment of the present application;
FIG. 3 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The realization of the objects, the functional features and the advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to fig. 1, an embodiment of the present application provides a method for predicting a medical hotspot based on an FM model, including:
S1, acquiring two medical entity names to be predicted;
S2, constructing prediction features suitable for a preset prediction model of medical research hotspots according to the two medical entity names and the feature format of the prediction model, wherein the prediction model is obtained by training based on an FM model, the prediction features are sparse vectors, the value at the position corresponding to each medical entity name in the sparse vectors is 1, and all other values are 0;
S3, inputting the prediction features into the prediction model for calculation to obtain a prediction probability value, wherein the prediction probability value represents the correlation between the two medical entity names, and the larger the prediction probability value, the stronger the correlation between the two medical entity names;
S4, judging whether the prediction probability value is larger than a preset threshold;
S5, if so, determining that the combination of the two medical entity names is a medical research hotspot.
The execution subject of the embodiments of the present application may be a computer device such as a server having data processing capabilities.
As described in step S1 above, the medical entity names may cover various categories, such as diseases, drugs, surgeries, tests and examinations, genes, microorganisms and immune factors. In this embodiment, one of the two medical entity names belongs to the disease category and the other belongs to another category.
As described in step S2 above, the prediction model is trained based on the FM (Factorization Machine) model. The FM model mines correlations between features by learning interactions between feature vectors, which has two advantages: correlations among features can be mined well even when the data is highly sparse, including for feature combinations that never appear in the training samples; and both evaluating the objective function and learning with stochastic gradient descent can be completed in linear time. The feature format of the prediction model may include several modules. The first and second modules are the one-hot encodings of the medical entity names; a one-hot encoding is a sparse vector in which the position corresponding to the medical entity name is 1 and all other positions are 0. The later modules are vector encodings defined in other ways; they can be set according to actual needs, provided they are compatible with the FM model, and are not described further here. In this embodiment, the first module is the one-hot encoding of a disease and the second module is the one-hot encoding of a medical entity name of another category. Once the feature format and the two specific medical entity names are determined, the prediction features suitable for the prediction model can be constructed.
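As an illustration of the linear-time property mentioned above, the following minimal sketch scores one sparse binary input with a second-order factorization machine. The function names, the toy vocabularies and the parameter values are assumptions made for this example and do not come from the application; a trained model would supply `w0`, `w` and `V`.

```python
def active_indices(disease, entity, diseases, entities):
    """Positions set to 1 in the sparse vector: module 1 (disease), module 2 (entity)."""
    return [diseases.index(disease), len(diseases) + entities.index(entity)]


def fm_score(active, w0, w, V):
    """Second-order FM score for a binary sparse input.

    active: indices of the features whose value is 1 (all others are 0).
    w0: global bias; w: per-feature weights;
    V: per-feature latent vectors, each of length k.
    """
    k = len(V[0])
    linear = w0 + sum(w[i] for i in active)
    # Pairwise term via the standard FM identity
    #   sum_{i<j} <v_i, v_j> x_i x_j
    #     = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2],
    # which costs O(k * number of active features).
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] for i in active)
        s_sq = sum(V[i][f] ** 2 for i in active)
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


# Hypothetical vocabularies and parameters, for illustration only.
diseases = ["diabetes", "hypertension"]
entities = ["metformin", "insulin", "ACE inhibitor"]
w0 = 0.1
w = [0.2, 0.0, 0.3, 0.0, 0.0]
V = [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]

active = active_indices("diabetes", "metformin", diseases, entities)
score = fm_score(active, w0, w, V)
```

With only two active features, the pairwise term reduces to the dot product of the two latent vectors, which is how the model can score a disease and entity pair that never co-occurred in training.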
As described in steps S3 to S5 above, the prediction features are input into the prediction model, which computes a prediction probability value between 0 and 1; whether the combination of the two medical entity names to be predicted is a medical research hotspot is then determined from this value. The preset threshold is set manually and may be an empirical value. The prediction probability value represents the correlation between the two medical entity names: the larger the value, the stronger the correlation, and the more likely it is that the combination of the two medical entity names is a current or future medical research hotspot.
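Steps S3 to S5 can be sketched as follows. The application does not state how the model output is mapped to the range 0 to 1; the sigmoid used here is an assumption, as are the function names and the default threshold of 0.5.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function (assumed mapping to a probability)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def is_hotspot(raw_score, threshold=0.5):
    """Return (probability, decision) for one raw model score.

    threshold is the manually chosen preset value; 0.5 is only a placeholder.
    """
    probability = sigmoid(raw_score)
    return probability, probability > threshold


probability, decision = is_hotspot(2.0, threshold=0.8)
```

Raising the threshold trades recall for precision: fewer entity pairs are flagged, but those that are flagged have a stronger predicted correlation.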
In one implementation, before step S2 of constructing the prediction features suitable for the prediction model according to the two medical entity names and the feature format of the preset prediction model of medical research hotspots, the method includes:
acquiring literature data recording medical knowledge;
searching the literature data for preset medical entity names, and extracting preset association relationships among the medical entity names found in the literature data;
constructing positive sample data according to the extracted medical entity names having association relationships and the feature format; constructing negative sample data, in the same format as the positive sample data, from medical entity names that have no association relationship, wherein the negative samples are constructed by random combination and sampling among entities;
and training a model based on the FM model with the positive sample data and the negative sample data to obtain the prediction model for outputting the prediction probability value.
In this embodiment, the literature data recording medical knowledge is mainly medical papers, which may be downloaded from a designated medical paper website. It may also be patient treatment plan data prescribed by doctors, which may be downloaded from the databases of individual hospitals; downloading from a hospital database generally requires the hospital's authorization. The medical entity names are extracted mainly by keyword retrieval. During extraction, if two medical entity names meet a preset requirement, they are determined to have an association relationship; the preset requirement may be, for example, that the two medical entity names appear in the same sentence. In this embodiment, positive sample data indicates that two medical entity names have an association relationship, and negative sample data indicates that they do not. The feature format used in the FM model includes several modules. Specifically, the first module is the one-hot encoding of the disease name, the second module is the one-hot encoding of the other medical entity name, and the third module encodes the type of the entity in the second module (for example, if the one-hot code 0100... in the second module is the name of a drug, and the number 2 denotes the drug type, then the third module takes the value 2). Later modules can add features such as the historical number of publications on the disease, impact-factor information, citation counts, and hypernym-hyponym relationships between diseases; these features can improve the training of the FM model.
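The modular feature format described above can be sketched as follows. The vocabularies, the `ENTITY_TYPES` list and the `build_feature_row` helper are hypothetical names introduced for illustration. The source example gives the third module as a type number; this sketch one-hot encodes the type instead, which is an assumption, and appends any extra numeric features unchanged.

```python
# Hypothetical entity-type vocabulary for the third module.
ENTITY_TYPES = ["disease", "drug", "gene"]


def one_hot(index, size):
    """Sparse-style vector with a single 1 at the given position."""
    vec = [0] * size
    vec[index] = 1
    return vec


def build_feature_row(disease, entity, entity_type, diseases, entities, extra=()):
    """Concatenate the modules of one sample into a single feature row."""
    row = one_hot(diseases.index(disease), len(diseases))               # module 1: disease
    row += one_hot(entities.index(entity), len(entities))               # module 2: other entity
    row += one_hot(ENTITY_TYPES.index(entity_type), len(ENTITY_TYPES))  # module 3: entity type
    row += list(extra)                                                   # later modules, e.g. publication count
    return row


diseases = ["diabetes", "asthma"]
entities = ["metformin", "insulin", "TP53"]
row = build_feature_row("diabetes", "metformin", "drug", diseases, entities, extra=[120])
```

In practice, large raw counts such as publication numbers would usually be scaled before training so that they do not dominate the binary modules.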
The positive sample data is constructed from the medical entity names that have association relationships in the literature data. The negative sample data can be constructed by random combination and sampling among entities: a large number of unrelated entities are placed in a database, the entries in the database are combined at random, and samples are drawn at certain intervals to obtain the negative sample data. In another embodiment, the negative sample data is constructed from medical entity names that have no association relationship in the literature data. In this embodiment, the amounts of positive and negative sample data are equal. In one specific illustration, each row is one feature vector: the first module is the part in the first frame (disease), which is the one-hot encoding of the disease name; the second module is the part in the second frame (entity), which is the one-hot encoding of a medical entity name other than a disease name; and the remaining frames hold the other vectors, such as the historical number of publications on the disease.
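A minimal sketch of the random-combination approach to negative sampling is given below. The function name, the seed and the toy data are assumptions for illustration; the sketch draws as many negatives as there are positives, matching the equal amounts used in this embodiment.

```python
import random


def sample_negatives(diseases, entities, positives, n, seed=0):
    """Randomly pick n (disease, entity) pairs that are not known positives."""
    positive_set = set(positives)
    candidates = [
        (d, e) for d in diseases for e in entities if (d, e) not in positive_set
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, min(n, len(candidates)))


diseases = ["diabetes", "asthma"]
entities = ["metformin", "insulin", "salbutamol"]
positives = [("diabetes", "metformin"), ("diabetes", "insulin"), ("asthma", "salbutamol")]
negatives = sample_negatives(diseases, entities, positives, n=len(positives))
```

A pair that has not been observed in the literature is not necessarily unrelated, so negatives built this way are noisy labels; that is an inherent property of this sampling strategy.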
In one embodiment, the step of acquiring literature data recording medical knowledge includes:
searching the Internet for a medical paper website;
if one is found, acquiring the establishment time and the number of visits of the medical paper website;
calculating the length of time between the establishment time and the current time;
judging whether the number of visits is larger than the visit-count threshold corresponding to that length of time;
if so, downloading the titles and abstracts of papers from the medical paper website as the literature data.
In this embodiment, when acquiring literature data, medical paper websites are first searched for on the Internet. This is done by traversing websites, entering the homepage of each website to read its introduction, and performing semantic recognition on the introduction; a website whose content is medical and which offers a paper download function is determined to be a medical paper website. To improve the credibility of the literature data, unqualified medical paper websites are excluded as follows: the establishment time of the medical paper website is obtained, the length of time between the establishment time and the current time is calculated, and the visit-count threshold corresponding to that length of time is looked up in a preset threshold list (a mapping table from lengths of time to visit-count thresholds). When the number of visits is larger than the threshold, the website is one that many people visit and meets the credibility requirement for the literature data. Papers published on medical paper websites generally contain research results and relatively cutting-edge medical knowledge. However, because the key points of a paper are mainly in its abstract and the full text is long, only the titles and abstracts of the medical papers are downloaded, which speeds up the later extraction of medical entity names and reduces the amount of computation. In another embodiment, the medical paper website is a designated paper website; no web-wide search is needed, and the literature data is downloaded directly from the designated medical paper website.
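The credibility check described above can be sketched as a lookup in a threshold table followed by a comparison. The table values, the year-based age buckets and the function names are assumptions for illustration; the application only specifies that a mapping from length of time to visit-count threshold exists.

```python
from datetime import date

# Hypothetical mapping table: (minimum site age in years, required visit count).
THRESHOLDS = [(0, 1_000), (1, 10_000), (5, 100_000)]


def visit_threshold(age_years):
    """Return the visit-count threshold for the oldest bucket the site qualifies for."""
    required = THRESHOLDS[0][1]
    for min_age, count in THRESHOLDS:
        if age_years >= min_age:
            required = count
    return required


def is_credible(established, visits, today):
    """True if the site's visit count exceeds the threshold for its age."""
    age_years = (today - established).days / 365.25
    return visits > visit_threshold(age_years)


today = date(2021, 1, 1)
old_site_ok = is_credible(date(2015, 1, 1), 150_000, today)
mid_site_ok = is_credible(date(2018, 1, 1), 5_000, today)
```

Tying the threshold to site age means a long-established site must show proportionally more traffic than a new one before its papers are downloaded.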
In one embodiment, the step of searching the literature data for preset medical entity names and extracting preset association relationships among the medical entity names found in the literature data includes:
searching the abstract of a paper for a preset abbreviation format, and extracting the abbreviation in the abbreviation format together with the full medical entity name that precedes it;
replacing the abbreviation throughout the paper with the full medical entity name;
searching the abstract, after abbreviation replacement, for the preset medical entity names, and extracting the medical entity names having preset association relationships.
This embodiment mainly concerns the abstracts of papers. In a well-formed paper, an abbreviation is introduced in a fixed format: the first occurrence of the full name is followed by brackets containing the corresponding abbreviation. To avoid missing entities that appear only as abbreviations, the brackets are located first, and the words in front of the brackets are checked against the preset medical entity names. If they match, the abbreviation is associated with that preset medical entity name, the abbreviation is replaced throughout the text, and the medical entity names are then extracted, which improves the accuracy and completeness of the extraction.
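The bracket-based abbreviation handling can be sketched with regular expressions. The function name, the abbreviation pattern (2 to 10 letters, digits or hyphens inside parentheses) and the suffix match against the preset names are assumptions for illustration.

```python
import re


def expand_abbreviations(text, entity_names):
    """Replace abbreviations defined as 'Full Name (ABBR)' with the full name."""
    mapping = {}
    for match in re.finditer(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)", text):
        abbreviation = match.group(1)
        before = text[:match.start()].rstrip().lower()
        # Accept the abbreviation only if a preset entity name directly precedes it.
        for name in entity_names:
            if before.endswith(name.lower()):
                mapping[abbreviation] = name
                break
    for abbreviation, name in mapping.items():
        # Drop the defining "(ABBR)", then replace every standalone use.
        text = re.sub(r"\s*\(" + re.escape(abbreviation) + r"\)", "", text)
        text = re.sub(r"\b" + re.escape(abbreviation) + r"\b", lambda _m, n=name: n, text)
    return text


abstract = (
    "Chronic obstructive pulmonary disease (COPD) is common. "
    "COPD is linked to smoking."
)
expanded = expand_abbreviations(abstract, ["chronic obstructive pulmonary disease"])
```

After expansion, a plain keyword search for the preset names also finds the sentences that originally contained only the abbreviation.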
In one embodiment, the step of searching the literature data for preset medical entity names and extracting preset association relationships among the medical entity names found in the literature data includes:
dividing the literature data into sentences;
extracting the medical entity names in each sentence;
if two medical entity names appear in the same sentence, extracting the two medical entity names in the sentence as medical entity names having a preset association relationship;
if more than two medical entity names appear in the same sentence, taking a first medical entity name of a preset type as the subject, pairing it with each of the other, second medical entity names to obtain multiple pairs of medical entity names having association relationships, and extracting those pairs.
In this embodiment, both Chinese and foreign-language literature data are assumed to be well formed, and only medical entity names appearing in the same sentence are determined to have the preset association relationship. Sentences can be divided by identifying punctuation marks in the literature data: when a mark that ends a sentence, such as a period or an exclamation mark, is detected, the text is split there. After sentence division, if a sentence contains only one medical entity name, that name is ignored; if a sentence contains two medical entity names, the two are determined to have a preset association relationship. If a sentence contains more than two medical entity names, for example three, of which one is a name a of the preset disease type and the other two are names b and c of other types, then an association relationship between a and b and an association relationship between a and c are obtained.
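The sentence-level pairing rule can be sketched as follows. The function names and the simple substring matching are assumptions for illustration; the sketch also assumes that the subject is the disease name appearing earliest in the sentence and that it is paired only with entities of other categories.

```python
import re


def split_sentences(text):
    """Split on sentence-ending punctuation (ASCII followed by space, or CJK)."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    return [part.strip() for part in parts if part.strip()]


def extract_pairs(text, disease_names, other_names):
    """Return (disease, entity) pairs that co-occur within one sentence."""
    pairs = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        diseases = [d for d in disease_names if d.lower() in lowered]
        others = [o for o in other_names if o.lower() in lowered]
        if not diseases or not others:
            continue  # a sentence without both kinds of entity yields no pair
        # The disease mentioned first in the sentence is taken as the subject.
        subject = min(diseases, key=lambda d: lowered.find(d.lower()))
        pairs.extend((subject, other) for other in others)
    return pairs


text = "Metformin and insulin are used to treat diabetes. Asthma is common! TP53 is a gene."
pairs = extract_pairs(text, ["diabetes", "asthma"], ["metformin", "insulin", "TP53"])
```

Each resulting pair corresponds to one positive sample for training.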
In one embodiment, the step of extracting the medical entity names in each sentence includes:
semantically encoding the text of each sentence with the pre-trained model BERT;
searching the resulting semantic codes for a first semantic code whose similarity to the semantic code of a preset medical entity name is larger than a preset similarity threshold and/or is the maximum;
and converting the name corresponding to the first semantic code into the corresponding preset medical entity name.
In this embodiment, the pre-trained model BERT is the model introduced in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". "Pre-training" means that BERT is a pre-trained model: through early unsupervised training on a large corpus, it learns a large amount of prior knowledge about language, syntax and word sense for use in downstream tasks. "Bidirectional" means that BERT uses a bidirectional language model, so that contextual knowledge is fused more effectively. In short, BERT is a deep bidirectional pre-trained language understanding model that uses Transformers as its feature extractor, and it learns rich linguistic information during pre-training. Semantic encoding is the process of vectorizing the text of each sentence. Each preset medical entity name also has a corresponding semantic code. The semantic codes of each sentence are searched for a first semantic code whose similarity to the semantic code of a preset medical entity name is larger than the preset similarity threshold and is the maximum, and the name corresponding to the first semantic code is then converted into the corresponding preset medical entity name (that is, the preset medical entity name whose semantic code has the maximum similarity to the first semantic code, above the preset similarity threshold). In this way, non-standard medical entity names can be extracted and corrected to standard medical entity names during extraction, which improves the accuracy of the subsequent calculation of association probabilities between medical entity names.
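The matching step can be sketched independently of BERT itself. In the sketch below, the vectors stand in for semantic codes that a BERT encoder would produce; the two-dimensional toy vectors, the cosine similarity measure, the 0.8 threshold and the function names are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def normalize_mention(mention_vec, entity_vecs, threshold=0.8):
    """Map a mention to the most similar preset entity name, or None if below threshold."""
    best_name, best_similarity = None, -1.0
    for name, vec in entity_vecs.items():
        similarity = cosine(mention_vec, vec)
        if similarity > best_similarity:
            best_name, best_similarity = name, similarity
    return best_name if best_similarity > threshold else None


# Placeholder embeddings; real ones would come from the encoder.
entity_vecs = {"diabetes mellitus": [1.0, 0.0], "hypertension": [0.0, 1.0]}
matched = normalize_mention([0.9, 0.1], entity_vecs)
unmatched = normalize_mention([1.0, 1.0], entity_vecs)
```

Requiring both the maximum similarity and a minimum threshold keeps a mention that resembles no preset name from being forced onto the nearest one.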
In this embodiment, the method for predicting medical hotspots based on the FM model may be applied in the blockchain field, where the foregoing prediction model, the pre-training model BERT, and the like are stored in a blockchain network. Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, each data block containing a batch of network transaction information that is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
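As a rough illustration of data blocks being linked by cryptographic means, the sketch below chains blocks by SHA-256 hash. It is not the storage scheme of the application, which does not specify one; `make_block`, `chain_is_valid`, and the payload strings are hypothetical, and consensus, networking, and signatures are omitted.

```python
import hashlib
import json

def make_block(payload, prev_hash):
    """Build a block whose hash covers its payload and the previous block's hash."""
    body = json.dumps({"payload": payload, "prev_hash": prev_hash}, sort_keys=True)
    return {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }

def chain_is_valid(chain):
    """Recompute every hash and check that each block points at its predecessor."""
    for i, block in enumerate(chain):
        expected = make_block(block["payload"], block["prev_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

# Hypothetical payloads: references to the stored models.
genesis = make_block("FM prediction model v1", prev_hash="0" * 64)
second = make_block("BERT encoder reference", prev_hash=genesis["hash"])
chain = [genesis, second]
```

Changing the payload of any block changes its recomputed hash, so `chain_is_valid` rejects the altered chain; this is the anti-counterfeiting property the paragraph refers to.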
The blockchain underlying platform may include processing modules for user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for managing the identity information of all blockchain participants, including maintaining public and private key generation (account management), key management, and maintaining the correspondence between a user's real identity and blockchain address (authority management); where authorized, it also supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk control audit). The basic service module is deployed on all blockchain node devices and verifies the validity of service requests, recording valid requests to storage once consensus on them is reached; for a new service request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits the encrypted information completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for registering and issuing contracts, triggering contracts, and executing contracts; a developer can define contract logic in a programming language and publish it to the blockchain (contract registration), execution is triggered by a key or another event according to the logic of the contract clauses to complete the contract logic, and the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment during product release, configuration modification, contract setting, cloud adaptation, and visual output of real-time status during product operation, for example: alarms, monitoring network conditions, and monitoring node device health status.
The application is also operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the FM-model-based method for predicting medical hotspots described above, the FM model is applied for the first time to predicting research hotspots in the medical field. The FM model is well suited to sparse features and can mine the combination relationships between features. In addition, compared with knowledge graph prediction and SVD-based prediction, the method can add structural features, and the additional features help the model achieve better results. The method counts popular research relations in the medical field and, based on the FM model, predicts research hotspots that may emerge in the future, so that the entities most likely to be studied in connection with a given disease can be predicted. On the one hand, this helps doctors search current research; on the other hand, it gives doctors information on potential research hotspots.
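For reference, the standard second-order factorization machine is usually written as below. The application does not spell out the formula in this passage, so the symbols (global bias w_0, per-feature weights w_i, k-dimensional latent vectors v_i) and the sigmoid used to turn the score into a probability are assumptions based on the common formulation.

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j ,
\qquad
p = \sigma\bigl(\hat{y}(\mathbf{x})\bigr) = \frac{1}{1 + e^{-\hat{y}(\mathbf{x})}}
```

When the feature vector has a 1 only at the positions of two entities a and b, the score reduces to w_0 + w_a + w_b + ⟨v_a, v_b⟩, so the learned latent vectors carry the pairwise combination relationship even for pairs that never co-occurred in training.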
Referring to fig. 2, the present application further provides a prediction apparatus for research hotspots based on FM models, including:
a first obtaining unit 10 for obtaining two medical entity names to be predicted;
the writing unit 20 is configured to write a prediction feature applicable to a prediction model according to the two medical entity names and a feature format of the prediction model of a preset medical research hotspot, where the prediction model is a model obtained by training based on an FM model, the prediction feature is a sparse vector, the value at each position corresponding to one of the medical entity names in the sparse vector is 1, and the rest are 0;
a calculating unit 30, configured to input the prediction feature into the prediction model to perform calculation, so as to obtain a prediction probability value, where the prediction probability value is used to represent a correlation between two medical entity names, and the greater the prediction probability value, the stronger the correlation between the two medical entity names is represented;
a judging unit 40, configured to judge whether the predicted probability value is greater than a preset threshold;
and a determining unit, configured to determine that the two medical entity names combined together form a medical research hotspot if the predicted probability value is greater than the preset threshold.
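The flow through the units above (write the sparse feature, compute the probability, compare with the threshold) can be sketched as follows. This is an illustrative sketch only: the vocabulary, weights, latent vectors, and 0.5 threshold are made-up values rather than trained parameters, the function names are hypothetical, and the sigmoid output is an assumption about how the probability is obtained.

```python
import math

def one_hot_pair(name_a, name_b, vocab):
    """Sparse feature, stored as the sorted indices of the positions set to 1."""
    return sorted({vocab[name_a], vocab[name_b]})

def fm_probability(active, w0, w, v):
    """Second-order FM score for a binary sparse vector, squashed by a sigmoid."""
    score = w0 + sum(w[i] for i in active)
    for pos, i in enumerate(active):
        for j in active[pos + 1:]:
            # Pairwise interaction: inner product of the two latent vectors.
            score += sum(a * b for a, b in zip(v[i], v[j]))
    return 1.0 / (1.0 + math.exp(-score))

def is_hotspot(prob, threshold=0.5):
    """True when the predicted probability exceeds the preset threshold."""
    return prob > threshold

# Made-up parameters for three entities (assumption, not trained values).
VOCAB = {"diabetes": 0, "metformin": 1, "aspirin": 2}
W0 = -1.0
W = [0.2, 0.3, 0.1]
V = [[1.0, 0.5], [1.2, 0.4], [-0.8, 0.1]]
```

Storing only the active indices keeps the cost proportional to the number of non-zero features, which is why the FM model handles very large, sparse entity vocabularies comfortably.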
In an embodiment, the prediction apparatus for research hotspots based on the FM model further includes:
a second acquisition unit configured to acquire document data in which medical knowledge is recorded;
the searching and extracting unit is used for searching preset medical entity names in the literature data and extracting preset association relations among the searched medical entity names in the literature data;
a sample generating unit, which is used for writing positive sample data according to the extracted medical entity names with association relations and the feature format, and for constructing negative sample data, in the same format as the positive sample data, from medical entity names that have no association relation, wherein the negative samples are constructed by randomly combining and sampling entities;
a training unit for training the model based on the FM model by using the positive sample data and the negative sample data to obtain the prediction model for outputting the prediction probability value.
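Constructing negative samples by random combination of entities might look like the sketch below. The function name `build_samples`, the label convention (1 for extracted pairs, 0 for random pairs), and the fixed seed are assumptions made for illustration; the application does not state how many negatives are drawn or how they are labelled.

```python
import random

def build_samples(positive_pairs, entities, n_negative, seed=0):
    """Return (pair, label) samples: extracted pairs get label 1, randomly
    combined pairs that were never extracted get label 0.

    `entities` is assumed to be a list of distinct names."""
    rng = random.Random(seed)
    positives = {frozenset(p) for p in positive_pairs}
    # Cap the request at the number of unordered pairs that are not positives.
    max_negative = len(entities) * (len(entities) - 1) // 2 - len(positives)
    negatives = set()
    while len(negatives) < min(n_negative, max_negative):
        pair = frozenset(rng.sample(entities, 2))
        if pair not in positives:
            negatives.add(pair)
    samples = [(tuple(sorted(p)), 1) for p in positives]
    samples += [(tuple(sorted(p)), 0) for p in negatives]
    return samples
```

Each sample pair can then be encoded in the same sparse feature format as the positive samples before being passed to FM training.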
In one embodiment, the second obtaining unit includes:
the searching module is used for searching for medical paper websites on the Internet;
the acquisition module is used for acquiring the establishment time and the number of visits of the medical paper website if the medical paper website is found;
the calculation module is used for calculating the length of time between the establishment time and the current time;
the judging module is used for judging whether the number of visits is greater than a count threshold corresponding to the length of time;
and the downloading module is used for downloading, if so, the titles and abstracts of papers from the medical paper website and taking the titles and abstracts as the document data.
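The website screening performed by these modules compares the site's access count with a threshold that depends on how long the site has existed. A minimal sketch follows; the linear rule of ten visits per day of age is purely a hypothetical mapping, since the application does not say how the threshold is derived from the time length, and the function names are illustrative.

```python
from datetime import date

def visit_threshold(age_days, visits_per_day=10):
    """Hypothetical rule: the required visit count grows linearly with site age."""
    return age_days * visits_per_day

def should_download(established, visits, today, visits_per_day=10):
    """True when the site's visit count exceeds the threshold for its age."""
    age_days = (today - established).days
    return visits > visit_threshold(age_days, visits_per_day)
```

Tying the threshold to the site's age avoids favouring old sites that accumulated visits slowly over sites that are newer but heavily used.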
In one embodiment, the search extraction unit includes:
the first searching and extracting module is used for searching a preset abbreviation format in the abstract of the paper and extracting abbreviation names in the abbreviation format and complete medical entity names corresponding to the abbreviation names before the abbreviation format;
a replacement module for replacing the abbreviated name in the paper with the full medical entity name;
and the second searching and extracting module is used for searching the preset medical entity names in the abstract with the abbreviation name replaced and extracting the medical entity names with the preset association relation.
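The abbreviation handling above can be illustrated with a small sketch. It assumes one particular abbreviation format, a full name immediately followed by its capitalised abbreviation in parentheses, with one word per abbreviation letter; the actual preset format is not given in the application, and `ABBR_PATTERN`, `find_abbreviations`, and `expand_abbreviations` are hypothetical names.

```python
import re

# Assumed abbreviation format: "full name (ABBR)", ABBR being 2+ capital letters.
ABBR_PATTERN = re.compile(r"\(([A-Z]{2,})\)")

def find_abbreviations(text):
    """Map each abbreviation to the words that precede its parenthesised form."""
    mapping = {}
    for match in ABBR_PATTERN.finditer(text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(abbr):]  # one preceding word per letter
        initials = "".join(w[0].upper() for w in candidate)
        if initials == abbr:
            mapping[abbr] = " ".join(candidate)
    return mapping

def expand_abbreviations(text):
    """Drop the "(ABBR)" definitions and replace later uses with the full name."""
    for abbr, full in find_abbreviations(text).items():
        text = text.replace(f" ({abbr})", "")
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return text
```

After expansion, every mention in the abstract uses the complete medical entity name, so the later dictionary lookup does not miss entities that the authors referred to only by abbreviation.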
In one embodiment, the search extraction unit includes:
the division module is used for dividing the document data in sentence units;
the extraction module is used for extracting the names of the medical entities in each sentence;
the first execution module is used for extracting, if two medical entity names appear in the same sentence, the two medical entity names in the sentence as medical entity names with a preset association;
and the second execution module is used for taking, if more than two medical entity names appear in the same sentence, a first medical entity name of a preset type as the subject, pairing it in turn with each of the other, second medical entity names to obtain a plurality of groups of medical entity names with association relations, and extracting them.
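The sentence-level pairing rules can be sketched as below. The entity dictionary, the choice of "disease" as the preset subject type, the punctuation-based sentence splitting, and the naive substring matching are all simplifying assumptions made for illustration; in the application, entity names are extracted with the BERT-based matching described elsewhere.

```python
import re

# Hypothetical entity dictionary: name -> type.
ENTITY_TYPES = {"diabetes": "disease", "metformin": "drug", "insulin": "drug"}

def split_sentences(text):
    """Split on sentence-ending punctuation and drop empty fragments."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def entities_in(sentence):
    """Dictionary entities mentioned in the sentence (naive substring match)."""
    lowered = sentence.lower()
    return [name for name in ENTITY_TYPES if name in lowered]

def extract_pairs(text, subject_type="disease"):
    """Apply the two-entity rule and the subject-pairing rule per sentence."""
    pairs = []
    for sentence in split_sentences(text):
        found = entities_in(sentence)
        if len(found) == 2:
            pairs.append(tuple(found))
        elif len(found) > 2:
            subjects = [e for e in found if ENTITY_TYPES[e] == subject_type]
            if subjects:
                subject = subjects[0]
                pairs.extend((subject, other) for other in found if other != subject)
    return pairs
```

Pairing everything with a single subject, rather than forming all combinations, keeps the number of extracted relations linear in the number of entities per sentence and centres them on the entity type of interest.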
In one embodiment, the extracting module includes:
the coding sub-module is used for carrying out semantic coding on the characters in each sentence by utilizing the pre-training model BERT;
the similarity calculation submodule is used for searching, among the semantic codes, for a first semantic code whose similarity to the semantic code of a preset medical entity name is greater than a preset similarity threshold and is the maximum;
and the conversion sub-module is used for converting the name corresponding to the first semantic code into the preset medical entity name whose semantic code it matched.
The units, modules, sub-modules, and the like described above are devices for performing the method for predicting medical hotspots based on the FM model, and are not described in detail again here.
Referring to fig. 3, an embodiment of the present application further provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data such as document data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the method for predicting research hotspots based on the FM model described in any of the embodiments above.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device to which the present application is applied.
The embodiment of the application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for predicting research hotspots based on the FM model described in any of the above embodiments.
Those skilled in the art will appreciate that all or part of the processes of the above methods may be implemented by a computer program instructing the relevant hardware; the program may be stored on a non-transitory computer readable storage medium and, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (6)

1. A method for predicting a medical hotspot based on an FM model, comprising:
acquiring two medical entity names to be predicted;
according to the two medical entity names and the feature format of a preset prediction model of a medical research hotspot, compiling prediction features suitable for the prediction model, wherein the prediction model is a model obtained based on FM model training, the prediction features are sparse vectors, the value of the position corresponding to the medical entity name in the sparse vectors is 1, and the rest is 0;
inputting the prediction characteristics into the prediction model for calculation to obtain a prediction probability value, wherein the prediction probability value is used for representing the correlation between two medical entity names, and the larger the prediction probability value is, the stronger the correlation between the two medical entity names is;
judging whether the predicted probability value is larger than a preset threshold value or not;
if yes, judging that the two medical entity names are combined together to form a medical research hotspot;
the step of writing the prediction features applicable to the prediction model according to the two medical entity names and the feature format of the prediction model of the preset medical research hotspot comprises the following steps:
acquiring literature data recorded with medical knowledge;
searching preset medical entity names in the literature data, and extracting preset association relations among the searched medical entity names in the literature data;
writing positive sample data according to the extracted medical entity names with association relations and the characteristic formats; constructing negative sample data of medical entity names which have the same format as positive sample data and have no association relation, wherein the construction of the negative samples adopts a mode of random combination and sampling among entities;
training the model based on the FM model by utilizing the positive sample data and the negative sample data to obtain the prediction model for outputting a prediction probability value;
the step of acquiring literature data bearing medical knowledge includes:
searching a medical paper website in the Internet;
if a medical paper website is found, acquiring the establishment time and the number of visits of the medical paper website;
calculating the length of time between the establishment time and the current time;
judging whether the number of visits is greater than a count threshold corresponding to the length of time;
if yes, downloading titles and abstracts of papers from a medical paper website, and taking the titles and abstracts as the literature data;
the step of searching the preset medical entity names in the literature data and extracting the preset association relation of the searched medical entity names in the literature data comprises the following steps:
searching a preset abbreviation format in the abstract of the paper, and extracting the abbreviation name in the abbreviation format and the complete medical entity name corresponding to the abbreviation name before the abbreviation format;
replacing the abbreviated name in the paper with the full medical entity name;
searching the preset medical entity names in the abstract with the abbreviation name substitution, and extracting the medical entity names with preset association relations.
2. The method for predicting medical hotspots based on the FM model according to claim 1, wherein the step of searching for a preset medical entity name in the literature data and extracting a preset association relationship between each searched medical entity name in the literature data comprises the steps of:
dividing the document data in sentence units;
extracting the names of the medical entities in each sentence;
if two medical entity names appear in the same sentence, extracting the two medical entity names in the sentence as medical entity names with preset association;
if more than two medical entity names appear in the same sentence, taking a first medical entity name of a preset type as the subject, pairing it in turn with each of the other, second medical entity names to obtain a plurality of groups of medical entity names with association relations, and extracting them.
3. The method for predicting a medical hotspot based on an FM model of claim 2, wherein the step of extracting the names of the medical entities in each sentence comprises:
semantic coding is carried out on the characters in each sentence by utilizing a pre-training model BERT;
searching a first semantic code with the similarity larger than a preset similarity threshold and the maximum similarity from the semantic codes;
and converting the name corresponding to the first semantic code into the preset medical entity name whose semantic code it matched.
4. A prediction apparatus for medical research hotspots based on an FM model, for implementing the prediction method for medical hotspots based on an FM model as claimed in any one of claims 1 to 3, comprising:
a first obtaining unit for obtaining two medical entity names to be predicted;
the writing unit is used for writing the prediction features applicable to the prediction model according to the two medical entity names and the feature format of the prediction model of the preset medical research hotspot, wherein the prediction model is a model obtained based on FM model training, the prediction features are sparse vectors, the values of the positions corresponding to the medical entity names in the sparse vectors are 1, and the rest are 0;
the computing unit is used for inputting the prediction characteristics into the prediction model to calculate to obtain a prediction probability value, wherein the prediction probability value is used for representing the correlation between the two medical entity names, and the larger the prediction probability value is, the stronger the correlation between the two medical entity names is;
the judging unit is used for judging whether the predicted probability value is larger than a preset threshold value or not;
the determining unit is used for determining that the two medical entity names combined together form a medical research hotspot if the predicted probability value is larger than the preset threshold value;
a second acquisition unit configured to acquire document data in which medical knowledge is recorded;
the searching and extracting unit is used for searching preset medical entity names in the literature data and extracting preset association relations among the searched medical entity names in the literature data;
a sample generating unit, which is used for writing positive sample data according to the extracted medical entity names with association relations and the feature format, and for constructing negative sample data, in the same format as the positive sample data, from medical entity names that have no association relation, wherein the negative samples are constructed by randomly combining and sampling entities;
and the training unit is used for training the model based on the FM model by utilizing the positive sample data and the negative sample data to obtain the prediction model for outputting the prediction probability value.
5. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 3 when the computer program is executed.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 3.
CN202010621766.7A 2020-06-30 2020-06-30 Medical hotspot prediction method and device based on FM model and computer equipment Active CN111782821B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010621766.7A CN111782821B (en) 2020-06-30 2020-06-30 Medical hotspot prediction method and device based on FM model and computer equipment
PCT/CN2020/118914 WO2021139271A1 (en) 2020-06-30 2020-09-29 Fm model based method and apparatus for predicting medical hot spot, and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010621766.7A CN111782821B (en) 2020-06-30 2020-06-30 Medical hotspot prediction method and device based on FM model and computer equipment

Publications (2)

Publication Number Publication Date
CN111782821A CN111782821A (en) 2020-10-16
CN111782821B true CN111782821B (en) 2023-12-19

Family

ID=72761426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010621766.7A Active CN111782821B (en) 2020-06-30 2020-06-30 Medical hotspot prediction method and device based on FM model and computer equipment

Country Status (2)

Country Link
CN (1) CN111782821B (en)
WO (1) WO2021139271A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312268A (en) * 2021-07-29 2021-08-27 北京航空航天大学 Intelligent contract code similarity detection method
CN114218361A (en) * 2021-11-12 2022-03-22 杭州未名信科科技有限公司 Medical path recommendation method and system based on medical research literature

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214245A (en) * 2011-07-12 2011-10-12 厦门大学 Graph theory analysis method of research hot spots based on co-occurrence of keywords
CN108614867A (en) * 2018-04-12 2018-10-02 科技部科技评估中心 Frontline technology sex index computational methods based on scientific paper and system
CN110322323A (en) * 2019-07-02 2019-10-11 拉扎斯网络科技(上海)有限公司 Entity methods of exhibiting, device, storage medium and electronic equipment
CN110555103A (en) * 2019-07-22 2019-12-10 中国人民解放军总医院 Construction method and device of biomedical entity display platform and computer equipment
CN111047406A (en) * 2019-12-12 2020-04-21 北京思特奇信息技术股份有限公司 Telecommunication package recommendation method, device, storage medium and equipment
CN111291568A (en) * 2020-03-06 2020-06-16 西南交通大学 Automatic entity relationship labeling method applied to medical texts

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566360B2 (en) * 2010-05-28 2013-10-22 Drexel University System and method for automatically generating systematic reviews of a scientific field
EP2469421A1 (en) * 2010-12-23 2012-06-27 British Telecommunications Public Limited Company Method and apparatus for processing electronic data
US10516906B2 (en) * 2015-09-18 2019-12-24 Spotify Ab Systems, methods, and computer products for recommending media suitable for a designated style of use
US11049041B2 (en) * 2018-04-26 2021-06-29 Adobe Inc. Online training and update of factorization machines using alternating least squares optimization
US11250347B2 (en) * 2018-06-27 2022-02-15 Microsoft Technology Licensing, Llc Personalization enhanced recommendation models
CN109670054B (en) * 2018-12-26 2020-11-10 医渡云(北京)技术有限公司 Knowledge graph construction method and device, storage medium and electronic equipment
CN111191136A (en) * 2019-12-30 2020-05-22 华为技术有限公司 Information recommendation method and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Factorization Machines for Knowledge Tracing; Jill-Jênn Vie; 《arXiv:1805.00356v1 [cs.IR]》; pp. 1-4 *

Also Published As

Publication number Publication date
CN111782821A (en) 2020-10-16
WO2021139271A1 (en) 2021-07-15

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40033517; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant