CN113076432B - Literature knowledge context generation method, device and storage medium - Google Patents

Literature knowledge context generation method, device and storage medium

Info

Publication number
CN113076432B
CN113076432B (application number CN202110480081.XA)
Authority
CN
China
Prior art keywords
entity
acquiring
document
standard
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110480081.XA
Other languages
Chinese (zh)
Other versions
CN113076432A (en)
Inventor
林桂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110480081.XA priority Critical patent/CN113076432B/en
Publication of CN113076432A publication Critical patent/CN113076432A/en
Application granted granted Critical
Publication of CN113076432B publication Critical patent/CN113076432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/381Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using identifiers, e.g. barcodes, RFIDs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Abstract

The invention relates to artificial intelligence and discloses a literature knowledge context generation method comprising the following steps: classifying the documents to be detected by label, and acquiring the class label set corresponding to the documents to be detected; acquiring query information and, based on the query information, acquiring the target document range corresponding to it within the documents to be detected; extracting entities from the target documents in the target document range to obtain all standard entity references in the target documents; acquiring, based on the standard entity references and the class label set, the class labels and the standard entity reference set corresponding to the target documents; and forming the literature knowledge context corresponding to the query information based on the category labels and the standard entity reference set. The invention can comb the knowledge context of related documents and, on that basis, recommend corresponding content to the user for guidance according to the combed knowledge context and the user's expectations.

Description

Literature knowledge context generation method, device and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, an apparatus, an electronic device, and a computer readable storage medium for generating a literature knowledge context.
Background
At present, self-service scientific research information service platforms developed for college staff provide literature-based information mining and analysis services for researchers. Using such a service, researchers can deeply and comprehensively understand the current state of research on a problem of interest, extract research data on experts and research institutions in a specific field, and grasp the latest disciplinary dynamics and funded research hotspots. For example, AMiner, independently developed by Tsinghua University, uses data mining and social-network analysis and mining technology to provide researchers with semantic information extraction, topic discovery, trend analysis and similar functions, and supplies comprehensive domain knowledge, targeted research topics and collaborator information.
However, most existing scientific research information service platforms support only Chinese literature analysis and interpretation, the PubMed literature is not fully indexed, the focus is generally on the computer field, and literature research hotspots are not deeply mined. In general, existing domestic products have functional defects of varying degrees in academic mining and scholar search; the more obvious and common problem is that they are not aimed specifically at documents in the medical field, so they lack verticality and, inevitably, professionalism in medical-field mining and research.
Disclosure of Invention
The invention provides a literature knowledge context generation method, a device, an electronic device and a computer readable storage medium, and mainly aims to provide a reliable scheme for generating professional literature knowledge context of medicine and the like.
In order to achieve the above object, the present invention provides a method for generating a context of literature knowledge, comprising:
Classifying labels of documents to be detected, and acquiring a class label set corresponding to the documents to be detected;
acquiring query information, and acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
extracting entities from the target documents within the target document range to obtain all standard entity references in the target documents;
based on the standard entity references and the class label set, acquiring the class labels and the standard entity reference set corresponding to the target documents;
and forming the literature knowledge context corresponding to the query information based on the category labels and the standard entity reference set.
Optionally, the step of obtaining all standard entity references in the target document includes:
acquiring all entity references corresponding to the target document based on a pre-trained entity identification model;
and linking the entity references to the standard atlas based on an entity linking technology, and acquiring standard entity references corresponding to the entity references.
Optionally, the step of obtaining a standard entity reference corresponding to the entity reference includes:
Acquiring a synonymous information item corresponding to the entity reference based on the entity reference, and determining a reference item set based on the entity reference and the synonymous information item;
searching a candidate entity item set corresponding to the reference item set in a preset knowledge base based on the reference item set;
extracting dimension reduction features of the reference item set and the candidate entity item set respectively;
performing similarity calculation on the dimension reduction features of the reference item set and the candidate entity item set, and ranking all entities in the candidate entity item set according to the scores obtained by the similarity calculation;
and determining an entity set corresponding to the entity reference based on the ranking result, wherein the entities in the entity set serve as the standard entity references.
Optionally, the extracting the dimension reduction features of the reference item set and the candidate entity item set respectively includes:
Acquiring Word2Vec values of all entities in the reference item set and the candidate entity item set;
Based on the Word2Vec value, acquiring a TF-IDF value of the entity corresponding to the Word2Vec value;
multiplying the TF-IDF value as a weight by the word vector of the entity to obtain the dimension reduction features of the reference item set and the candidate entity item set.
Optionally, the step of classifying the labels of the documents to be detected and acquiring the class label set corresponding to the documents to be detected includes:
acquiring literature data with classification labels as a training data set;
Training an MLG-Bert model based on the training data until the MLG-Bert model is converged within a preset range to form a document classification model;
and acquiring a category label set corresponding to the document to be detected based on the document classification model.
Optionally, the formula for multiplying the TF-IDF value as a weight by the word vector of the entity is expressed as:
doc_emb = Σ_i TF-IDF(word_i) · Word2Vec(word_i)
wherein doc_emb represents the dimension reduction feature of the reference item set / the candidate entity item set, word_i represents the ith entity in the reference item set / the candidate entity item set, TF-IDF(word_i) represents the TF-IDF value of the ith entity, and Word2Vec(word_i) represents the Word2Vec word vector of the ith entity.
In order to solve the above-mentioned problems, the present invention also provides a literature knowledge context generating apparatus, the apparatus comprising:
the category label set acquisition unit is used for carrying out label classification on the to-be-detected documents and acquiring category label sets corresponding to the to-be-detected documents;
The target document range acquisition unit is used for acquiring query information and acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
the standard entity reference acquisition unit is used for extracting entities from the target documents within the target document range so as to acquire all standard entity references in the target documents;
the category label and standard entity reference set acquisition unit is used for acquiring the category label and standard entity reference set corresponding to the target documents based on the standard entity references and the category label set;
And the literature knowledge context forming unit is used for forming literature knowledge context corresponding to the query information based on the category label and the standard entity reference set.
Optionally, the step of obtaining all standard entity references in the target document includes:
acquiring all entity references corresponding to the target document based on a pre-trained entity identification model;
and linking the entity references to the standard atlas based on an entity linking technology, and acquiring standard entity references corresponding to the entity references.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps in the above-described document knowledge context generation method.
In order to solve the above-described problems, the present invention also provides a computer-readable storage medium having stored therein at least one instruction that is executed by a processor in an electronic device to implement the above-described literature knowledge context generation method.
According to the embodiment of the invention, the documents to be detected are classified by label, and the class label set corresponding to the documents to be detected is obtained; the target document range corresponding to the query information is acquired within the documents to be detected based on the query information; entities are extracted from the target documents to obtain all standard entity references in the target documents; based on the standard entity references and the class label set, the class labels and the standard entity reference set corresponding to the target documents are acquired; and the literature knowledge context corresponding to the query information is formed based on the category labels and the standard entity reference set. Massive medical and other literature is mined and understood through artificial intelligence and natural language processing technology, providing a scientific research knowledge context service for researchers. The category labels and entity reference sets of the literature are acquired through underlying algorithm technologies such as named entity recognition and extraction, literature multi-label classification and entity recommendation, and knowledge context navigation matching the user's expectations is provided according to these labels and sets, covering literature to entity and displaying trends from the surface down to individual points, making it more convenient for users to gain a systematic, general understanding of the research field.
Drawings
FIG. 1 is a flow chart of a method for generating context of document knowledge according to an embodiment of the invention;
FIG. 2 is a schematic block diagram of a document knowledge context generating apparatus according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an internal structure of an electronic device for implementing a method for generating a context of literature knowledge according to an embodiment of the present invention;
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a literature knowledge context generation method. Referring to fig. 1, a flow chart of a method for generating a context of literature knowledge according to an embodiment of the present invention is shown. The method may be performed by an apparatus, which may be implemented in software and/or hardware.
In this embodiment, the literature knowledge context generation method includes: classifying the documents to be detected by label, and acquiring the class label set corresponding to the documents to be detected; acquiring query information, and acquiring the target document range corresponding to the query information within the documents to be detected based on the query information and the category label set; extracting entities from the target documents within the target document range to obtain all standard entity references in the target documents; based on the standard entity references and the class label set, acquiring the class labels and the standard entity reference set corresponding to the target documents; and forming the literature knowledge context corresponding to the query information based on the category labels and the standard entity reference set.
Specifically, the steps of the above-described literature knowledge context generation method are described in detail below.
S110: and classifying the labels of the documents to be detected, and acquiring a class label set corresponding to the documents to be detected.
The method for classifying the labels of the documents to be detected and obtaining the class label sets corresponding to the documents to be detected comprises the following steps:
s111: acquiring literature data with classification labels as a training data set;
S112: training an MLG-Bert model based on the training data until the MLG-Bert model is converged within a preset range to form a document classification model;
S113: and acquiring a category label set corresponding to the document to be detected based on the document classification model.
Specifically, documents may be pre-labeled by category based on the PubMed-supported medical subject headings MeSH (Medical Subject Headings, a tool widely used in medical information retrieval); for example, they may be classified into 3 primary labels, basic, diagnosis and treatment, with more than 20 secondary labels constructed on the basis of the primary classification. In this way, more than 10 million data items with classification labels can be obtained as a training set, and a BERT+GCN model architecture is adopted to predict unlabeled data, thereby constructing literature class labels for the full PubMed corpus. The documents to be detected carry no classification labels, so their category labels need to be acquired through the document classification model. The category labels may comprise a primary label, a secondary label, a tertiary label and so on, and each level further comprises label types such as basic, diagnosis and treatment. All category labels of the documents to be detected can be acquired through detection, and the category label set corresponding to those documents is formed from them. Steps S111 to S113 are mainly the training process of the document classification model: the input of the model is document data with classification labels (their own or manually annotated), the output is the predicted classification labels of the corresponding documents, and the training result of the model can be judged by comparing the predicted labels with the originally annotated labels until the accuracy meets the requirements. Note that training an MLG-Bert model is not the only way to obtain the class label set of the documents to be detected; other models can also be used to obtain the desired category labels.
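The label-classification step above can be sketched as follows. This is a minimal stand-in, not the patent's MLG-Bert model: since the model weights are not published, simple keyword rules (entirely hypothetical) substitute for the trained classifier, and only the label names named in the text (basic, diagnosis, treatment) are used.

```python
# Hypothetical stand-in for the trained document classification model.
# Keyword rules replace MLG-Bert; the rule contents are illustrative only.
KEYWORD_RULES = {
    "treatment": ["therapy", "treatment", "surgery", "drug"],
    "diagnosis": ["diagnosis", "diagnostic", "imaging"],
    "basic":     ["mechanism", "gene", "pathway"],
}

def classify_document(text: str) -> set:
    """Return the category-label set for one document to be detected."""
    words = text.lower()
    return {label for label, kws in KEYWORD_RULES.items()
            if any(kw in words for kw in kws)}

labels = classify_document("Gene pathway analysis guiding drug therapy")
```

A real implementation would replace `classify_document` with a forward pass of the trained model; the interface (document text in, label set out) is the same.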
As an example, the category label set includes the results of label classification of all documents to be detected, namely the corresponding category labels, and the category labels at least include a primary label, a secondary label and a tertiary label; the primary label at least comprises basic, diagnosis and treatment, and the secondary label at least comprises drug treatment, surgical treatment, interventional treatment, general treatment, other treatments, prognosis and the like; the application does not limit the grade or number of class labels.
S120: acquiring query information, and acquiring a target document range corresponding to the query information in the documents to be detected based on the query information and the category label set.
In particular, the query information may be understood as related information of documents that the user needs to search, and the query information may be various information such as a summary, a title, or other keywords. Because the documents to be detected are already subjected to label classification and the corresponding category labels are obtained, in the process of determining the range of the target documents, the category labels can be judged and screened according to specific query information input by a user in the documents to be detected through the query information so as to obtain the corresponding range of the target documents, and the range of the target documents can comprise a preset number of target documents and can be specifically set according to requirements.
In addition, in the process of screening the category labels (corresponding to the documents to be detected) by the query information, the screening can be performed by presetting certain judgment rules or by similarity calculation, among other modes; the invention is not particularly limited in this regard.
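The screening step S120 can be illustrated with a minimal sketch. The similarity rule here is plain term overlap between the query and document titles, which is an assumption: the patent leaves the exact judgment rule open, and the document records below are hypothetical.

```python
# Illustrative sketch of step S120: screening the documents to be detected
# by query information to obtain a target document range of preset size.
def target_document_range(query: str, documents: list, top_n: int = 2) -> list:
    """Rank documents by term overlap with the query and keep the top n."""
    q_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_terms & set(d["title"].lower().split())),
        reverse=True,
    )
    return scored[:top_n]

docs = [
    {"title": "heart transplantation outcomes", "labels": ["treatment"]},
    {"title": "gene pathway review", "labels": ["basic"]},
    {"title": "heart failure diagnosis", "labels": ["diagnosis"]},
]
hits = target_document_range("heart transplantation", docs)
```

The `top_n` parameter corresponds to the preset number of target documents mentioned in the text.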
S130: extracting medical named entities from the target documents within the target document range to obtain all standard entity references in the target documents.
In this step, the process of obtaining all standard entity references in the target document further comprises the steps of:
S131: acquiring all entity references corresponding to the target document based on a pre-trained entity identification model;
S132: linking the entity references to the standard atlas based on an entity linking technique, and acquiring the standard entity references corresponding to the entity references.
As a specific example, the Ping An medical knowledge graph is a product that associates multidimensional databases of the medical field through knowledge graph technology and provides users with massive specialized medical knowledge. It integrates 1 million core medical terms, 10 million medical terms and 16 million medical relations, achieving comprehensive aggregation of knowledge data across the medical ecosystem; it covers core medical concepts such as diseases, drugs, tests, examinations, operations, genes and departments, and provides personalized solutions based on accurate medical knowledge for each role in the clinical pathway.
In addition, the entity recognition model can be a medical named entity recognition model built on a BioBERT-based deep learning model. BioBERT is a publicly available pre-trained language model for the biomedical field that incorporates information from tens of millions of biomedical and general-domain documents. In the invention, the basic semantic information of the text is encoded through BioBERT, task-specific features are then learned through a bidirectional LSTM layer, and finally the entity sequence is optimized through a CRF layer. Through this model, entity references in the text of the target document can be acquired and then linked to standard atlas concepts through an entity linking technique based on text similarity.
For example, the text acquired from the target literature (which may also be referred to as the literature to be examined; hereinafter the same) may contain a non-standard mention of heart transplantation; since that mention is not a standard term and the standard designation on the standard atlas is "heart transplantation", the mention can be linked to "heart transplantation" through an entity reference link.
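The linking of a non-standard mention to its standard atlas concept can be sketched as a simple alias lookup. This is a minimal illustration under stated assumptions: the alias table below is entirely hypothetical, whereas a real system would draw synonyms from the knowledge graph and fall back to the similarity-based linking described later.

```python
# Minimal sketch of linking a raw mention to a standard entity reference.
# The alias table is hypothetical; real data comes from the knowledge graph.
STANDARD_ALIASES = {
    "cardiac transplant": "heart transplantation",
    "heart graft": "heart transplantation",
}

def link_to_standard(mention: str) -> str:
    """Return the standard entity reference for a raw mention."""
    m = mention.lower().strip()
    return STANDARD_ALIASES.get(m, m)
```

Mentions without a known alias pass through unchanged, mirroring the fallback to treating the mention itself as the reference item.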
Specifically, in the literature knowledge context generation method of the present invention, the query information may be the Title and Abstract of an article, and the documents to be detected are predicted with the MLG-Bert model. For example, the Title and Abstract of the article may be input into the MLG-Bert model (Title representing the title, Abstract representing the abstract), and the BioBERT component of the model generates an overall vector representation. BioBERT is pre-trained on a biomedical text corpus, which, compared with a BERT model pre-trained on a general corpus, reduces the negative influence of word-distribution shift. A CNN (convolutional neural network) layer is added after the BioBERT component so that features can be better extracted and combined, and finally the category labels (Labels) of the documents are output through a dot product. A GCN layer is then added as an embedding network layer for the labels, which enhances the nonlinearity of the model through embedding-value inputs of node features.
In particular, the model may be trained using binary cross entropy as the loss function. Finally, the model is used to predict articles without MeSH terms, so that these articles are also brought into the classification system for subsequent document analysis.
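The binary cross entropy loss for multi-label classification treats each label as an independent binary decision. A toy numerical sketch (the probabilities and targets below are made-up values, not model outputs):

```python
# Toy illustration of binary cross entropy over the label dimension,
# as used to train a multi-label document classifier.
import math

def bce_multilabel(probs: list, targets: list) -> float:
    """Mean binary cross entropy: one BCE term per label."""
    eps = 1e-12  # numerical guard against log(0)
    terms = [-(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
             for p, t in zip(probs, targets)]
    return sum(terms) / len(terms)

loss = bce_multilabel([0.9, 0.2, 0.5], [1.0, 0.0, 1.0])
```

The loss decreases toward zero as the predicted probabilities approach the target labels.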
Specifically, the step of obtaining a standard entity reference corresponding to the entity reference includes:
S1321: acquiring a synonymous information item corresponding to the entity reference based on the entity reference, and determining a reference item set based on the entity reference and the synonymous information item;
S1322: searching a candidate entity item set (entity recall) corresponding to the reference item set in a preset knowledge base based on the reference item set;
S1323: extracting dimension reduction features of the reference item set and the candidate entity item set respectively;
The process of extracting the dimension reduction features of the reference item set and the candidate entity item set performs dimension reduction on each set separately; the resulting dimension reduction features facilitate the subsequent similarity calculation and simplify the calculation process.
In addition, the reference item set and the candidate entity item set may also be reduced in dimension in other ways, for example by filtering, random forest, principal component analysis, reverse feature elimination and other dimension reduction modes, which the present invention does not limit.
S1324: performing similarity calculation on the dimension reduction features of the reference item set and the candidate entity item set, and ranking all entities in the candidate entity item set according to the scores obtained by the similarity calculation;
In this step, the similarity is calculated as the cosine similarity:
cos(x, y) = (x · y) / (‖x‖ ‖y‖)
wherein x and y respectively represent entity vector characterizations of different word vectors; the entity vector characterizations include surface features and deep features. All entities are ranked according to the similarity score, and the higher the score, the higher the rank of the corresponding entity.
S1325: and determining an entity set corresponding to the entity index based on the sorting result, wherein the entity in the entity set is used as the standard entity index.
In this step, the top preset number of entities by similarity score may be taken to form the entity set, and the entities in the entity set serve as the standard entity references. The preset number may be set according to requirements; in the present invention it may be set to 5.
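Steps S1324 and S1325 can be sketched directly: cosine similarity between the mention's dimension-reduced feature and each candidate's feature, then a top-k cut (k = 5 in the text; k = 1 in the toy call below). The candidate vectors are illustrative values only.

```python
# Sketch of steps S1324-S1325: cosine similarity plus a top-k cut.
import math

def cosine(x: list, y: list) -> float:
    """Cosine similarity cos(x, y) = (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def rank_candidates(mention_vec: list, candidates: dict, k: int = 5) -> list:
    """candidates: {entity_name: vector}; return top-k names by similarity."""
    scored = sorted(candidates,
                    key=lambda e: cosine(mention_vec, candidates[e]),
                    reverse=True)
    return scored[:k]

top = rank_candidates([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0]}, k=1)
```

The zero-norm guard in `cosine` avoids division by zero for empty feature vectors.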
Specifically, entity linking (Entity Linking, EL) mainly refers to the process of unambiguously and correctly pointing an entity object identified in free text (such as a person name, place name or organization name) to the target entity in a knowledge base. In popular terms, entity linking predicts, given an existing knowledge base, which knowledge-base id a certain entity of the input query corresponds to. It mainly comprises two parts: entity recall and entity ranking.
The two most prominent steps in the entity linking technique described above are entity recall and entity ranking. Entity recall, namely candidate entity set generation, recalls from the knowledge base as many entities as possible related to the reference items already present in the text of the document to be detected; this process requires a high recall rate. Specifically, word vectors can be trained on the text and the cosine similarity between the word vector of a term and the word vectors in the text computed; for example, the threshold may be set to about 0.56, and terms scoring above the threshold are taken as synonyms of the term.
Entity ranking mainly ranks the candidate entity item set to obtain the most probable entity. For example, in the process of extracting the dimension reduction features of the reference item set and the candidate entity item set: the Word2Vec values of all entities in the reference item set and the candidate entity item set can be obtained respectively; based on the Word2Vec value, the TF-IDF value of the entity corresponding to that Word2Vec value is acquired; and the TF-IDF value, used as a weight, is multiplied by the word vector of the entity to obtain the dimension reduction feature of the entities in the reference item set and the candidate entity item set.
wherein Word2Vec represents the word vector corresponding to an entity and the TF-IDF value represents term frequency-inverse document frequency. The formula for multiplying the TF-IDF value as a weight by the word vector of the entity is expressed as:
doc_emb = Σ_i TF-IDF(word_i) · Word2Vec(word_i)
wherein doc_emb represents the dimension reduction feature of the reference item set / the candidate entity item set, word_i represents the ith entity in the reference item set / the candidate entity item set, TF-IDF(word_i) represents the TF-IDF value of the ith entity, and Word2Vec(word_i) represents the Word2Vec word vector of the ith entity.
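The doc_emb formula above can be worked through numerically: each entity's word vector is scaled by its TF-IDF weight and the scaled vectors are summed. The toy weights and two-dimensional vectors below are made up for illustration; real Word2Vec vectors are high-dimensional.

```python
# Worked sketch of doc_emb = sum_i TF-IDF(word_i) * Word2Vec(word_i).
def doc_embedding(tfidf: dict, word2vec: dict) -> list:
    """TF-IDF-weighted sum of word vectors for one item set."""
    dim = len(next(iter(word2vec.values())))
    emb = [0.0] * dim
    for word, weight in tfidf.items():
        for j, v in enumerate(word2vec[word]):
            emb[j] += weight * v
    return emb

emb = doc_embedding({"heart": 0.5, "transplant": 2.0},
                    {"heart": [1.0, 0.0], "transplant": [0.0, 1.0]})
```

The result is a single fixed-dimension vector per item set, which is what makes the subsequent cosine similarity between the reference item set and each candidate cheap to compute.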
S140: based on the standard entity references and the class label set, acquiring the class labels and the standard entity reference set corresponding to the target documents;
S150: forming the literature knowledge context corresponding to the query information based on the category labels and the standard entity reference set.
In the above steps S140 and S150, after the entity references of the target documents within the target document range are determined, the corresponding standard entity references can be determined from all those entity references. Then, according to the category labels of the target documents corresponding to the user's query information, all category labels and all standard entity references corresponding to the target documents can be obtained from the category label set formed over the documents to be detected, thereby forming the category label set and the standard entity reference set corresponding to the target documents.
Furthermore, based on the category labels and the standard entity reference set, the category labels can be organized hierarchically. For example, the category labels may comprise first-level, second-level and third-level labels, and the standard entity references are grouped under each label level: say four entity references under a first-level label, several more under a second-level label, and so on, until all category labels and standard entity references have been classified, forming the document knowledge context corresponding to the query information.
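As an illustration of this grouping of entity references under hierarchical labels (the label names and entity references below are hypothetical; the patent does not prescribe a data structure):

```python
def build_context(records):
    """Group standard entity references under first-level / second-level
    category labels, yielding a nested dict used as the knowledge context."""
    context = {}
    for level1, level2, entity in records:
        context.setdefault(level1, {}).setdefault(level2, []).append(entity)
    return context

# Hypothetical (first-level label, second-level label, entity reference) triples
records = [
    ("treatment", "drug therapy", "aspirin"),
    ("treatment", "surgical therapy", "heart transplantation"),
    ("diagnosis", "imaging", "echocardiography"),
]
context = build_context(records)
print(context["treatment"]["drug therapy"])  # -> ['aspirin']
```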
It should be noted that the hierarchical classification of the category labels may be performed by preset rules or by a conventional label classification method; the classification method is not particularly limited here.
It can be seen that the document knowledge context generation method provided by the invention realizes classification of documents and entity extraction from the classified documents, and finally recommends, within each document category, the several entities that best match the user's expectation, for the user to navigate and form a knowledge context. By means of underlying algorithmic techniques such as named entity recognition and extraction, multi-label document classification and entity recommendation, the invention provides users with knowledge context navigation that matches their expectations, covering document-entity relations and presenting trends from the broad view down to individual points. This makes it easier for users to gain a systematic, general understanding of the field to be researched; with artificial intelligence and natural language processing, massive medical literature can be mined and understood, providing researchers with a scientific-research knowledge context service.
Corresponding to the literature knowledge context generation method, the invention further provides a literature knowledge context generation device.
Fig. 2 shows a functional block diagram of the document knowledge context generating apparatus of the present invention.
As shown in fig. 2, the document knowledge context generating apparatus 200 of the present invention may be installed in an electronic device. Depending on the functions implemented, the literature knowledge context generating apparatus may comprise: a category tag set acquisition unit 210, a target document range acquisition unit 220, a standard entity reference acquisition unit 230, a category tag and standard entity reference set acquisition unit 240, and a document knowledge context formation unit 250. A unit, also referred to herein as a module, is a series of computer program segments stored in the memory of the electronic device that can be executed by its processor and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
A category tag set acquiring unit 210, configured to perform tag classification on a document to be detected, and acquire a category tag set corresponding to the document to be detected;
A target document range obtaining unit 220, configured to obtain query information, and obtain a target document range corresponding to the query information in the document to be detected based on the query information and the category tag set;
The standard entity reference acquiring unit 230 is configured to perform entity extraction on the target document within the target document range, so as to acquire all standard entity references in the target document.
In this unit, the process of obtaining all the standard entity references in the target document further comprises the steps of:
s131: acquiring all entity references corresponding to the target document based on a pre-trained entity recognition model;
s132: and linking the entity references to the standard atlas based on an entity linking technology, and acquiring standard entity references corresponding to the entity references.
As a specific example, the Ping An medical knowledge graph is a product that links multi-dimensional databases in the medical field through knowledge graph technology and provides users with massive specialized medical knowledge. It integrates one million core medical concepts, ten million medical terms and sixteen million medical relations, achieving comprehensive aggregation of knowledge data across the medical ecosystem. It covers core medical concepts such as diseases, drugs, laboratory tests, examinations, operations, genes and departments, and provides each role in the clinical pathway with personalized solutions based on accurate medical knowledge.
In addition, the medical named entity recognition model may employ a BioBERT-based deep learning model. BioBERT is a publicly available language pre-training model for the medical and biotechnology field, trained on tens of millions of documents from the medical/biotechnology and general domains. In the invention, the model encodes the basic semantic information of the text (the text of the document to be detected; likewise hereinafter) through BioBERT, then learns task-specific features through a bidirectional LSTM layer, and finally optimizes the entity tag sequence through a CRF layer. Through this model, entity references in the text of the document to be detected can be obtained, and these references can then be linked to standard-map concepts through a text-similarity-based entity linking technique.
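The BioBERT-BiLSTM-CRF pipeline itself is not reproduced here, but its output is a per-token tag sequence; decoding entity mentions from such a BIO tag sequence can be sketched as follows (the tokens, tags and the PROC label are hypothetical):

```python
def decode_mentions(tokens, tags):
    """Collect entity mentions from a BIO tag sequence, e.g. as emitted by a CRF layer."""
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):        # beginning of a new mention
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:  # continuation of the current mention
            current.append(token)
        else:                            # O tag (or stray I-) ends the mention
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

tokens = ["Patient", "underwent", "heart", "transplant", "surgery"]
tags = ["O", "O", "B-PROC", "I-PROC", "O"]
print(decode_mentions(tokens, tags))  # -> ['heart transplant']
```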
For example, suppose the subject mention obtained from the text is "heart transplant". Since "heart transplant" is not the standard term, and the standard name on the standard map is "heart transplantation", the mention "heart transplant" can be linked to "heart transplantation" through entity linking.
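A minimal illustration of linking a mention to the closest standard term by text similarity (the candidate list is hypothetical, and difflib's ratio stands in for the patent's similarity measure):

```python
from difflib import SequenceMatcher

def link_to_standard(mention, standard_terms):
    """Pick the standard-map term whose surface text is most similar to the mention."""
    return max(standard_terms, key=lambda term: SequenceMatcher(None, mention, term).ratio())

standard_terms = ["heart transplantation", "kidney transplantation", "heart failure"]
print(link_to_standard("heart transplant", standard_terms))  # -> 'heart transplantation'
```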
Specifically, in the document knowledge context generating method of the present invention, the query information may be the title and abstract of an article, and the documents to be detected are predicted using an MLG-Bert model. For example, the Title and Abstract of an article may be input into the MLG-Bert model, where Title represents the title and Abstract represents the abstract; the BioBERT component of the model generates an overall vector representation. BioBERT is pre-trained on a corpus of biomedical text, which, compared with the BERT pre-training model trained on a general corpus, reduces the negative influence caused by word-distribution shift. A CNN (convolutional neural network) layer is added after the BioBERT component of the model so that features can be better extracted and combined, and finally the category labels (Labels) of the document are output through a dot product. A GCN layer is then added as the embedding network layer of the labels, which enhances the non-linear capability of the model by taking the embedding values of node features as input.
In particular, the model may be trained using binary cross entropy as the loss function. Finally, the trained model is used to predict articles without MeSH terms, so that these articles are also placed into the classification system for subsequent document analysis.
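For reference, the binary cross-entropy loss over independent multi-label probabilities (the label vector and predicted probabilities below are illustrative) can be computed as:

```python
import math

def bce_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over independent per-label probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

loss = bce_loss([1, 0, 1], [0.9, 0.2, 0.8])
print(round(loss, 4))  # -> 0.1839 (near 0 for near-perfect predictions)
```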
Specifically, the step of obtaining a standard entity reference corresponding to the entity reference includes:
s1321: acquiring a synonymous information item corresponding to the entity reference based on the entity reference, and determining a reference item set based on the entity reference and the synonymous information item;
s1322: searching a candidate entity item set (entity recall) corresponding to the index item set in a preset knowledge base based on the index item set;
S1323: extracting dimension reduction features of the reference item set and the candidate entity item set respectively;
Performing dimension reduction on the reference item set and the candidate entity item set respectively to obtain the corresponding dimension-reduced features facilitates the subsequent similarity calculation and simplifies the computation.
S1324: performing similarity calculation on the dimension reduction features of the index item set and the candidate entity item set, and sequencing all entities in the candidate entity item set according to scores obtained by the similarity calculation;
In this step, the similarity may be calculated as the cosine similarity:
cos(x, y) = (x · y) / (‖x‖ ‖y‖)
wherein x and y respectively represent entity vector characterizations of different word vectors; the entity vector characterizations include surface features and deep features. All entities are ranked according to the similarity score: the higher the score, the higher the corresponding entity ranks.
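A plain implementation of the cosine similarity used for this ranking (the vectors below are hypothetical):

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = x . y / (|x| * |y|)"""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # -> 1.0 (identical directions)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # -> 0.0 (orthogonal vectors)
```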
S1325: and determining an entity set corresponding to the entity index based on the sorting result, wherein the entity in the entity set is used as the standard entity index.
In this step, the top preset number of entities by similarity score may be taken to form the entity set, and the entities in the entity set serve as the standard entity references. The preset number may be set as required; in the present invention, it may be set to 5.
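Selecting the preset number of top-ranked candidates (5, as the text suggests) can be sketched as follows, with hypothetical candidate names and scores:

```python
def top_k_entities(scored_candidates, k=5):
    """Keep the k candidates with the highest similarity scores, best first."""
    ranked = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    return [entity for entity, score in ranked[:k]]

candidates = [("term A", 0.91), ("term B", 0.45), ("term C", 0.78),
              ("term D", 0.66), ("term E", 0.30), ("term F", 0.84)]
print(top_k_entities(candidates))
# -> ['term A', 'term F', 'term C', 'term D', 'term B']
```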
The two most important steps in the entity linking technique described above are entity recall and entity ranking. Entity recall, i.e. the generation of the candidate entity set, recalls from the knowledge base as many entities as possible that are related to the reference terms already present in the text of the document to be detected; this process requires a high recall rate. Specifically, word vectors can be trained on the text, and the cosine similarity between the word vector of a term and the word vectors in the text can be calculated; for example, the threshold may be set to about 0.56, and a term whose similarity exceeds the threshold is regarded as a synonym of the original term.
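The threshold-based synonym recall can be sketched as follows (the 0.56 threshold comes from the text; the toy vocabulary, 2-d vectors and similarity function are placeholders for trained word vectors):

```python
import math

def recall_synonyms(term_vec, vocab_vecs, similarity, threshold=0.56):
    """Recall vocabulary words whose vector similarity to the term exceeds the threshold."""
    return [word for word, vec in vocab_vecs.items() if similarity(term_vec, vec) > threshold]

def cos(x, y):
    """Cosine similarity over 2-d vectors."""
    dot = x[0] * y[0] + x[1] * y[1]
    return dot / (math.hypot(*x) * math.hypot(*y))

vocab = {"cardiac transplant": (0.9, 0.1), "renal failure": (0.1, 0.9)}
print(recall_synonyms((1.0, 0.0), vocab, cos))  # -> ['cardiac transplant']
```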
Entity ranking ranks the candidate entity item set using various kinds of evidence so as to obtain the most probable entity. For example, in the process of extracting the dimension-reduced features of the reference item set and the candidate entity item set: the Word2Vec values of all entities in the reference item set and the candidate entity item set are acquired; based on each Word2Vec value, the TF-IDF value of the entity corresponding to that Word2Vec value is acquired; the TF-IDF value is multiplied, as a weight, by the word vector of the entity to obtain the dimension-reduced features of the entities in the reference item set and the candidate entity item set.
Wherein Word2Vec represents the word vector corresponding to an entity, and the TF-IDF value represents term frequency-inverse document frequency. The formula for multiplying the TF-IDF value, as a weight, by the word vector of the entity is expressed as:
doc_emb = ∑ TF-IDF(word_i) · Word2Vec(word_i)
wherein doc_emb represents the dimension-reduced feature of the reference item set / candidate entity item set, word_i represents the i-th entity in the reference item set / candidate entity item set, TF-IDF(word_i) represents the TF-IDF value of the i-th entity, and Word2Vec(word_i) represents the Word2Vec word vector of the i-th entity.
A category label and standard entity reference set obtaining unit 240, configured to obtain a category label and standard entity reference set corresponding to the target document based on the standard entity reference and the category label set;
a document knowledge context forming unit 250, configured to form a document knowledge context corresponding to the query information based on the category label and the standard entity reference set.
In the above units 240 and 250, after the entity references of the target documents within the target document range are determined, the corresponding standard entity references can be determined from all the entity references of the target documents. Then, according to the category labels of the target documents matching the user's query information, all the category labels and all the standard entity references corresponding to those documents can be obtained from the category label set built over the documents to be detected, thereby forming the category label (set) and the standard entity reference set corresponding to the target documents.
Furthermore, based on the category labels and the standard entity reference set, the category labels can be organized hierarchically. For example, the category labels may comprise first-level, second-level and third-level labels, and the standard entity references are grouped under each label level: say four entity references under a first-level label, several more under a second-level label, and so on, until all category labels and standard entity references have been classified, forming the document knowledge context corresponding to the query information.
It should be noted that the hierarchical classification of the category labels may be performed by preset rules or by a conventional label classification method; the classification method is not particularly limited here.
It should be noted that, the embodiments of the document knowledge context generating apparatus may refer to descriptions in the embodiments of the document knowledge context generating method, which are not described herein in detail.
Fig. 3 is a schematic structural diagram of an electronic device for implementing the method for generating the literature knowledge context according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a document knowledge context generating program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disks, multimedia cards, card-type memories (e.g., SD or DX memory), magnetic memories, magnetic disks, optical disks, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the document knowledge context generating program, but also to temporarily store data that has been output or is to be output.
The processor 10 may in some embodiments consist of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit (Control Unit) of the electronic device; it connects the components of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device 1 and processes its data by running or executing the programs or modules stored in the memory 11 (e.g., the document knowledge context generation program) and invoking the data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection and communication between the memory 11, the at least one processor 10, and other components.
Fig. 3 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a display, an input unit such as a keyboard (Keyboard), or a standard wired interface or wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and are not limited to this configuration in the scope of the patent application.
The literature knowledge context generation program 12 stored in the memory 11 in the electronic device 1 is a combination of a plurality of instructions, which when executed in the processor 10, may implement:
Classifying labels of documents to be detected, and acquiring a class label set corresponding to the documents to be detected;
acquiring a target document range corresponding to the query information in the documents to be detected, based on the query information pre-acquired from the user and the category label set; and at the same time,
Extracting the entity of the target document in the target document range to obtain all standard entity names in the target document;
based on the standard entity references and the category label set, acquiring the category labels and the standard entity reference set corresponding to the target document;
and forming the literature knowledge context corresponding to the query information based on the category labels and the standard entity reference set.
Optionally, the step of obtaining all standard entity designations in the target document includes:
acquiring all entity references corresponding to the target document based on a pre-trained entity identification model;
and linking the entity references to the standard atlas based on an entity linking technology, and acquiring standard entity references corresponding to the entity references.
Optionally, the step of obtaining a standard entity reference corresponding to the entity reference includes:
Acquiring a synonymous information item corresponding to the entity reference based on the entity reference, and determining a reference item set based on the entity reference and the synonymous information item;
searching a candidate entity item set corresponding to the index item set in a preset knowledge base based on the index item set;
extracting dimension reduction features of the reference item set and the candidate entity item set respectively;
Performing similarity calculation on the dimension reduction features of the index item set and the candidate entity item set, and sorting all entities in the candidate entity item set according to the score obtained by the similarity calculation;
And determining an entity set corresponding to the entity index based on the sorting result, wherein the entity in the entity set is used as the standard entity index.
Optionally, the extracting the dimension-reduced features of the reference item set and the candidate entity item set respectively includes:
Acquiring Word2Vec values of all entities in the reference item set and the candidate entity item set;
Based on the Word2Vec value, acquiring a TF-IDF value of the entity corresponding to the Word2Vec value;
multiplying the TF-IDF value, as a weight, by the word vector of the entity to obtain the dimension-reduced features of the reference item set and the candidate entity item set.
Optionally, the step of classifying the labels of the documents to be detected and acquiring the class label set corresponding to the documents to be detected includes:
acquiring literature data with classification labels as a training data set;
Training an MLG-Bert model based on the training data until the MLG-Bert model is converged within a preset range to form a document classification model;
and acquiring a category label set corresponding to the document to be detected based on the document classification model.
Optionally, the formula for multiplying the TF-IDF value as a weight by the word vector of the entity is expressed as:
doc_emb = ∑ TF-IDF(word_i) · Word2Vec(word_i)
Wherein word_i represents an entity, TF-IDF(word_i) represents the TF-IDF value of the entity, and Word2Vec(word_i) represents the Word2Vec value of the entity.
Specifically, the specific implementation method of the above instructions by the processor 10 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The computer-readable storage medium has stored therein at least one instruction that is executed by a processor in an electronic device to implement the document knowledge context generation method described above.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names rather than any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (5)

1. A method for generating a context of literature knowledge, the method comprising:
Classifying labels of documents to be detected, and acquiring a class label set corresponding to the documents to be detected;
Acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
Extracting the entity of the target document to obtain all standard entity references in the target document;
based on the standard entity references and the category label set, acquiring the category labels and the standard entity reference set corresponding to the target document; the category labels at least comprise a first-level label, a second-level label and a third-level label; the first-level label at least comprises basics, diagnosis and treatment, and the second-level label at least comprises drug treatment, surgical treatment, interventional treatment, general treatment, other treatments and prognosis;
Forming literature knowledge context corresponding to the query information based on the category labels and the standard entity reference set; the step of obtaining all standard entity references in the target document comprises the following steps:
acquiring all entity references corresponding to the target document based on a pre-trained entity identification model;
Linking the entity names to a standard map based on an entity linking technology, and acquiring standard entity names corresponding to the entity names; the step of obtaining the standard entity reference corresponding to the entity reference comprises the following steps:
Acquiring a synonymous information item corresponding to the entity reference based on the entity reference, and determining a reference item set based on the entity reference and the synonymous information item;
searching a candidate entity item set corresponding to the index item set in a preset knowledge base based on the index item set;
Extracting surface features and deep features corresponding to the entity references according to the reference item set and the candidate entity item set;
performing similarity calculation on the surface layer features and the deep layer features to obtain the sequence of all entities in the candidate entity item set;
Determining an entity set corresponding to the entity index based on the ranking, wherein the entities in the entity set serve as the standard entity index; after obtaining the ranking of all entities in the candidate entity item set, further comprising:
Acquiring Word2Vec values of all entities in the candidate entity item set;
Based on the Word2Vec value, acquiring a TF-IDF value of the entity corresponding to the Word2Vec value;
and multiplying the TF-IDF value serving as a weight by the word vector of the entity to obtain Embedding corresponding to the document to be detected.
2. The document knowledge context generating method according to claim 1, wherein the step of classifying the documents to be detected and acquiring a category tag set corresponding to the documents to be detected comprises:
acquiring literature data with classification labels as a training data set;
Training an MLG-Bert model based on the training data until the MLG-Bert model is converged within a preset range to form a document classification model;
and acquiring a category label set corresponding to the document to be detected based on the document classification model.
3. A literature knowledge context generation apparatus for implementing the literature knowledge context generation method of any one of claims 1 or 2, the apparatus comprising:
the category label set acquisition unit is used for classifying labels of the documents to be detected and acquiring category label sets corresponding to the documents to be detected;
The target document range acquisition unit is used for acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
The standard entity index acquisition unit is used for carrying out entity extraction on the target document so as to acquire all standard entity indexes in the target document;
The category label and standard entity reference set acquisition unit is used for acquiring the category labels and the standard entity reference set corresponding to the target document based on the standard entity references and the category label set; the category labels at least comprise a first-level label, a second-level label and a third-level label; the first-level label at least comprises basics, diagnosis and treatment, and the second-level label at least comprises drug treatment, surgical treatment, interventional treatment, general treatment, other treatments and prognosis;
And the literature knowledge context forming unit is used for forming literature knowledge context corresponding to the query information based on the category label and the standard entity reference set.
4. An electronic device, the electronic device comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps in the literature knowledge context generation method of any one of claims 1 or 2.
5. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the document knowledge context generation method according to any one of claims 1 or 2.
CN202110480081.XA 2021-04-30 2021-04-30 Literature knowledge context generation method, device and storage medium Active CN113076432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110480081.XA CN113076432B (en) 2021-04-30 2021-04-30 Literature knowledge context generation method, device and storage medium


Publications (2)

Publication Number Publication Date
CN113076432A CN113076432A (en) 2021-07-06
CN113076432B true CN113076432B (en) 2024-05-03

Family

ID=76616126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110480081.XA Active CN113076432B (en) 2021-04-30 2021-04-30 Literature knowledge context generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113076432B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792117B (en) * 2021-08-30 2024-02-20 北京百度网讯科技有限公司 Method and device for determining data update context, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN109241278A (en) * 2018-07-18 2019-01-18 绍兴诺雷智信息科技有限公司 Scientific research knowledge management method and system
CN110457491A (en) * 2019-08-19 2019-11-15 中国农业大学 A kind of knowledge mapping reconstructing method and device based on free state node
CN111382276A (en) * 2018-12-29 2020-07-07 中国科学院信息工程研究所 Event development venation map generation method
CN111428036A (en) * 2020-03-23 2020-07-17 浙江大学 Entity relationship mining method based on biomedical literature

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9886525B1 (en) * 2016-12-16 2018-02-06 Palantir Technologies Inc. Data item aggregate probability analysis system



Similar Documents

Publication Publication Date Title
Sun et al. Sentiment analysis for Chinese microblog based on deep neural networks with convolutional extension features
Ling et al. Integrating extra knowledge into word embedding models for biomedical NLP tasks
JP2009093649A (en) Recommendation for term specifying ontology space
JP2009093650A (en) Selection of tag for document by paragraph analysis of document
Gao et al. An interpretable classification framework for information extraction from online healthcare forums
CN112000778A (en) Natural language processing method, device and system based on semantic recognition
CN114662477B (en) Method, device and storage medium for generating deactivated word list based on Chinese medicine dialogue
CN113076432B (en) Literature knowledge context generation method, device and storage medium
Wang et al. Personalizing label prediction for github issues
Yogarajan et al. Seeing the whole patient: using multi-label medical text classification techniques to enhance predictions of medical codes
Bitto et al. Sentiment analysis from Bangladeshi food delivery startup based on user reviews using machine learning and deep learning
Ding et al. Leveraging text and knowledge bases for triple scoring: an ensemble approach-the Bokchoy triple scorer at WSDM Cup 2017
Sharma et al. Fusion approach for document classification using random forest and svm
Zhang et al. Enhancing clinical decision support systems with public knowledge bases
Wang et al. Toxic comment classification based on bidirectional gated recurrent unit and convolutional neural network
US20220165430A1 (en) Leveraging deep contextual representation, medical concept representation and term-occurrence statistics in precision medicine to rank clinical studies relevant to a patient
CN113065355B (en) Professional encyclopedia named entity identification method, system and electronic equipment
CN112259254B (en) Case search method and device based on interactive feedback and readable storage medium
CN114996400A (en) Referee document processing method and device, electronic equipment and storage medium
Devkota et al. Knowledge of the ancestors: Intelligent ontology-aware annotation of biological literature using semantic similarity
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
CN107341169B (en) Large-scale software information station label recommendation method based on information retrieval
CN113688854A (en) Data processing method and device and computing equipment
Vahidnia et al. Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping.
Betteridge et al. Assuming facts are expressed more than once

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant