CN114547346B

CN114547346B - Knowledge graph construction method and device, electronic equipment and storage medium

Info

Publication number: CN114547346B
Application number: CN202210424985.5A
Authority: CN
Inventors: 杨涛; 袁首; 范伟; 刘寓非; 周永杰; 王旭; 彭瑀
Original assignee: Zhejiang Taimei Medical Technology Co Ltd
Current assignee: Zhejiang Taimei Medical Technology Co Ltd
Priority date: 2022-04-22
Filing date: 2022-04-22
Publication date: 2022-08-02
Anticipated expiration: 2042-04-22
Also published as: CN114547346A

Abstract

The application discloses a method and a device for constructing a knowledge graph, electronic equipment and a storage medium, wherein the method comprises the following steps: extracting candidate term texts from a historical CRF form, wherein the candidate term texts comprise the form, a form item and a check item; fusing the candidate term texts based on known standard terms to update term concept information of a knowledge graph; extracting candidate term relations based on the updated knowledge graph and the historical CRF forms, wherein the candidate term relations comprise form-form item corresponding relations and form-check item corresponding relations; updating term relationship information of the knowledge-graph based on the confidence of the candidate term relationships. The CRF form information used by the method for constructing the knowledge graph is relatively accurate, and the available knowledge graph can be quickly constructed.

Description

Knowledge graph construction method and device, electronic equipment and storage medium

Technical Field

The application belongs to the technical field of machine learning, and particularly relates to a method and a device for constructing a knowledge graph, electronic equipment and a storage medium.

Background

A Case Report Form (CRF) is a document designed according to the trial protocol. It records the data of each subject during the trial and can be used to provide clinical trial data to the research institution, the sponsor and the statistics department. An Electronic Data Capture (EDC) system for clinical trial data is a core information system suited to drug clinical trials, medical randomized controlled trials and medical cohort studies, and is mainly used to record subject information and form electronic follow-up forms. In EDC systems, an electronic CRF (eCRF) is often used instead of a paper CRF to collect and manage clinical trial data.

In recent years, with the development of machine learning technology, AI-based automatic CRF database building has emerged as a way to design and generate CRFs automatically, and the construction of a knowledge graph plays an important role in this automatic design and generation.

The information disclosed in this background section is only for enhancement of understanding of the general background of the application and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Disclosure of Invention

The application aims to provide a knowledge graph construction method that supports the automatic design of CRFs.

In order to achieve the above object, the present application provides a method for constructing a knowledge graph, the method comprising:

extracting candidate term texts from a historical CRF form, wherein the candidate term texts comprise a form, a form item and a check item;

fusing the candidate term texts based on known standard terms to update term concept information of a knowledge graph;

extracting candidate term relations based on the updated knowledge graph and the historical CRF forms, wherein the candidate term relations comprise form-form item corresponding relations and form-check item corresponding relations;

updating term relationship information of the knowledge-graph based on the confidence of the candidate term relationships.

In one or more embodiments of the present application, the candidate term text is fused based on the known standard term, which specifically includes:

and taking the known standard term as an initial clustering center, and fusing the candidate term texts by using a partition clustering method.

In one or more embodiments of the present application, the partitional clustering method is a K-means algorithm.

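In one or more embodiments of the present application, the candidate term text is fused based on the known standard term, which specifically includes:

calculating the similarity of the candidate term text and the known standard term, and fusing the candidate term text based on similarity ranking.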

In one or more embodiments of the present application, calculating the similarity between the candidate term text and the known standard term specifically includes:

calculating the semantic and/or glyph similarity of the candidate term text to the known standard terms based on edit distance and/or cosine similarity algorithms.

In one or more embodiments of the present application, the candidate term text is fused based on the known standard term, which specifically includes:

merging residual term texts in the candidate term texts by utilizing a hierarchical clustering algorithm to generate new term concept information, wherein the residual term texts are the candidate term texts which cannot be fused with known standard terms.

In one or more embodiments of the present application, the candidate term text is fused based on known standard terms, which specifically includes:

performing text vectorization on the candidate term text and the known standard terms to obtain a candidate term text representation and a standard term text representation;

fusing the candidate term text based on the candidate term text representation and the standard term text representation.

In one or more embodiments of the present application, the known standard terms include known terms in the Clinical Data Acquisition Standards Harmonization (CDASH) standard and/or the Study Data Tabulation Model (SDTM).

In one or more embodiments of the present application, extracting candidate term relationships based on the updated knowledge graph and the historical CRF forms specifically includes:

based on a preset template, performing pairwise context matching on term texts in the updated knowledge graph and in the historical CRF forms to extract candidate term relations; and/or,

and performing relation prediction on the updated knowledge graph and term texts in the historical CRF forms based on a pre-training model to extract candidate term relations.

In one or more embodiments of the present application, the preset template includes a manually written template and/or a statistical template.

In one or more embodiments of the present application, the method further comprises:

extracting term texts and term relations from a study flow chart and a full text of a historical clinical trial scheme respectively;

updating term concept information and term relationship information of the knowledge-graph based on the confidence levels of the term text and the term relationship;

wherein the research flow chart comprises a visit task information block.

In one or more embodiments of the present application, the term text and term relationships are extracted from a study flow chart of a historical clinical trial, including in particular:

analyzing the text of the research flow chart, and splitting out a visit task information block;

identifying a visit task from the visit task information block, and matching the visit task with a standard visit task in the knowledge graph to obtain a first candidate visit task set as a term text;

analyzing the examination items of the visit tasks in the first candidate visit task set as term texts, and generating corresponding relation information of the visit tasks and the examination items as term relations.

In one or more embodiments of the present application, the term text and term relationships are extracted from the full text of the historical clinical trial protocol, including in particular:

scanning the full text of the historical clinical trial protocol to obtain a visit task;

matching the visit task obtained by scanning with the standard visit task in the knowledge graph to obtain a second candidate visit task set as a term text;

analyzing the examination items of the visit tasks in the second candidate visit task set as term texts, and generating corresponding relation information of the visit tasks and the examination items as term relations.

In one or more embodiments of the present application, the method further comprises:

extracting protocol metadata from the historical clinical trial protocol;

and extracting the applicable conditions of the form, the form-form item corresponding relation and the form-check item corresponding relation based on the protocol metadata, so as to update the applicable condition information of the knowledge graph.

The present application further provides a device for constructing a knowledge graph, the device for constructing a knowledge graph includes:

the term text extraction module is used for extracting candidate term texts from the historical CRF forms, wherein the candidate term texts comprise the forms, the form items and the check items;

a term text updating module for fusing the candidate term texts based on known standard terms to update term concept information of a knowledge graph;

a term relation extraction module, configured to extract candidate term relations based on the updated knowledge graph and the historical CRF forms, where the candidate term relations include a form-form item correspondence and a form-check item correspondence;

and the term relation updating module is used for ranking the candidate term relations by confidence so as to update the term relation information of the knowledge graph.

The present application further provides an electronic device, including:

at least one processor; and

a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of knowledge-graph construction as described above.

The present application also provides a machine-readable storage medium having stored thereon executable instructions that, when executed, cause the machine to perform the method of constructing a knowledge-graph as described above.

Compared with the prior art, in the knowledge graph construction method of the present application, candidate term texts are extracted from historical CRF forms and fused under the guidance of the knowledge graph so as to update the term concept information of the knowledge graph; candidate term relations are then extracted from the updated knowledge graph and the historical CRF forms to update the term relation information of the knowledge graph. Because the information in CRF forms is relatively accurate, a usable knowledge graph can be constructed quickly.

In another aspect, information is extracted along three dimensions of a historical clinical trial protocol: the research flow chart, the full text and the protocol metadata. The term texts, the term relations and their applicable conditions obtained in this way are used to update the term concept information, term relation information and applicable condition information of the knowledge graph. This process is similar to the automatic CRF generation flow, and the style of the processed texts is consistent with the application environment; compared with the CRF forms, it provides more complete information and improves the coverage of the knowledge graph.

Drawings

FIG. 1 is a schematic diagram of an application scenario of a method and apparatus for constructing a knowledge graph according to an embodiment of the present application;

FIG. 2 is a logical framework diagram of a method of construction of a knowledge-graph according to an embodiment of the present application;

FIG. 3 is a flow diagram of a method of construction of a knowledge-graph according to an embodiment of the present application;

FIG. 4 is a block diagram of an apparatus for constructing a knowledge graph according to an embodiment of the present application;

FIG. 5 is a hardware configuration diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The present application will be described in detail below with reference to embodiments shown in the drawings. The application is not limited to these embodiments, and structural, methodological, or functional changes made by those skilled in the art according to the embodiments are included in the scope of the present disclosure.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The clinical trial protocol is a document describing how clinical trials will be conducted, including the goals, design, methodology, statistical considerations and the organization of the trial, while also providing context and reasons for conducting the study, the study problem to be solved, and considerations of ethical issues to ensure the safety of the participants and the integrity of the collected data.

In the stage from the clinical trial protocol to CRF design, the clinical data manager (DM) usually relies on working experience in the relevant areas (such as clinical trial phase, therapeutic area, etc.). For example, when that experience is lacking, it is difficult to produce a complete CRF, and multiple rounds of modification and adjustment are often required in the review stage. For some specific project tasks not defined in CDASH (Clinical Data Acquisition Standards Harmonization) or SDTM (Study Data Tabulation Model), there is no uniform execution standard (for example, oncology protocols need a series of forms for tumor history collection, tumor assessment, efficacy assessment, treatment tracking, etc.), so the final CRF varies considerably between projects and between DMs. In addition, because clinical trial protocol texts are long and irregularly written, key information for some visit tasks, such as time points, specific examination items and the differing requirements of different visits, can be scattered throughout the document (for example, the specific examination items of a blood biochemistry test may be mentioned in the inclusion criteria as well as in the visit flow chart, the flow chart notes and the study procedures), which poses a great challenge to the completeness of information collection.

The method is mainly applied in scenarios where an EDC system automatically generates the CRF using Artificial Intelligence (AI) technology. Specifically, referring to FIG. 1, the server may receive a clinical trial protocol and, through a machine learning model configured on it, extract and integrate information from the clinical trial protocol under the guidance of a knowledge graph to automatically generate an accurate and personalized CRF. The generated CRF can be transmitted to different terminal devices for display and reviewed by the relevant personnel (such as statisticians, medical staff, operations staff and researchers), who can propose modifications to help finalize the CRF. It should be understood that the server and the terminal device included in the scenario may be independent devices, or may be integrated in the same system (for example, an EDC system), which is not limited herein; the knowledge graph may be configured in a server for invocation, or alternatively, the knowledge graph may be stored in local storage or cloud storage with which the EDC system communicates and is invoked for use.

With reference to fig. 2 and 3, an embodiment of the method for constructing a knowledge graph of the present application will be described. In this embodiment, the method includes:

S11: Extracting candidate term texts from the historical CRF forms.

The candidate term text may include a form (Form), a form item (Item), and a check item (Index). Attributes of the form include the standard name, aliases, CDASH domain, etc. Attributes of the form item include the standard name, aliases, CDASH variable, SDTM object, and the like. Attributes of the check item include the standard name, aliases, check method, unit, standard range, and the like.

In CDASH, domains are classified according to data type, such as Medical History (MH), Laboratory Test Results (LB), Adverse Events (AE), and the like. For each field in a domain, a unified standard is provided across eight aspects: question description, prompt, SDTM object or CDASH variable name, BRIDG, definition, CRF completion instructions, supplemental information for sponsors, and core category.
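The three kinds of term text and their attributes can be pictured as simple records. The following is a minimal sketch only; the class and field names (`Form`, `FormItem`, `CheckItem`, `cdash_domain`, and so on) are hypothetical names chosen for illustration and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Form:
    """A form term, e.g. "adverse event" (field names are illustrative)."""
    standard_name: str
    aliases: List[str] = field(default_factory=list)
    cdash_domain: Optional[str] = None


@dataclass
class FormItem:
    """A form item term, e.g. the CDASH variable AESEV."""
    standard_name: str
    aliases: List[str] = field(default_factory=list)
    cdash_variable: Optional[str] = None
    sdtm_object: Optional[str] = None


@dataclass
class CheckItem:
    """A check item term with its measurement attributes."""
    standard_name: str
    aliases: List[str] = field(default_factory=list)
    check_method: Optional[str] = None
    unit: Optional[str] = None
    standard_range: Optional[str] = None


adverse_event = Form("adverse event", aliases=["AE"], cdash_domain="AE")
severity = FormItem("severity", cdash_variable="AESEV")
```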

Table I exemplarily shows the CDASH variables or SDTM objects to which three problem descriptions correspond for the adverse event domain.

Table I.

For clinical trial protocols, "adverse event" is a common type of visit task. Subjects are usually examined at multiple time points of the periodic visits, and the check items for adverse events may include clinical symptoms (such as nausea, fatigue, dizziness, abdominal pain, pruritus, etc.), signs (such as jaundice, rash, fever, etc.), diseases, abnormal laboratory examinations, and the like. Therefore, in the candidate term text extracted at this time, "form", "form item" and "check item" correspond to "adverse event", "AEYN/AESTDAT/AESEV" and "clinical symptom/sign/disease and laboratory examination abnormality", respectively.

S12: Fusing the candidate term texts based on the known standard terms to update term concept information of the knowledge graph.

Besides terms from the current knowledge graph, the known standard terms may include known terms in the clinical data acquisition standard CDASH and/or the study data tabulation model SDTM.

In the process of fusing the candidate term text, text vectorization can be firstly carried out on the candidate term text and the known standard term to obtain a candidate term text representation and a standard term text representation; and fusing the candidate term text based on the candidate term text representation and the standard term text representation.

In one embodiment, character-level or word-level text representations may be learned from the text of historical clinical trial protocols, and the candidate term text may then be matched against the characters or words to obtain the text representation corresponding to the candidate term text.

In another embodiment, an open-source pre-trained model based on a large-scale public-domain corpus may also be used to obtain character-level or word-level text representations, which is not described in detail here.

In a specific text vectorization process, vectors for each term text may be calculated based on a bag of words model (BOW).

As illustrated in Table II, a bag of words is first constructed; here it is built at the character level from the Chinese phrases "不良反应" (adverse reaction) and "非常严重" (very serious), giving the eight entries "不, 良, 反, 应, 非, 常, 严, 重". For the term text "不良反应" (adverse reaction), the value of each entry in the bag is computed (one-hot notation: 1 if present, 0 if absent), resulting in the vector [1, 1, 1, 1, 0, 0, 0, 0] for the term text.

Table II.
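The bag-of-words vectorization can be sketched as follows. This is a minimal illustration, assuming the bag is the character-level vocabulary formed by the Chinese phrases "不良反应" (adverse reaction) and "非常严重" (very serious); that vocabulary and the function name `bag_of_words_vector` are assumptions made for the example.

```python
def bag_of_words_vector(term, vocabulary):
    """Return a presence/absence vector of `term` over `vocabulary`.

    Each position is 1 if the vocabulary entry occurs in the term, else 0.
    """
    present = set(term)
    return [1 if token in present else 0 for token in vocabulary]


# Assumed character-level bag: the eight characters of the two phrases.
vocabulary = list("不良反应非常严重")

vector = bag_of_words_vector("不良反应", vocabulary)
print(vector)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

A presence/absence vector of this kind ignores character order and frequency, which is why the text also mentions deep-neural-network representations as an alternative.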

Of course, in alternative embodiments, text vectorization may also be performed in a manner based on a deep neural network, which is not described herein again.

In embodiments of the present application, candidate term text may be fused from three dimensions.

First, known standard terms are used as initial cluster centers, and the candidate term texts are fused using a partitional clustering method.

For each known standard term serving as an initial cluster center, the distance from each candidate term text to that center can be calculated. With n candidate term texts, there are n distance values for each initial cluster center; the distance values are sorted, the closest data points are found, the sum of the distances is calculated, and the process is iterated.

The term text that serves as the cluster center may change in each iteration. For example, in the initial cluster containing k candidate term texts, the cluster center is originally a known standard term, and at the next iteration one of the k candidate term texts may be chosen as the new cluster center. After several iterations the loss function converges and the final clustering result is obtained. The m candidate term texts in a final cluster can then be regarded as fused with one another.

In one embodiment, the partitional clustering method used herein is a K-means algorithm.
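A minimal sketch of K-means seeded with known standard terms follows. It assumes the term texts have already been vectorized, and it uses the standard K-means update (each center becomes the mean of its members) rather than electing one candidate text as the new center; the function names and the toy vectors are illustrative assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(points, initial_centers, max_iter=100):
    """K-means whose initial centers are the vectors of known standard terms.

    Returns (assignment, centers), where assignment[i] is the index of the
    cluster that points[i] was fused into.
    """
    centers = [list(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach every candidate to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no candidate changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Toy vectors: two standard terms and four candidate term texts.
standard_terms = [[1, 1, 0, 0], [0, 0, 1, 1]]
candidates = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
assignment, centers = seeded_kmeans(candidates, standard_terms)
```

Candidates that end up with the same assignment index would be treated as fused under the standard term that seeded that cluster.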

Second, the similarity between the candidate term text and the known standard term is calculated, and the candidate term text is fused based on similarity ranking.

The similarity calculation can be based on different features, so a suitable feature type is first selected for the candidate term text and the known standard term before the features are constructed.

For example, the two features "semantic" and "glyph" can be selected for the candidate term text and the known standard terms, and feature construction can be performed based on the semantics and/or the glyphs.

In a specific similarity calculation, the semantic and/or glyph similarity of the candidate term text to the known standard term may be calculated based on an edit distance and/or cosine similarity algorithm.

For example, if the term text is a word, the edit distance is the minimum number of edits required to convert word A into word B, and the allowed edit operations may include "replace", "insert", "delete", and the like. If word A and word B are two variants of "adverse reaction event" that differ by a single character, one operation (deleting that character) converts one into the other, so the edit distance between word A and word B is 1.

Generally, the smaller the edit distance, the greater the degree of similarity of the two terms text; conversely, the greater the edit distance, the less similar the text of the two terms. For a certain known standard term, candidate term texts which can be finally fused can be determined based on a preset distance threshold or similarity ranking.
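The edit distance and the threshold-based fusion can be sketched as below. This is a minimal illustration using the classic dynamic-programming (Levenshtein) formulation; the helper `fuse_by_edit_distance`, its `max_distance` parameter and the English sample terms are assumptions for the example.

```python
def edit_distance(a, b):
    """Minimum number of replace/insert/delete operations turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def fuse_by_edit_distance(candidates, standard_term, max_distance=1):
    """Rank candidates by distance to the standard term; keep the close ones."""
    scored = sorted((edit_distance(c, standard_term), c) for c in candidates)
    return [c for distance, c in scored if distance <= max_distance]


fused = fuse_by_edit_distance(
    ["adverse events", "adverse event", "medical history"],
    "adverse event",
)
```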

The cosine similarity algorithm calculates the cosine of the angle between the text representations (word vectors) corresponding to two terms. The closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the terms corresponding to the two text representations are. Similarly, the candidate term texts that can finally be fused can be determined based on a preset cosine threshold or similarity ranking.

Third, residual term texts in the candidate term texts are merged using a hierarchical clustering algorithm to generate new term concept information.
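For reference, the cosine of the angle between two text representations u and v is their dot product divided by the product of their norms. A minimal sketch, assuming plain Python lists as vectors and treating an all-zero vector as having similarity 0 (a choice made here for the example):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


same = cosine_similarity([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0])
orthogonal = cosine_similarity([1, 0], [0, 1])
partial = cosine_similarity([1, 1, 0, 0], [1, 0, 0, 0])
```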

The residual term text is a candidate term text that cannot be fused with known standard terms, and further verification can be performed manually for the new term concept information here.

The hierarchical clustering algorithm creates a hierarchical nested cluster tree by calculating the similarity between the residual term texts. In the cluster tree, each residual term text is a leaf at the lowest level, and the top level is the root node of a cluster.

In an embodiment, the cluster tree may be created bottom-up by merging. Specifically, each residual term text is first treated as its own class; the distance (e.g., Euclidean distance) between classes is then calculated, and the two closest classes are merged into one. Distances are recalculated for the new class, the two closest classes are merged again, and these steps are repeated.

For the process of cluster merging, a termination condition can be set to obtain relatively ideal new term concept information. For example, the termination condition may be "the number of merges reaches N", "the number of new term concepts obtained by clustering reaches M", or the like.
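The bottom-up merging with a termination condition can be sketched as follows. This is a minimal illustration assuming vectorized residual term texts, Euclidean distance, single linkage (the distance between two classes is the distance between their closest members) and the termination condition "the number of clusters reaches M"; the linkage choice and all names are assumptions for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, vectors):
    """Single-linkage distance: closest pair of members across two clusters."""
    return min(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)


def agglomerative(vectors, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain."""
    target_clusters = max(1, target_clusters)
    clusters = [[i] for i in range(len(vectors))]  # every text starts alone
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


residual_vectors = [[0, 0], [0, 1], [5, 5], [5, 6]]
new_concepts = agglomerative(residual_vectors, target_clusters=2)
```

Each resulting cluster of indices would correspond to one proposed new term concept, which the text notes can then be verified manually.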

S13: Extracting candidate term relations based on the updated knowledge graph and the historical CRF forms.

Candidate term relationships include form-to-form item correspondence and form-to-check item correspondence.

For example, if in the term concepts of the knowledge graph and the historical CRF forms, "form", "form item" and "check item" correspond to "adverse event", "AEYN/AESTDAT/AESEV" and "clinical symptom/sign/disease and laboratory examination abnormality", respectively, then "form-form item" corresponds to "adverse event-AEYN/AESTDAT/AESEV" and "form-check item" corresponds to "adverse event-clinical symptom/sign/disease and laboratory examination abnormality".

In an embodiment, candidate term relations may be extracted based on a preset template: term texts in the updated knowledge graph and in the historical CRF forms are context-matched pairwise, and a candidate term relation is extracted if the correspondence defined by the preset template is satisfied.

The preset template may be a manually written template and/or a statistical template. The manually written template can be used to judge whether term concepts (entities) are related in context, and the statistical template can be extracted with the help of a search engine.

In one embodiment, the candidate term relations may be extracted by performing relation prediction on term texts in the updated knowledge graph and in the historical CRF forms based on a pre-trained model. The relation prediction may, for example, be performed after or jointly with Named Entity Recognition (NER).
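Template-based pairwise context matching can be pictured with the sketch below. It is illustrative only: the two templates in `TEMPLATES`, the sample sentence and the function name are hypothetical, and real manually written or statistical templates would be richer than simple phrase patterns.

```python
import re
from itertools import permutations

# Hypothetical manually written templates; {head} and {tail} are term slots.
TEMPLATES = ["{head} includes {tail}", "{tail} is recorded in {head}"]


def extract_relations(text, terms, templates=TEMPLATES):
    """Return sorted (head, tail) pairs whose context matches any template."""
    relations = set()
    lowered = text.lower()
    for head, tail in permutations(terms, 2):  # every ordered pair of terms
        for template in templates:
            pattern = template.format(
                head=re.escape(head.lower()), tail=re.escape(tail.lower())
            )
            if re.search(pattern, lowered):
                relations.add((head, tail))
    return sorted(relations)


text = "Adverse event includes AESEV. Hemoglobin is recorded in blood routine."
terms = ["adverse event", "AESEV", "blood routine", "hemoglobin"]
relations = extract_relations(text, terms)
```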

S14: Updating term relation information of the knowledge graph based on the confidence of the candidate term relations.

The confidence is part of the output of the candidate term relation extraction algorithm in step S13. For example, when a deep neural network (pre-trained model) is used for relation prediction, the activation function of the output layer is typically Softmax, so the output can be regarded as the probability of belonging to each class (relation) in a multi-class classification, and this probability can be used as the confidence.

In one embodiment, a confidence threshold may be set for a candidate term relationship, and when the confidence of the candidate term relationship is greater than the confidence threshold, the candidate term relationship is considered as "true", and term relationship information of the knowledge graph is updated.
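The Softmax confidence and the threshold test can be sketched as follows. The logits, the relation labels and the threshold of 0.8 are made-up values for illustration; the application does not specify them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def accept_relation(logits, labels, threshold=0.8):
    """Return (label, confidence) if the top class clears the threshold, else None."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return (labels[best], confidence) if confidence > threshold else None


labels = ["form-form item", "form-check item", "no relation"]
accepted = accept_relation([4.0, 0.0, 0.0], labels)
rejected = accept_relation([1.0, 1.0, 1.0], labels)
```

In this sketch a relation that is not accepted is simply dropped; routing it to manual review instead would be an equally plausible design.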

In steps S11 to S14, the historical CRF forms are mainly used to extract and process the related information, so as to update the term concept information and the term relation information of the knowledge graph. The information in CRF forms is relatively accurate and convenient to process, so the basic data of the knowledge graph can be constructed quickly.

The following steps describe how the term concept information and term relation information in the knowledge graph are further supplemented using historical clinical trial protocols, on the basis of the updated knowledge graph.

S15: Extracting term texts and term relations from the study flow chart and the full text of the historical clinical trial protocol, respectively.

In particular, the extraction of term texts and term relations from the study flow chart and from the full text is described separately below.

① Study flow chart

Table III demonstrates some of the information in a common study flow chart.

Table III.

As can be seen, the visit task information blocks in Table III include demographics/medical history, adverse events, and quality of life. The term texts corresponding to the visit task information blocks can be added to the term concepts in the knowledge graph.

Specifically, a historical clinical trial protocol may first be structurally analyzed to locate the study flow chart therein; and then, performing text analysis on the research flow chart, and splitting an information block of the visit task.

Taking the study flow chart shown in Table III above as an example, the text parsing process consists of parsing and analyzing a table structure, where the header line(s) correspond to the visit task information block.

Next, the visit tasks can be identified from the visit task information block and matched with the standard visit tasks in the knowledge graph to obtain a first candidate visit task set as term texts (Form).

It is understood that the standard visit tasks in the knowledge graph are essentially part of the known standard terms at this point.

In one embodiment, the visit tasks can be identified and extracted, based on the knowledge graph, through NLP techniques such as word segmentation, named entity recognition, multi-pattern matching, text vectorization and similarity calculation.

For each visit task in the first candidate visit task set, the NLP techniques mentioned above for identifying visit tasks can likewise be used to parse the check items of that visit task from the visit task information block; these check items also serve as term texts (check item, Index). Meanwhile, visit task-check item correspondence information is generated as a term relation (form-check item correspondence, Form-Index).

For example, for the "blood routine" visit task, the items to be measured include red blood cell count (RBC), hemoglobin (Hb), white blood cells (WBC), white blood cell differential count (pbd) and platelets (PLT), and the generated term relation can be expressed as "blood routine-RBC/Hb/WBC/PLT".
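Matching recognized visit tasks against standard visit tasks and emitting visit task-check item relations can be sketched as below. As a stand-in for the NLP techniques named in the text, the sketch uses fuzzy string matching from Python's standard `difflib` module; the list `STANDARD_VISIT_TASKS`, the 0.8 cutoff and the parsed input are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical standard visit tasks already present in the knowledge graph.
STANDARD_VISIT_TASKS = ["adverse event", "blood routine", "medical history"]


def match_visit_task(raw_name, standard_tasks=STANDARD_VISIT_TASKS, cutoff=0.8):
    """Return the closest standard visit task, or None if nothing is close."""
    matches = get_close_matches(raw_name.lower(), standard_tasks, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def build_relations(parsed_block):
    """Map each matched visit task to the check items parsed under it."""
    relations = {}
    for raw_name, check_items in parsed_block.items():
        task = match_visit_task(raw_name)
        if task is not None:
            relations.setdefault(task, []).extend(check_items)
    return relations


def format_relation(task, check_items):
    """Render a relation in the "task-item/item/..." form used in the text."""
    return f"{task}-{'/'.join(check_items)}"


parsed_block = {
    "Blood routine": ["RBC", "Hb", "WBC", "PLT"],
    "Unlisted procedure": ["X"],
}
relations = build_relations(parsed_block)
rendered = format_relation("blood routine", relations["blood routine"])
```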

② the whole history of clinical test scheme

For the whole history clinical test scheme, the method can be used for scanning the whole history clinical test scheme to obtain an interview task; similarly, matching the visit task obtained by scanning with a standard visit task in a knowledge graph spectrum to obtain a second candidate visit task set as a term text; and analyzing the examination items of the visit tasks in the second candidate visit task set as term texts, and using the generated visit task-examination item corresponding relation information as term relations.

The retrieval of the visit task and the analysis of the examination items in the full-text historical clinical experimental scheme can be realized by the NLP technology such as word segmentation, named entity recognition, multi-mode matching algorithm, text vectorization and similarity calculation and the like based on the knowledge graph, and the details are not repeated here.

And S16, updating term concept information and term relation information of the knowledge-graph based on the confidence of the term text and the term relation.

Updating the term concept information and term relationship information of the knowledge graph based on historical clinical trial protocols resembles the process of generating a CRF by machine learning. The information obtained is more complete than that obtained directly from CRF forms, the style of the processed text is consistent with the application environment, and the coverage of the knowledge graph can be improved.
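One possible way to realize a confidence-based update is sketched below. This passage does not define how confidence is computed; the sketch assumes, purely for illustration, that the confidence of a relationship is the share of historical protocols in which it was observed, and that relationships below a hypothetical `min_confidence` are discarded. The function names are likewise hypothetical.

```python
from collections import Counter


def rank_by_confidence(observations, min_confidence=0.5):
    """Rank (form, check_item) pairs by the share of documents containing them."""
    counts = Counter()
    for doc_relations in observations:
        # Count each relation at most once per document.
        counts.update(set(doc_relations))
    total = len(observations)
    scored = [(rel, n / total) for rel, n in counts.items()]
    kept = [item for item in scored if item[1] >= min_confidence]
    # Highest confidence first; ties broken alphabetically for determinism.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


def update_graph(graph, ranked):
    """Add accepted relations to a dict-of-sets graph keyed by form."""
    for (form, item), _confidence in ranked:
        graph.setdefault(form, set()).add(item)
    return graph


observations = [
    [("blood routine", "RBC"), ("blood routine", "WBC")],
    [("blood routine", "RBC")],
    [("blood routine", "RBC"), ("blood routine", "ESR")],
    [("blood routine", "WBC")],
]
ranked = rank_by_confidence(observations)
graph = update_graph({}, ranked)
```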

For a knowledge graph, applicable condition information may also be present therein. The applicable condition information may include three parts: forms, form-to-form item correspondence, and form-to-check item correspondence. In one embodiment, protocol metadata may be extracted from the historical clinical trial protocol, and applicable conditions for the form, the form-to-form item correspondence, and the form-to-exam item correspondence may be extracted based on the protocol metadata.

The protocol metadata may include the trial field, the trial phase, the indication, and the like. Taking a medical history form as an example, the contents of the corresponding form items and check items differ according to the specific disease under study (protocol metadata); in this case, the protocol metadata can be used to extract the applicable conditions of the form-form item correspondence and the form-check item correspondence. As another example, a study in a tumor-related trial field (protocol metadata) requires a series of tumor evaluation forms and forms related to subsequent anti-tumor treatment to be configured; in this case, the protocol metadata can be used to extract the applicable conditions of the form.

In an embodiment, the protocol metadata in the historical clinical trial protocol may be extracted in a structured manner, and the extraction of the applicable conditions from the protocol metadata may also be implemented with the knowledge-graph-based NLP techniques described above, such as word segmentation, named entity recognition, multi-pattern matching, text vectorization, and similarity calculation; the details are not repeated here.
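The relationship between protocol metadata and applicable conditions can be pictured with a small data structure. The following is only a sketch under stated assumptions: the classes `ProtocolMetadata` and `FormNode`, the attribute names, and the example values ("oncology", "cardiology") are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class ProtocolMetadata:
    trial_field: str
    trial_phase: str
    indication: str


@dataclass
class FormNode:
    name: str
    # Maps a metadata attribute name to the set of values the form applies to;
    # an attribute that is absent places no restriction on the form.
    applicable: dict = field(default_factory=dict)

    def applies_to(self, meta: ProtocolMetadata) -> bool:
        return all(getattr(meta, attr) in allowed
                   for attr, allowed in self.applicable.items())


def select_forms(forms, meta):
    """Return the names of the forms whose applicable conditions are met."""
    return [f.name for f in forms if f.applies_to(meta)]


forms = [
    FormNode("medical history"),
    FormNode("tumor evaluation", {"trial_field": {"oncology"}}),
]
oncology = ProtocolMetadata("oncology", "phase II", "lung cancer")
cardiology = ProtocolMetadata("cardiology", "phase III", "hypertension")
```

The same pattern could be attached to form-form item and form-check item relationships rather than to forms alone.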

Referring to FIG. 4, an embodiment of the apparatus for constructing a knowledge graph of the present application will be described. In this embodiment, the construction device of the knowledge graph includes a term text extraction module, a term text update module, a term relationship extraction module, and a term relationship update module.

The term text extraction module is used for extracting candidate term texts from the historical CRF forms, wherein the candidate term texts comprise forms, form items, and check items; the term text updating module is used for fusing the candidate term texts based on known standard terms so as to update the term concept information of the knowledge graph; the term relationship extraction module is used for extracting candidate term relationships based on the updated knowledge graph and the historical CRF forms, wherein the candidate term relationships comprise form-form item correspondences and form-check item correspondences; and the term relationship updating module is used for ranking the candidate term relationships by confidence so as to update the term relationship information of the knowledge graph.

In an embodiment, the term text updating module is specifically configured to use the known standard term as an initial clustering center, and fuse the candidate term text by using a partitional clustering method.

In one embodiment, the partitional clustering method is a K-means algorithm.
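A minimal sketch of seeding a K-means-style procedure with the known standard terms is given below. Several simplifications are assumptions of this sketch and not features of the application: texts are vectorized as character-bigram counts, cosine similarity replaces the Euclidean distance of textbook K-means, and each seed term is kept as a member of its own cluster so that no center can become empty. All function names are hypothetical.

```python
from collections import Counter
import math


def bigram_vector(text):
    """Character-bigram counts, a simple stand-in for text vectorization."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {k: v / len(vectors) for k, v in total.items()}


def seeded_kmeans(candidates, standard_terms, iterations=5):
    """K-means-style fusion with one cluster per known standard term."""
    vectors = {c: bigram_vector(c) for c in candidates}
    # Initial clustering centers are the standard terms themselves.
    centers = {s: dict(bigram_vector(s)) for s in standard_terms}
    assignment = {}
    for _ in range(iterations):
        # Assignment step: attach each candidate to its most similar center.
        assignment = {
            c: max(standard_terms, key=lambda s: cosine(vec, centers[s]))
            for c, vec in vectors.items()
        }
        # Update step: recompute each center from its members plus the seed term.
        for s in standard_terms:
            members = [vectors[c] for c, a in assignment.items() if a == s]
            centers[s] = mean_vector(members + [bigram_vector(s)])
    return assignment


fused = seeded_kmeans(
    ["Haemoglobin", "hemoglobin (Hb)", "platelet count", "Platelets"],
    ["hemoglobin", "platelets"],
)
```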

In an embodiment, the term text updating module is specifically configured to calculate a similarity between the candidate term text and a known standard term, and fuse the candidate term text based on a similarity ranking.

In an embodiment, the term text updating module is specifically configured to calculate the semantic similarity and/or glyph (character-form) similarity between the candidate term text and a known standard term based on an edit distance algorithm and/or a cosine similarity algorithm.
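The two measures named above can be sketched as follows. The edit distance is the standard Levenshtein distance; the cosine similarity here is computed over bag-of-words counts, which is only a placeholder for whatever semantic vectorization an implementation would actually use. Combining the two scores by taking their maximum is an assumption of this sketch, as are the function names.

```python
from collections import Counter
import math


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def surface_similarity(a, b):
    """Edit distance normalized to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def cosine_similarity(a, b):
    """Cosine over word counts (a placeholder for a semantic embedding)."""
    u, v = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def rank_standard_terms(candidate, standards):
    """Sort standard terms by their best similarity to the candidate."""
    scored = [(s, max(surface_similarity(candidate.lower(), s.lower()),
                      cosine_similarity(candidate, s))) for s in standards]
    return sorted(scored, key=lambda pair: -pair[1])


ranking = rank_standard_terms("haemoglobin", ["platelets", "hemoglobin"])
```

The candidate would then be fused with the top-ranked standard term, provided its score clears whatever acceptance threshold the implementation adopts.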

In an embodiment, the term text updating module is specifically configured to merge residual term texts in the candidate term texts by using a hierarchical clustering algorithm to generate new term concept information, wherein the residual term texts are the candidate term texts which cannot be fused with known standard terms.
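Merging residual term texts by hierarchical clustering might look like the sketch below. It assumes single-linkage agglomerative clustering, a string similarity from `difflib.SequenceMatcher`, and a stopping threshold of 0.7; none of these choices is stated in the application, and the function names are hypothetical. Each resulting cluster would correspond to one new term concept.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def agglomerate(residual_terms, threshold=0.7):
    """Single-linkage agglomerative clustering that stops below `threshold`."""
    clusters = [[t] for t in residual_terms]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                score = max(similarity(a, b)
                            for a in clusters[i] for b in clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # Remaining clusters are too dissimilar to merge.
        i, j = best_pair
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


new_concepts = agglomerate([
    "tumor assessment",
    "tumour assessment",
    "concomitant medication",
    "concomitant medications",
])
```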

In an embodiment, the term text updating module is specifically configured to perform text vectorization on the candidate term text and the known standard term to obtain a candidate term text representation and a standard term text representation;

In one embodiment, the known standard terms include known terms in clinical data acquisition standards and/or research data table formats.

In an embodiment, the term relationship extraction module is specifically configured to perform pairwise context matching on the term texts in the updated knowledge graph and the historical CRF forms based on a preset template, so as to extract candidate term relationships; and/or to perform relationship prediction on the term texts in the updated knowledge graph and the historical CRF forms based on a pre-trained model, so as to extract candidate term relationships.

In one embodiment, the preset template includes a manually defined template and/or a statistical template.
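Pairwise context matching against manually defined templates can be illustrated with regular expressions. The two templates, the relation labels, and the sample sentence below are invented for this sketch and are not taken from the application; a statistical template or a pre-trained relation prediction model would replace the `TEMPLATES` list in other embodiments.

```python
import re
from itertools import permutations

# Hypothetical manual templates; {a} and {b} are placeholders for term texts.
TEMPLATES = [
    ("form-form item", "{a} form includes {b}"),
    ("form-check item", "{a} requires testing {b}"),
]


def extract_relations(text, terms):
    """Try every ordered pair of terms against every template."""
    found = []
    for a, b in permutations(terms, 2):
        for relation, template in TEMPLATES:
            # Escape the term texts so they are matched literally.
            pattern = template.format(a=re.escape(a), b=re.escape(b))
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.append((a, relation, b))
    return found


text = ("The vital signs form includes body temperature. "
        "Blood routine requires testing RBC.")
terms = ["vital signs", "body temperature", "blood routine", "RBC"]
candidate_relations = extract_relations(text, terms)
```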

In one embodiment, the term text extraction module and the term relationship extraction module are further configured to extract term texts and term relationships from the study flow chart and the full text of a historical clinical trial protocol, respectively, and the term text updating module and the term relationship updating module are further configured to update the term concept information and the term relationship information of the knowledge graph based on the confidence of the term texts and the term relationships; wherein the study flow chart comprises a visit task information block.

In an embodiment, the term text extraction module and the term relationship extraction module are further specifically configured to perform text parsing on the study flow chart and split out the visit task information block; identify visit tasks from the visit task information block, and match the visit tasks with the standard visit tasks in the knowledge graph to obtain a first candidate visit task set as term texts; and parse the check items of the visit tasks in the first candidate visit task set as term texts, and generate visit task-check item correspondence information as term relationships.

In one embodiment, the term text extraction module and the term relationship extraction module are further specifically configured to scan the full text of the historical clinical trial protocol to obtain visit tasks; match the visit tasks obtained by scanning with the standard visit tasks in the knowledge graph to obtain a second candidate visit task set as term texts; and parse the check items of the visit tasks in the second candidate visit task set as term texts, and generate visit task-check item correspondence information as term relationships.

In one embodiment, the term text extraction module and the term relationship extraction module are further specifically configured to extract protocol metadata from the historical clinical trial protocol; and to extract the applicable conditions of the form, the form-form item correspondence, and the form-check item correspondence based on the protocol metadata, so as to update the applicable condition information of the knowledge graph.

The method of constructing a knowledge graph according to the embodiment of the present specification is described above with reference to fig. 1 to 3. The details mentioned in the above description of the method embodiments are equally applicable to the apparatus for constructing a knowledge graph of the embodiments of the present specification. The above knowledge graph constructing apparatus may be implemented by hardware, or may be implemented by software, or a combination of hardware and software.

Fig. 5 illustrates a hardware configuration diagram of an electronic device according to an embodiment of the present specification. As shown in fig. 5, the electronic device 30 may include at least one processor 31, a storage 32 (e.g., a non-volatile storage), a memory 33, and a communication interface 34, and the at least one processor 31, the storage 32, the memory 33, and the communication interface 34 are connected together via a bus 35. The at least one processor 31 executes at least one computer-readable instruction stored or encoded in the storage 32.

It should be appreciated that the computer-executable instructions stored in the storage 32, when executed, cause the at least one processor 31 to perform the various operations and functions described above in connection with fig. 1-3 in the various embodiments of the present description.

In embodiments of the present description, the electronic device 30 may include, but is not limited to: personal computers, server computers, workstations, desktop computers, laptop computers, notebook computers, mobile electronic devices, smart phones, tablet computers, cellular phones, Personal Digital Assistants (PDAs), handheld devices, messaging devices, wearable electronic devices, consumer electronic devices, and the like.

According to one embodiment, a program product, such as a machine-readable medium, is provided. The machine-readable medium may have instructions (i.e., the elements described above as being implemented in software) that, when executed by a machine, cause the machine to perform the various operations and functions described above in connection with fig. 1-3 in the various embodiments of the present specification. Specifically, a system or apparatus equipped with a readable storage medium may be provided, the readable storage medium storing software program code that implements the functions of any of the above embodiments, and a computer or processor of the system or apparatus reads out and executes the instructions stored in the readable storage medium.

In this case, the program code itself read from the readable medium can realize the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of this specification.

Examples of the readable storage medium include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or from the cloud via a communications network.

It will be understood by those skilled in the art that various changes and modifications may be made in the above-disclosed embodiments without departing from the spirit of the invention. Accordingly, the scope of the present description should be limited only by the attached claims.

It should be noted that not all steps and units in the above flows and system structure diagrams are necessary, and some steps or units may be omitted according to actual needs. The execution order of the steps is not fixed, and can be determined as required. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical client, or some units may be implemented by multiple physical clients, or some units may be implemented by some components in multiple independent devices.

In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module, or processor may comprise permanently dedicated circuitry or logic (such as a dedicated processor, FPGA, or ASIC) to perform the corresponding operations. The hardware units or processors may also include programmable logic or circuitry (e.g., a general-purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The specific implementation (mechanical, permanently dedicated circuitry, or temporarily configured circuitry) may be determined based on cost and time considerations.

The detailed description set forth above in connection with the appended drawings describes example embodiments but is not intended to represent all embodiments which may be practiced or which fall within the scope of the appended claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for constructing a knowledge graph, the method comprising:

updating term relationship information of the knowledge-graph based on the confidence of the candidate term relationships;

the candidate term text is fused based on the known standard term, and the fusion specifically includes:

taking the known standard term as an initial clustering center, and fusing the candidate term texts by using a partition clustering method;

calculating the similarity of the candidate term text and a known standard term, and fusing the candidate term text based on similarity ranking; and

merging residual term texts by utilizing a hierarchical clustering algorithm to generate new term concept information, wherein the residual term texts are the candidate term texts which cannot be fused with known standard terms.

2. The method for constructing a knowledge-graph according to claim 1, wherein calculating the similarity between the candidate term text and the known standard term specifically comprises:

3. The method for constructing a knowledge-graph according to any one of claims 1 to 2, wherein the candidate term text is fused based on a known standard term, and specifically comprises:

4. The method of constructing a knowledge-graph according to any one of claims 1 to 2 wherein the known standard terms comprise known terms in clinical data acquisition standards and/or research data tabular formats.

5. The method for constructing a knowledge graph according to claim 1, wherein extracting candidate term relationships based on the updated knowledge graph and the historical CRF forms specifically comprises:

performing, based on a preset template, pairwise context matching on the term texts in the updated knowledge graph and the historical CRF forms so as to extract candidate term relationships; and/or,

6. The method of constructing a knowledge-graph of claim 1, wherein the method further comprises:

wherein the research flow chart comprises an access task information block.

7. The method for constructing a knowledge graph according to claim 6, wherein extracting term texts and term relationships from the study flow chart of the historical clinical trial protocol specifically comprises:

analyzing the examination items of the visit tasks in the first candidate visit task set as term texts, and generating visit task-examination item corresponding relation information as term relations.

8. The method for constructing a knowledge graph according to claim 6, wherein extracting term texts and term relationships from the full text of the historical clinical trial protocol specifically comprises:

9. The method of constructing a knowledge-graph of claim 1, wherein the method further comprises:

extracting protocol metadata from the historical clinical trial protocols;

10. An apparatus for constructing a knowledge graph, the apparatus comprising:

a term relationship updating module for performing confidence ranking on the candidate term relationships to update term relationship information of the knowledge-graph;

the term text updating module is specifically used for taking the known standard term as an initial clustering center and fusing the candidate term text by using a partition clustering method; calculating the similarity of the candidate term text and a known standard term, and fusing the candidate term text based on similarity ranking; and merging residual term texts in the candidate term texts by utilizing a hierarchical clustering algorithm to generate new term concept information, wherein the residual term texts are the candidate term texts which cannot be fused with known standard terms.

11. An electronic device, comprising:

at least one processor; and

a memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of constructing a knowledge-graph of any one of claims 1 to 9.

12. A machine readable storage medium storing executable instructions that when executed cause the machine to perform a method of constructing a knowledge-graph as claimed in any one of claims 1 to 9.