WO2023121165A1 - Method for generating a model that predicts an association between entities including a disease, a gene, a substance, and a symptom from document data and outputs unit argument text, and system using said method - Google Patents

Method for generating a model that predicts an association between entities including a disease, a gene, a substance, and a symptom from document data and outputs unit argument text, and system using said method Download PDF

Info

Publication number: WO2023121165A1
Authority: WO (WIPO PCT)
Prior art keywords: words, unit, text, data, model
Application number: PCT/KR2022/020686
Other languages: English (en), Korean (ko)
Inventors: 이동건, 김태용, 정찬웅, 김동인
Original assignee: 주식회사 스탠다임 (Standigm Inc.)
Application filed by 주식회사 스탠다임
Publication of WO2023121165A1

Classifications

    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N20/20 Ensemble learning
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for simulation or modelling of medical disorders
    • G16H70/00 ICT specially adapted for the handling or processing of medical references

Definitions

  • The present invention relates to a method for generating a model that predicts an association between entities including a disease, a gene, a substance, and a symptom queried in a query document and that also outputs the unit argument text of the association prediction, and to a system using the generated model.
  • Searching for a target gene or protein for disease treatment is the first step in new drug development and an important task that has a decisive effect on the success rate of new drug development.
  • Here, an association between a disease and a gene or protein means a case in which the gene or protein is involved in the disease, for example by causing its occurrence.
  • If a given disease and a given gene or protein are known to be related, that knowledge can be used to treat the disease by developing a new drug containing a substance that regulates the expression of the gene or protein. Identifying associations between diseases and genes or proteins is therefore very important.
  • Against this background, the present inventors developed a prediction model that achieves high learning efficiency with only a small amount of learning data, predicts associations between diseases, genes or proteins, drugs, and symptoms, and can identify the unit texts that are the basis of the association prediction, and thereby arrived at the present invention.
  • Patent Document: Korean Patent Publication No. 10-2019-0062413 (2019.06.05)
  • An object of the present invention is to provide a method and system that, by means of a first prediction model that identifies the relationships between words and the contextual information between unit texts and derives a result value therefrom, and a second prediction model that receives as input a query document and two words of different types among a first type word referring to a disease, a second type word referring to a gene, a third type word referring to a substance, and a fourth type word referring to a symptom, and that uses the result value output from the first prediction model to output whether the two words are associated in the query document and whether each unit text included in the query document is a unit argument text, can confirm not only the association between the two words but also the unit argument text used to predict the association.
  • Another object is to provide a method and system in which learning efficiency is improved by excluding from learning the data labeled as having no association between two words, and in which the accuracy of predicting both the presence of an association and the unit argument text is improved.
  • Another object is to provide a method and system in which the learning efficiency for the first data and the second data is improved by using the pre-trained first prediction model to derive the relationships between words and the contextual information between unit texts.
  • Another object is to provide a method and system in which, because the learning weights are adjusted using the prediction accuracy over all tokens of the first document data to be learned, high learning efficiency is achieved even with a small amount of biological document data.
  • In order to achieve the above objects, the present invention provides a method comprising: (a) generating, by a first prediction model generating device, a first prediction model that learns first document data including at least one unit text including at least one word and derives the relationships between words and the contextual information between unit texts; and (b) generating, by a second prediction model generating device, a second prediction model by further training the first prediction model generated by the first prediction model generating device on a first data set, consisting of pairs of two words of different types among a first type word referring to a disease, a second type word referring to a gene, a third type word referring to a substance, and a fourth type word referring to a symptom together with whether the two words are associated, and a second data set, consisting of such word pairs, whether the two words are associated, and whether each unit text is a unit argument (clue) text determined to show the association between the two words.
  • In step (b), among the first data set and the second data set, the data remaining after excluding data in which the two words of different types have no association are learned simultaneously.
  • Here, the second data is divided into learning target data, consisting of pairs of two words of different types, the association between the two words, and whether each unit text included in a query document corresponds to a unit argument text, and non-learning target data, consisting of pairs of two words of different types that have no association together with whether each unit text corresponds to a unit argument text; in step (b), only the learning target data among the second data may be learned together with the first data.
  • In step (b), the first data set and the learning target data may be learned simultaneously.
  • Prior to step (a), the method may further include generating, by the first prediction model generating device, a first prediction model that learns biological document data or medical document data to derive the relationships between words and the contextual information between unit texts.
  • Step (a) may include: a token generation step in which a first tokenization module tokenizes the unit texts of the first document data based on words to generate multiple tokens; a first masking step in which a first masking module masks some of the generated tokens; a conversion step in which a conversion module converts the masked tokens into tokens corresponding to words different from the words before masking; a prediction step in which a first learning module predicts, for all tokens including the converted tokens, whether each token is original or replaced; a step in which a first prediction accuracy calculation module calculates the prediction accuracy of the first learning module over all tokens of the first document data; and a first prediction model generation step in which a first prediction model generation module generates the first prediction model using the calculated prediction accuracy.
  • Step (b) may include: a second masking step in which a second masking module masks, among the second data, data in which the two words of different types have no association; a preprocessed data generation step in which a preprocessing module preprocesses, in a predetermined manner, two arbitrary words of different types and second document data including unit texts containing the two arbitrary words, thereby generating preprocessed data; a derivation step in which the preprocessed data is input to the first prediction model and a hidden state for each piece of preprocessed data is derived through the output layer of the first prediction model; an output step in which the hidden states pass through the association layer and the clue layer of the second prediction model before learning-weight adjustment, outputting whether the two arbitrary words are associated and whether each unit text included in the second document data is a unit argument text; and a step in which a second learning module simultaneously learns the association output and the unit-argument-text output of the output step, using the association loss between the arbitrary first word and the arbitrary second word together with the clue text loss.
  • The second document data to be preprocessed in the preprocessed data generation step may have one unit text that includes both of the two arbitrary words, or may have different unit texts each including at least one of the two arbitrary words.
  • The preprocessing may include a token addition step in which a first token containing contextual information for confirming the association between words is added to a unit text in the form of a question, and a second token containing contextual information for determining whether an individual unit text corresponds to a unit argument text is added to each of the unit texts of the abstract of a query document.
  • The method may further include: (c) inputting, to an input device, query document data including one or more unit texts together with two words of different types among the first to fourth type words; and (d) inputting the query document data and the two words of different types input in step (c) to the input layer of the second prediction model, and outputting, through the output layer of the second prediction model, whether the two words of different types are associated and whether each of the unit texts included in the query document data is a unit argument text.
  • Step (c) may include: (c1) inputting, to the input device, first query document data including one or more unit texts together with two words of different types among the first to fourth type words; and (c2) inputting, to the input device, second query document data including one or more unit texts together with any one of the two words input in step (c1) and another word of a different type among the first to fourth type words. The method may further include computing whether the three words are associated, together with the unit argument text of the association prediction, and outputting the computed information.
  • The present invention also provides a system using a model generated according to the above method, comprising: an input device configured to receive a query document and association prediction target words including two words of different types among a first type word referring to a disease, a second type word referring to a gene, a third type word referring to a substance, and a fourth type word referring to a symptom; and an arithmetic device to which the generated model is transmitted and which is configured to use the transmitted model to predict the association between the association prediction target words input through the input device and to output the unit argument text of the association prediction through an output device. The system thus predicts associations between entities including diseases, genes, substances, and symptoms and outputs the corresponding unit argument text.
  • Furthermore, the present invention provides a computer program stored in a computer-readable recording medium for executing the above-described method.
  • According to the present invention, by means of a first prediction model that identifies the relationships between words and the contextual information between unit texts and derives a result value therefrom, and a second prediction model that receives as input a query document and two words of different types among a first type word referring to a disease, a second type word referring to a gene, a third type word referring to a substance, and a fourth type word referring to a symptom, and that uses the result value derived from the first prediction model to output whether the two input words are associated and whether each unit text included in the query document is an argument text, not only can the association between the two words be confirmed, but the argument text used to predict the association can also be checked together.
  • In addition, learning efficiency is improved by excluding from learning the data labeled as having no association between two words, and the accuracy of predicting both the association and the argument text also improves.
  • In addition, since the pre-trained first prediction model is used to derive the relationships between words and the contextual information between unit texts, the learning efficiency for the first data and the second data is improved.
  • FIG. 1 is a schematic block diagram for explaining a system according to an embodiment of the present invention.
  • FIG. 2 is a flowchart illustrating a method according to an embodiment of the present invention.
  • FIG. 3 is a flowchart explaining in detail the first prediction model generation process (S10) of FIG. 2.
  • FIG. 4 is a flowchart explaining in detail the second prediction model generation process (S20) of FIG. 2.
  • FIG. 5 is a diagram for explaining a process of generating a second predictive model in more detail.
  • FIG. 6 is a diagram for explaining forms of a query document, pre-processed data when a first word and a second word are input, and data output from a second prediction model.
  • FIG. 7 is a schematic diagram for explaining data to be learned and non-learned data, taking medical document data as an example.
  • Herein, a word referring to a disease is defined as a first type word, a word referring to a gene as a second type word, a word referring to a substance (drug) as a third type word, and a word referring to a symptom as a fourth type word.
  • Herein, the term "gene" refers to an individual unit of genetic information composed of a specific base sequence in a genome of DNA or RNA, and is a concept that also includes individual units of genetic information composed of a specific amino acid sequence, not only DNA and RNA. That is, proteins carrying genetic information should also be understood as included in the concept of "gene" herein.
  • Herein, a unit text is a group of two or more words forming a phrase, clause, or sentence. It is a concept that includes a 'phrase' (divided into noun phrases, verb phrases, adjective phrases, and adverb phrases depending on type), a 'clause' (a unit with a subject and predicate that is used as a component of another sentence but cannot be used independently), a 'sentence' (which includes one or more phrases and clauses, has a subject and predicate, and can be used independently), and a 'paragraph' (a broad division of a document based on content or form).
  • Referring to FIG. 1, a system 100 includes a first prediction model generating device 10, a second prediction model generating device 20, an input device 30, an arithmetic device 40, and an output device 50.
  • the system 100 according to the present invention is provided with a communication device (not shown), enabling mutual wired/wireless communication with external databases D through a communication network.
  • That is, the system according to the present invention generates a prediction model using learning data and document data pre-stored in the external database D or in a separate internal database (not shown), and performs prediction using the generated prediction model.
  • The database D may be an independent external database (a literature database, medical information database, pharmaceutical information database, medical record database, health record database, etc.).
  • a model for predicting association between entities is generated through two steps.
  • The first step is to generate a first prediction model that learns first document data including at least one unit text including at least one word and derives the relationships between words and the contextual information between unit texts.
  • The first document data may include biological document data; more specifically, it may be an abstract of biological document data.
  • Biological document data can include any document that includes biological content, such as structure and function, growth, evolution, distribution, and classification of organisms.
  • Alternatively, the first document data may be medical document data; more specifically, this may include electronic medical record (EMR) data, electronic health record (EHR) data, personal health record (PHR) data, and the like.
  • the first prediction model generation device 10 includes a first tokenization module 11, a first masking module 12, a conversion module 13, a first learning module 14, and a first prediction accuracy calculation module 15 and a first predictive model generating module 16.
  • the first tokenization module 11 is configured to tokenize unit texts included in the first document data based on words or sub-words to generate a plurality of tokens.
  • a word or subword that is a criterion for tokenization may be determined based on a pre-built dictionary. That is, a dictionary may be constructed to include many words, and tokenization may be performed based on words included in the constructed dictionary.
  • However, when tokenization is based only on the words appearing in ordinary text, generalization may be difficult because the surface form of a word changes with variables such as neologisms, typos, and plurals; tokenization based on subwords is therefore desirable, as sketched below.
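  • The following is a minimal sketch of word/subword tokenization, assuming the Hugging Face transformers library and a generic WordPiece vocabulary (bert-base-uncased) as a stand-in for the pre-built dictionary, which the patent does not specify.

```python
# Minimal sketch of subword tokenization (assumed stand-in for the first
# tokenization module). A generic WordPiece vocabulary is used here; the
# patent's own dictionary is not specified.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

unit_text = "Creatine kinase levels were elevated in patients with myopathy."
print(tokenizer.tokenize(unit_text))
# Rare biomedical words fall back to subword pieces (the exact pieces depend
# on the vocabulary), which is why subword tokenization generalizes better
# than whole-word tokenization for neologisms, typos, and plurals.
```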
  • the first masking module 12 is configured to mask some of the tokens generated by the first tokenization module 11 .
  • Masking means the operation of hiding the contents of a token, and the masked token is changed to a 'MASK' token.
  • the conversion module 13 is configured to convert some tokens masked by the first masking module 12 into tokens corresponding to words different from words before masking. For example, when the token 'cooked' is masked, the conversion module 13 converts it into a token 'ate' different from 'cooked'.
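  • A minimal sketch of these two steps follows; the mask ratio and the random choice of replacement token are illustrative assumptions (a small generator model could propose the replacements instead).

```python
# Sketch of the first masking module and the conversion module: some token
# positions are masked, and each masked position is then replaced with a
# token for a *different* word. Mask ratio and random replacement are
# simplifying assumptions.
import random

def mask_and_convert(tokens, vocab, mask_ratio=0.15, seed=0):
    rng = random.Random(seed)
    n = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n)
    converted = list(tokens)
    labels = ["original"] * len(tokens)  # per-token targets for the next step
    for i in positions:
        converted[i] = rng.choice([w for w in vocab if w != tokens[i]])
        labels[i] = "replaced"           # e.g. 'cooked' -> 'ate'
    return converted, labels

tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["ate", "ran", "soup", "chef", "meal", "cooked", "the"]
print(mask_and_convert(tokens, vocab))
```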
  • The first learning module 14 is configured to predict, for each token, whether it is an unconverted token (original) or a converted token (replaced).
  • Importantly, the first learning module 14 predicts 'original' or 'replaced' for all tokens of the first document data, rather than only for the masked tokens.
  • the first prediction accuracy calculation module 15 is configured to calculate the prediction accuracy of the first learning module 14 . For example, the prediction accuracy of whether the token predicted by the first learning module 14 as 'original' is actually an unconverted token or the token predicted as 'replaced' is actually a converted token is calculated.
  • The first prediction model generation module 16 fits the prediction model (adjusts the learning weights) so that the prediction accuracy calculated by the first prediction accuracy calculation module 15 is maximized (i.e., the loss value is minimized), and is thereby configured to generate a first prediction model that derives the relationships between words and the contextual information between unit texts included in the first document data.
  • Since the prediction accuracy over all tokens of the first document data is calculated and the prediction model is fitted to maximize it, high learning efficiency is achieved even with a small amount of document data.
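  • This objective closely resembles ELECTRA-style replaced-token detection; the sketch below, with illustrative dimensions, shows a per-token original/replaced classifier whose loss covers every token.

```python
# Sketch of the training objective implied above: a binary original/replaced
# prediction is made for EVERY token, so the accuracy (and loss) used to fit
# the model covers all tokens, not only the masked ones.
import torch
import torch.nn as nn

class TokenDiscriminator(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(hidden, 1)  # per-token logit: original vs replaced

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # one hidden state per token
        return self.head(h).squeeze(-1)

model = TokenDiscriminator()
token_ids = torch.randint(0, 30522, (2, 16))   # toy batch of token ids
labels = torch.randint(0, 2, (2, 16)).float()  # 1 = replaced, 0 = original
loss = nn.BCEWithLogitsLoss()(model(token_ids), labels)
loss.backward()  # adjust weights to minimize loss, i.e. maximize accuracy
```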
  • In other words, the first prediction model is generated through a learning process that masks some tokens of the first document data to be learned and then makes predictions about those tokens.
  • Because the words and/or unit texts surrounding a masked token are consulted during this learning process, not only relationship information between individual tokens but also information between unit texts is learned. That is, through the first prediction model it is possible to learn the relationship information between words and the contextual information between unit texts included in document data.
  • a second predictive model predicting associations between entities is generated through additional learning of the generated first predictive model.
  • The second prediction model generating device 20 includes a second masking module 21, a preprocessing module 22, a second learning module 23, a second prediction accuracy calculation module 24, and a second prediction model generation module 25.
  • In order to generate the second prediction model, the second document data used for learning may be pre-stored in the database D or stored in a separate storage module (not shown) of the second prediction model generating device 20.
  • Biological document data, specifically abstracts of biological document data, or medical document data may be used as the training data. That is, the second document data is a concept that includes biological document data and medical document data. However, it is not limited to the listed examples; any document data including at least two entities among diseases, genes, substances (drugs), and symptoms may serve as second document data.
  • The second document data to be learned includes two types of entities, whether the two entities are associated (expressible as a value of 0 or 1), and whether the corresponding unit text is a basis for predicting that the entities are associated.
  • As an example, consider second document data including 3 first-type words, 5 second-type words, and 4 unit texts.
  • Up to 270 (30 + 240) learnable pieces of data can be generated from this one item of second document data.
  • The first data consists of pairs of two words of different types (e.g., between a word of the second type and a word of the fourth type, or between a word of the third type and a word of the fourth type) together with whether the pair is associated.
  • Examples of the first data include 'gastroschisis' - 'maternal serum alpha-fetoprotein' - 'associated (can be expressed as a value of 1)' and 'myopathy' - 'creatine kinase' - 'not associated (can be expressed as a value of 0)'.
  • The second data consists of a pair of two words of different types among the first to fourth type words, whether the two words are associated, and whether each unit text is a unit argument (clue) text determined to show that the two words are associated.
  • An example of the second data is 'gastroschisis' - 'maternal serum alpha-fetoprotein' - 'associated (can be expressed as a value of 1)' - 'Second-trimester maternal serum alpha-fetoprotein levels in pregnancies associated with gastroschisis and omphalocele' - '1 (meaning that the unit text is a unit argument text)'. Other entries may take forms such as 'disease' - 'gene' - 'associated (value 1)' - '0 (meaning that the unit text is not a unit argument text)' or 'disease' - 'drug' - 'not associated (value 0)' - '0 (meaning that the unit text is not a unit argument text)'.
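  • For concreteness, the two kinds of training records could be represented as below; the field names are illustrative assumptions, not the patent's own notation.

```python
# Sketch of the first data and second data described above; field names are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FirstDatum:
    word_a: str        # e.g. a first type word (disease)
    word_b: str        # e.g. a second type word (gene/protein)
    associated: int    # 1 = associated, 0 = not associated

@dataclass
class SecondDatum:
    word_a: str
    word_b: str
    associated: int
    unit_text: str
    is_clue: int       # 1 = this unit text is a unit argument (clue) text

first = FirstDatum("gastroschisis", "maternal serum alpha-fetoprotein", 1)
second = SecondDatum(
    "gastroschisis", "maternal serum alpha-fetoprotein", 1,
    "Second-trimester maternal serum alpha-fetoprotein levels in pregnancies "
    "associated with gastroschisis and omphalocele", 1,
)
# Second data with associated == 0 is non-learning target data; the second
# masking module (below) excludes it from training.
```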
  • The second masking module 21 masks the non-learning target data to be excluded from the learning process. That is, by masking the data among the second data in which the two words have no association, the second masking module 21 naturally excludes such data from the learning target.
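  • One plausible realization of this exclusion, sketched below, is a per-example loss mask; the exact mechanism is an assumption, since the patent states only that such data is masked.

```python
# Assumed realization: second data whose word pair is labeled as not
# associated contributes nothing to the training loss.
import torch

assoc_labels = torch.tensor([1.0, 0.0, 1.0])  # second-data pairs: associated?
losses = torch.tensor([0.7, 0.9, 0.4])        # per-example losses (illustrative)

learn_mask = assoc_labels.eq(1.0).float()     # keep only associated pairs
masked_loss = (losses * learn_mask).sum() / learn_mask.sum().clamp(min=1)
print(masked_loss)  # the middle example is naturally excluded from learning
```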
  • The preprocessing module 22 is configured to preprocess the learning target data among the first data and the second data; the second learning module 23 receives the preprocessed data and learns to predict whether two words are associated and which unit texts support the association; and the second prediction accuracy calculation module 24 is configured to calculate the prediction accuracy of the second learning module 23.
  • Two words of different types among the first to fourth type words and the abstract of the second document data are input to the preprocessing module 22, and the preprocessing module 22 generates a unit text in the form of a question asking about the association between the two words.
  • The preprocessing module 22 distinguishes the generated question-form unit text from the unit texts included in the abstract of the second document data by adding a first token and a second token to the unit texts: for example, a [CLS] token (first token) may be added to the question-form unit text, and [SEP] tokens (second tokens) may be added to the unit texts included in the abstract. That is, from the token added to a unit text it is possible to identify whether that unit text is the question-form unit text querying the association between the two words or a unit text included in the document data (and, furthermore, the number of the corresponding unit text can also be identified).
  • The question-form unit text and the unit texts included in the abstract of the second document data may likewise be tokenized based on words or subwords (in the same manner as the tokenization by the first tokenization module).
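  • A minimal sketch of this preprocessing follows; the question template is an illustrative guess, as the patent does not fix its wording.

```python
# Sketch of the preprocessing: a question-form unit text built from the two
# queried words is marked with a [CLS] token, and each unit text of the
# abstract is marked with a [SEP] token. The question template is assumed.
def preprocess(word_a, word_b, abstract_unit_texts):
    question = f"Is {word_a} associated with {word_b}?"  # assumed template
    pieces = ["[CLS] " + question]
    for unit_text in abstract_unit_texts:
        pieces.append("[SEP] " + unit_text)
    return " ".join(pieces)

print(preprocess(
    "gastroschisis",
    "maternal serum alpha-fetoprotein",
    ["Second-trimester maternal serum alpha-fetoprotein levels were studied.",
     "Levels were elevated in pregnancies with gastroschisis."],
))
```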
  • The preprocessed tokens are input to the first prediction model generated by the first prediction model generating device 10, and the first prediction model derives a hidden state for each input token.
  • Each hidden state derived from the first prediction model may have a multidimensional vector value and encodes the contextual information of the tokens input to the first prediction model (this is possible because the first prediction model has been pre-trained to derive the relationships between words and the contextual information between unit texts).
  • The second learning module 23 simultaneously learns the association output (0 or 1) from the association layer and the unit-argument-text output (0 or 1) from the clue layer. That is, the second prediction accuracy calculation module 24 calculates how accurately the association between the two words and the unit argument text are derived: the degree to which the association is derived incorrectly (association loss) and the degree to which the unit-argument-text decision is derived incorrectly (clue text loss). The second learning module 23 then learns from the association loss and the clue text loss at the same time.
  • Here, the [CLS] token contains the contextual information for confirming the association between the two words, and the [SEP] token contains the contextual information for determining whether an individual unit text is a unit argument text.
  • For simultaneous learning, the association loss and the clue text loss may be added together and passed through a backpropagation process.
  • The second prediction model generation module 25 is configured to generate, through repeated learning, a model in which the association loss and the clue text loss are minimized (e.g., the value of association loss + clue text loss is minimized).
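  • The sketch below shows, with illustrative dimensions, the two output heads and the summed objective: the hidden state at the [CLS] position feeds the association layer, the hidden states at the [SEP] positions feed the clue layer, and the two binary cross-entropy losses are added before backpropagation.

```python
# Sketch of the association layer and clue layer on top of the pre-trained
# encoder's hidden states. Dimensions and head shapes are assumptions.
import torch
import torch.nn as nn

class SecondPredictionHeads(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.association_layer = nn.Linear(hidden, 1)  # associated or not
        self.clue_layer = nn.Linear(hidden, 1)         # unit argument text or not

    def forward(self, cls_state, sep_states):
        assoc_logit = self.association_layer(cls_state).squeeze(-1)  # (batch,)
        clue_logits = self.clue_layer(sep_states).squeeze(-1)  # (batch, n_units)
        return assoc_logit, clue_logits

heads = SecondPredictionHeads()
cls_state = torch.randn(2, 256)      # hidden state of the [CLS] token
sep_states = torch.randn(2, 4, 256)  # hidden states of four [SEP] tokens
assoc_logit, clue_logits = heads(cls_state, sep_states)

assoc_labels = torch.tensor([1.0, 0.0])            # association: 0/1
clue_labels = torch.randint(0, 2, (2, 4)).float()  # clue text: 0/1 per unit
bce = nn.BCEWithLogitsLoss()
loss = bce(assoc_logit, assoc_labels) + bce(clue_logits, clue_labels)
loss.backward()  # association loss and clue text loss learned simultaneously
```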
  • the generated second prediction model may be transmitted to the computing device 40 .
  • In FIG. 1, the arithmetic device 40 is shown as a device separate from the second prediction model generating device 20, but the second prediction model generating device 20 and the arithmetic device 40 may of course be integrated to function as a single device.
  • A query document including a plurality of unit texts and the association prediction target words (two words of different types among the first to fourth type words) are input through the input device 30.
  • the input device 30 is not particularly limited as long as it can receive a user command such as a touch panel, keyboard, or scanner and transmit the command to the system according to the present invention.
  • The query document and the association prediction target words input through the input device 30 are queried to the input layer of the second prediction model, preprocessed through the preprocessing module 22, and passed through the association layer and clue layer of the second prediction model.
  • Whether the association prediction target words are associated, together with the unit argument text determined to show the association (more specifically, whether each unit text included in the query document corresponds to a unit argument text), is then output through the output layer.
  • The information output through the output layer (association / unit argument text) may be output through the output device 50 in a form visible to the user.
  • the output device 50 is not particularly limited as long as it is in a form such as a monitor or a display panel that can visually check the calculation result of the system according to the present invention.
  • Meanwhile, the arithmetic device 40 of the system may predict an association among three or more words by utilizing the information output from the second prediction model, and may output the unit argument text of the association prediction.
  • This is done by repeatedly inputting query document data and two words of different types to the input device 30 and using the information output from the second prediction model.
  • First, two words of different types and first query document data are input, and the association between the two words is predicted. Thereafter, second query document data and two words of different types, one of which is any one of the words from the first query, are input, and their association is predicted. That is, an association among three or more words can be predicted through a set of word queries sharing one word in common.
  • The first query document data and the second query document data may be the same document data or different document data.
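  • A sketch of this chaining follows; predict_association is a hypothetical stand-in for querying the second prediction model, with placeholder internals so the example runs.

```python
# Sketch of chaining pairwise queries to cover three words via one shared
# word. `predict_association` is a hypothetical wrapper; a real implementation
# would preprocess the inputs, run the encoder, and read both output heads.
def predict_association(model, document, word_a, word_b):
    unit_texts = [s for s in document.split(". ") if s]
    clues = [t for t in unit_texts if word_a in t and word_b in t]
    return bool(clues), clues  # (associated?, unit argument texts)

def predict_three_way(model, doc1, doc2, disease, gene, drug):
    dg, dg_clues = predict_association(model, doc1, disease, gene)  # query 1
    gd, gd_clues = predict_association(model, doc2, gene, drug)     # query 2 shares `gene`
    return (dg and gd), dg_clues + gd_clues

doc1 = "GENE1 was overexpressed in DISEASE1 patients. Controls were normal"
doc2 = "DRUG1 inhibits GENE1 in vitro. Further work is needed"
print(predict_three_way(None, doc1, doc2, "DISEASE1", "GENE1", "DRUG1"))
```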
  • Referring to FIGS. 2 and 3, the first prediction model generating device 10 learns the first document data and generates a first prediction model (S10).
  • Specifically, the first tokenization module 11 tokenizes the unit texts included in the first document data based on words to generate a plurality of tokens (S11).
  • The first masking module 12 masks some of the generated tokens (S12).
  • The conversion module 13 converts the masked tokens into tokens corresponding to words different from the words before masking (S13), and the first learning module 14 predicts, for all generated tokens including the converted tokens, whether each token is unconverted (original) or converted (replaced) (S14).
  • The first prediction accuracy calculation module 15 calculates the token prediction accuracy of the first learning module 14 over all tokens of the first document data to be learned (S15), and the first prediction model generation module 16 adjusts the learning weights so that this token prediction accuracy is maximized (the loss is minimized), thereby generating a first prediction model that derives the relationships between words included in document data and the contextual information between unit texts (S16).
  • Referring to FIGS. 2 and 4, the second prediction model generating device 20 additionally trains the first prediction model on the second document data, generating a second prediction model that, when a query document and two words of different types among the first to fourth type words are input, outputs whether the two input words are associated together with the unit argument text of the association prediction (S20).
  • Specifically, the second masking module 21 masks the data (non-learning target data) having no association between the two words among the second data extracted from the pre-stored second document data (S21).
  • The preprocessing module 22 preprocesses the two words and the second document data (specifically, the abstract of the second document data) in a predetermined manner to generate preprocessed data (S22). Since the details of the preprocessing method by the preprocessing module 22 have been described above, they are omitted here.
  • The data preprocessed by the preprocessing module 22 passes through the first prediction model, and a hidden state is derived for each piece of preprocessed data (S23).
  • The derived hidden states pass through the association layer and the clue layer of the second prediction model before learning-weight adjustment, and whether the two words are associated and whether each unit text included in the second document data is a unit argument text are output (S24).
  • The second prediction model is generated by the second learning module 23 simultaneously learning the association output and the unit-argument-text output of step S24 (S25). Specifically, the second learning module 23 simultaneously learns from the association prediction and the unit argument text prediction output in step S24, and the second prediction model generation module 25 generates the second prediction model by adjusting the learning weights (learning parameters) so that the prediction accuracy calculated by the second prediction accuracy calculation module 24 (the accuracy of the association prediction and of the unit argument text prediction) is maximized.
  • Referring to FIG. 2, a query document and two words are queried to the input layer of the second prediction model through the input device 30, and whether the two queried words are associated is output through the output layer together with the unit argument text of the association prediction.
  • This may be output through the output device 50 as readable information so that the user can check it with the naked eye.
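  • A hypothetical end-to-end usage of the resulting system is sketched below; the class and method names are illustrative assumptions, since the patent specifies devices and layers rather than a concrete API, and a stub stands in for the trained model so the example is self-contained.

```python
# Hypothetical usage sketch; SecondPredictionModel is a stand-in stub, not
# the patent's actual implementation.
class SecondPredictionModel:
    def predict(self, document, word_a, word_b):
        # A real implementation would preprocess the query, run it through
        # the encoder, and read the association layer and the clue layer.
        unit_texts = [s for s in document.split(". ") if s]
        clue_flags = [int(word_a in t and word_b in t) for t in unit_texts]
        return int(any(clue_flags)), clue_flags

model = SecondPredictionModel()
query_document = (
    "Second-trimester maternal serum alpha-fetoprotein levels were measured. "
    "Levels of alpha-fetoprotein were elevated in pregnancies with gastroschisis"
)
associated, clue_flags = model.predict(
    query_document, word_a="gastroschisis", word_b="alpha-fetoprotein"
)
print(associated)   # 1: an association was found
print(clue_flags)   # [0, 1]: which unit texts are unit argument texts
```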
  • In experiments, the association prediction accuracy was significantly higher than that of a first comparative example in which only the association was learned, and the unit argument text prediction accuracy was significantly higher than that of a second comparative example in which only the unit argument text was learned. That is, the present invention can output both the association between two words of different types and the unit argument text of the association prediction, and in terms of performance it was confirmed to be superior both to the comparative example that predicts only the association (first comparative example) and to the comparative example that predicts only the argument text (second comparative example).
  • All or at least part of the configuration of the system according to an embodiment of the present invention may be implemented in the form of a hardware module or a software module, or may be implemented in a combination of a hardware module and a software module.
  • Here, a software module may be understood as, for example, a command executed by a processor that controls an operation in the system, and such a command may take a form loaded in a memory of the prediction system.
  • the method according to the present invention described above may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer readable medium.
  • the computer readable medium may include program instructions, data files, data structures, etc. alone or in combination.
  • Program instructions recorded on the medium may be those specially designed and configured for the present invention or those known and usable to those skilled in computer software.
  • Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • Examples of program instructions include high-level language codes that can be executed by a computer using an interpreter, as well as machine language codes such as those produced by a compiler.
  • the hardware device described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Epidemiology (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Primary Health Care (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Physiology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for generating a model that predicts an association between two words of different types among entities including a disease, a gene, a substance, and a symptom queried in a query document and that outputs an argument sentence of the association prediction together with it; and a system using the generated model.
PCT/KR2022/020686 2021-12-21 2022-12-19 Method for generating a model that predicts an association between entities including a disease, a gene, a substance, and a symptom from document data and outputs unit argument text, and system using said method WO2023121165A1 (fr)

Applications Claiming Priority (4)

Application Number | Priority Date | Filing Date | Title
KR10-2021-0183991 | 2021-12-21 | |
KR1020210183991A (KR102426508B1, ko) | 2021-12-21 | 2021-12-21 | Method for constructing a model that predicts the association between a disease and a gene from document data and outputs argument sentences, and system using the same
KR10-2022-0090623 | 2021-12-21 | |
KR1020220090623A (KR102497200B1, ko) | 2021-12-21 | 2022-07-21 | Method for constructing a model that predicts the association between a disease and a gene from document data and outputs argument sentences, and system using the same

Publications (1)

Publication Number Publication Date
WO2023121165A1 (fr) 2023-06-29

Family

ID=82606380

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/020686 WO2023121165A1 (fr) 2021-12-21 2022-12-19 Method for generating a model that predicts an association between entities including a disease, a gene, a substance, and a symptom from document data and outputs unit argument text, and system using said method

Country Status (2)

Country Link
KR (2) KR102426508B1 (fr)
WO (1) WO2023121165A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102426508B1 (ko) * 2021-12-21 2022-07-29 주식회사 스탠다임 Method for constructing a model that predicts the association between a disease and a gene from document data and outputs argument sentences, and system using the same


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210286948A1 (en) * 2016-10-05 2021-09-16 National Institute Of Information And Communications Technology Causality recognizing apparatus and computer program therefor
KR20200080571A (ko) * 2018-12-27 2020-07-07 에스케이 주식회사 키워드와 관계 정보를 이용한 정보 검색 시스템 및 방법
KR102233464B1 (ko) * 2020-08-13 2021-03-30 주식회사 스탠다임 문서 데이터에서 질병 관련 인자들 간의 관계를 추출하는 방법 및 이를 이용하여 구축되는 시스템
KR102426508B1 (ko) * 2021-12-21 2022-07-29 주식회사 스탠다임 문서 데이터로부터 질병과 유전자 간의 연관성을 예측하고 논거 문장을 출력하는 모델의 구축 방법 및 이를 이용한 시스템

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MORTEZA POURREZA SHAHRI, MANDI M. ROE, GILLIAN REYNOLDS, INDIKA KAHANDA: "PPPred: Classifying Protein-phenotype Co-mentions Extracted from Biomedical Literature", BIORXIV, 31 May 2019 (2019-05-31), pages 1 - 9, XP093074505 *
SU JUNHAO, WU YE, TING HING-FUNG, LAM TAK-WAH, LUO RUIBANG: "RENET2: high-performance full-text gene–disease relation extraction with iterative training data expansion", NAR GENOMICS AND BIOINFORMATICS, vol. 3, no. 3, 23 June 2021 (2021-06-23), pages 1 - 10, XP093074501, DOI: 10.1093/nargab/lqab062 *

Also Published As

Publication number Publication date
KR102426508B1 (ko) 2022-07-29
KR102497200B1 (ko) 2023-02-08

Similar Documents

Publication Publication Date Title
EP4097726A1 Method for predicting a disease, gene, or protein related to a queried entity, and prediction system built using the same
Ammar et al. Many languages, one parser
WO2023121165A1 Method for generating a model that predicts an association between entities including a disease, a gene, a substance, and a symptom from document data and outputs unit argument text, and system using said method
WO2021132927A1 Computing device and method for classifying data categories
WO2017217661A1 Word sense embedding apparatus and method using a lexical semantic network, and homograph discrimination apparatus and method using a lexical semantic network and word embedding
WO2020204586A1 Drug repositioning candidate recommendation system, and computer program stored in a medium to execute each function of the system
WO2018034426A1 Method for automatically correcting errors in a tagged corpus using kernel PDR rules
WO2021182921A1 Method for determining technology similarity using a neural network
WO2015023035A1 Preposition error correction method and device for performing same
WO2019112117A1 Method and computer program for inferring meta-information of a text content creator
Vega et al. MineriaUNAM at SemEval-2019 task 5: Detecting hate speech in Twitter using multiple features in a combinatorial framework
Zennaki et al. Unsupervised and lightly supervised part-of-speech tagging using recurrent neural networks
WO2022191368A1 Data processing method and device for training a neural network that categorizes natural language intent
WO2022035074A1 Method for extracting relationships between disease-related factors from document data, and system built using the same
Khanuja et al. Mergedistill: Merging pre-trained language models using distillation
WO2024080783A1 Apparatus and method for generating TCR information corresponding to pMHC using artificial intelligence technology
WO2021246812A1 News positivity level analysis solution and device using a deep learning NLP model
WO2022030670A1 Frame-based deep learning system and method using queries
WO2022235073A1 Method for guiding improvement of reading and writing skills, and device therefor
WO2023195769A1 Method for extracting similar patent documents using a neural network model, and apparatus for providing same
Jones Non-hybrid example-based machine translation architectures
Kanojia et al. Harnessing cross-lingual features to improve cognate detection for low-resource languages
JP3123836B2 Text-type database device
WO2022191459A1 Drug design method and device implementing such a method
Todase et al. Script translation system for devnagari to english

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22911798

Country of ref document: EP

Kind code of ref document: A1