WO2021174695A1 - Machine learning-based drug recognition method and related device - Google Patents

Machine learning-based drug recognition method and related device

Info

Publication number
WO2021174695A1
Authority
WO
WIPO (PCT)
Prior art keywords
drug
sentence
medicine
sample
vector sequence
Prior art date
Application number
PCT/CN2020/093319
Other languages
English (en)
Chinese (zh)
Inventor
顾大中
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021174695A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Definitions

  • This application relates to the field of artificial intelligence entity recognition technology, and in particular to a method, device, computer equipment, and computer-readable storage medium for drug recognition based on machine learning.
  • the first aspect of the present application provides a method for drug identification based on machine learning, the method including:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
  • the coding model is used to extract the vector sequence of the drug sentence to be recognized, the chemical substance recognition model is used to obtain the chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized, and the therapeutic substance recognition model is used to obtain the therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized;
  • the substance entities existing in both the chemical substance entity set and the therapeutic substance entity set are identified as drugs.
  • a second aspect of the present application provides a medicine identification device based on machine learning, the device comprising:
  • the first acquisition module is configured to acquire a first drug sentence sample set, where each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
  • the first training module is used to train a coding model with the first sample set of medicine sentences
  • the second acquisition module is configured to acquire a second drug sentence sample set and a third drug sentence sample set, where each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
  • the second training module is used to extract the vector sequence of the second drug sentence sample using the coding model, and to train a chemical substance recognition model with the vector sequence of the second drug sentence sample as input, according to the chemical substance label of the second drug sentence sample;
  • the third training module is used to extract the vector sequence of the third drug sentence sample using the coding model, and to train a therapeutic substance recognition model with the vector sequence of the third drug sentence sample as input, according to the therapeutic substance label of the third drug sentence sample;
  • the third acquisition module is used to acquire the drug sentence to be recognized;
  • the first recognition module is used to extract the vector sequence of the drug sentence to be recognized using the coding model, to obtain the chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and to obtain the therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
  • the second identification module is used to identify the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  • a third aspect of the present application provides a computer device.
  • the computer device includes a memory and a processor, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
  • the coding model is used to extract the vector sequence of the drug sentence to be recognized, the chemical substance recognition model is used to obtain the chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized, and the therapeutic substance recognition model is used to obtain the therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized;
  • the substance entities existing in both the chemical substance entity set and the therapeutic substance entity set are identified as drugs.
  • the fourth aspect of the present application provides one or more readable storage media storing computer readable instructions.
  • when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
  • the coding model is used to extract the vector sequence of the drug sentence to be recognized, the chemical substance recognition model is used to obtain the chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized, and the therapeutic substance recognition model is used to obtain the therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized;
  • the substance entities existing in both the chemical substance entity set and the therapeutic substance entity set are identified as drugs.
  • This application uses the coding model to extract the vector sequence of the second drug sentence sample and the vector sequence of the third drug sentence sample, which improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively. Recognizing chemical substance entities in the drug sentence to be recognized is more stable than recognizing therapeutic substance entities, and identifying only the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs reduces the risk of misidentifying drugs. Therefore, the present application identifies the drugs in the drug sentence to be recognized and improves the efficiency and accuracy of drug identification.
  • the details of one or more embodiments of the present application are presented in the following drawings and description, and other features and advantages of the present application will become apparent from the description, drawings and claims.
  • Fig. 1 is a flowchart of a method for medicine identification based on machine learning provided by an embodiment of the present application.
  • Fig. 2 is a structural diagram of a medicine identification device based on machine learning provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the machine learning-based drug identification method of the present application is applied in one or more computer devices.
  • the computer device is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded equipment, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • FIG. 1 is a flowchart of a method for medicine identification based on machine learning provided in Embodiment 1 of the present application.
  • the medicine identification method based on machine learning is applied to a computer device for identifying medicines in medicine sentences to be identified.
  • the machine learning-based drug identification method includes:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label.
  • the obtaining a sample set of a first medicine sentence includes:
  • an optical scanner or a digital camera can be used to obtain a book image of a paper version of a medical book; the book image can be binarized, i.e. converted into a black-and-white image by applying a preset binarization threshold; preprocessing such as noise removal and tilt correction can be applied; and text recognition based on a neural network or on distance can then be performed on the preprocessed black-and-white image.
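The binarization step can be sketched as follows. This is a minimal illustration on a grayscale image held as nested lists; the function name `binarize` and the threshold value 128 are assumptions made for the example, not values given in the application.

```python
def binarize(gray_image, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (black) or 255 (white)."""
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_image
    ]


# Hypothetical 2x3 grayscale patch of a scanned page.
page = [[12, 200, 130], [90, 128, 255]]
black_white = binarize(page)
```

A real pipeline would more likely read the scan with an image library and pick the threshold from the image histogram; the fixed threshold here only mirrors the "preset binarization threshold" wording.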
  • a web crawler can be used to crawl electronic medical documents from Chinese journal literature databases (such as Wusing, HowNet) or Baidu Baike with keywords “ingredients”, “Chinese medicine” (or Chinese medicine names), etc.
  • the scanned paper version of the medical book and the grabbed electronic version of the medical document can be segmented, and the sentence can be deduplicated to obtain multiple drug sentences.
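Sentence segmentation and deduplication might look like the sketch below. The splitting rule (break after Chinese or Western sentence-ending punctuation) and the helper name `split_and_deduplicate` are assumptions; the application does not specify how sentences are segmented.

```python
import re


def split_and_deduplicate(documents):
    """Split documents into sentences and drop repeats, keeping first-seen order."""
    seen = set()
    sentences = []
    for text in documents:
        # Naive rule: a sentence ends at 。！？ or . ! ?
        # (this would also split decimals such as "0.5", so it is only a sketch).
        for piece in re.split(r"(?<=[。！？.!?])\s*", text):
            sentence = piece.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                sentences.append(sentence)
    return sentences


documents = [
    "Metformin is a white powder. Metformin is a white powder.",
    "Metformin is a white powder. Glucose is a sugar.",
]
drug_sentences = split_and_deduplicate(documents)
```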
  • For example, a drug sentence is "Metformin is a white powder". A word in the drug sentence is randomly selected as the missing word, and the drug sentence with the missing word "<S>Metformin is a white <mask><E>" is obtained, where "<S>" indicates the head of the drug sentence, "<E>" indicates the end of the drug sentence, "<mask>" indicates the missing word of the drug sentence, and "powder" is the missing word label of the drug sentence.
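The construction of a masked sample can be sketched as below. The function name `make_masked_sample` and the word-level token list are assumptions for illustration; the application only states that one word is chosen at random and replaced by the mask marker.

```python
import random


def make_masked_sample(words, rng=random):
    """Replace one randomly chosen word with <mask>; return (tokens, label)."""
    index = rng.randrange(len(words))
    masked = ["<S>"] + words[:index] + ["<mask>"] + words[index + 1:] + ["<E>"]
    return masked, words[index]


words = ["Metformin", "is", "a", "white", "powder"]
# A seeded generator keeps the example reproducible.
masked, label = make_masked_sample(words, random.Random(0))
```

Putting `label` back in place of `<mask>` recovers the original sentence, which is the prediction target used to train the coding model.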
  • the coding model may be a BERT model or a word embedding model.
  • training the coding model by using the first sample set of medicine sentences includes:
  • the BERT model and the preset fully connected layer are optimized according to the missing word vector and label of the first medical sentence sample.
  • said generating the input vector sequence of each first medicine sentence sample may include:
  • the words "metformin is a white powder" contained in the first drug sentence sample are obtained.
  • the Stanford word segmentation tool can be used to segment the first drug sentence sample, or the method based on statistics and string matching can be used to segment the first drug sentence sample.
  • the preset word encoding table may adopt one-hot, word2vec, etc. encoding methods, and the encoding vector of each word corresponds to the word one-to-one.
  • For example, the coding vector of a word of a first drug sentence sample is a 10-dimensional vector, the position vector of the word is a 2-dimensional vector, and the coding input vector of the word is the 12-dimensional vector obtained by concatenating the word's coding vector and the word's position vector.
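The 10 + 2 = 12 dimension arithmetic can be checked with a short sketch. The concrete vector values and the name `build_input_vector` are placeholders chosen for the example.

```python
def build_input_vector(coding_vector, position_vector):
    """Concatenate a word's coding vector with its position vector."""
    return list(coding_vector) + list(position_vector)


coding_vector = [0.0] * 10     # hypothetical 10-dimensional word coding vector
coding_vector[3] = 1.0         # e.g. a one-hot entry for the word
position_vector = [0.0, 1.0]   # hypothetical 2-dimensional position vector
input_vector = build_input_vector(coding_vector, position_vector)
```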
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label.
  • a second drug sentence sample in the second drug sentence sample set is "Metformin is a white water-soluble powder", and the label of the second drug sentence sample is "BH IH IH IH O O O O O O O O O O BH O O O".
  • a third drug sentence sample in the third drug sentence sample set is "Metformin is often used as the main drug in the treatment of diabetes", and the label of the third drug sentence sample is "O O O O O O BZ IZ IZ IZ O O O O O O O O”.
  • Here "O" denotes a non-named entity, BH is the starting label of a chemical substance, IH is the middle label of a chemical substance, BZ is the starting label of a therapeutic substance, and IZ is the middle label of a therapeutic substance.
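Turning such a label sequence back into entity strings can be sketched as follows. The helper `extract_entities` and the sub-word tokens in the example are hypothetical; joining tokens without a separator assumes character-level or sub-word tokens.

```python
def extract_entities(tokens, tags, begin_tag, inside_tag):
    """Collect spans that start with begin_tag and continue with inside_tag."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == begin_tag:
            if current:                      # close the previous entity
                entities.append("".join(current))
            current = [token]
        elif tag == inside_tag and current:  # extend the open entity
            current.append(token)
        else:                                # "O" or a stray inside tag
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities


tokens = ["met", "for", "min", "is", "white"]
tags = ["BH", "IH", "IH", "O", "O"]
chemical_entities = extract_entities(tokens, tags, "BH", "IH")
```

The same function handles therapeutic substance labels by passing "BZ" and "IZ" instead.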
  • the chemical substance identification model includes:
  • the chemical substance recognition model is composed of a long short-term memory (LSTM) network and a conditional random field connected to the LSTM network;
  • the coding model can be used to extract the vector sequence of the second drug sentence sample, and the LSTM network can be used to extract the context and semantic features of the second drug sentence sample to obtain the intermediate vector sequence of the second drug sentence sample;
  • the intermediate vector sequence is used as input and the conditional random field is used to output the chemical substance prediction label of the second drug sentence sample, and the parameters of the LSTM network and the conditional random field are optimized according to the chemical substance label and the chemical substance prediction label.
  • the therapeutic substance identification model includes:
  • the therapeutic substance recognition model is composed of a bidirectional long short-term memory (LSTM) network and a conditional random field connected to the bidirectional LSTM network;
  • the coding model can be used to extract the vector sequence of the third drug sentence sample, and the bidirectional LSTM network can be used to extract the context and semantic features of the third drug sentence sample to obtain the intermediate vector sequence of the third drug sentence sample;
  • the intermediate vector sequence is used as input and the conditional random field is used to output the therapeutic substance prediction label of the third drug sentence sample, and the parameters of the bidirectional LSTM network and the conditional random field are optimized according to the therapeutic substance label and the therapeutic substance prediction label.
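The application does not say how the conditional random field selects its output labels. A common choice for a linear-chain conditional random field is Viterbi decoding over per-position emission scores (here standing in for the LSTM output) and label-to-label transition scores. The sketch below uses that technique as an assumption; all scores are made-up numbers.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    n = len(labels)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n):
            # Best previous label for ending in label j at this position.
            best_i = max(range(n), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [labels[k] for k in path]


labels = ["O", "BZ", "IZ"]
# Hypothetical transition scores: "O" followed by "IZ" is heavily penalized.
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emissions = [[1.0, 0.8, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]
decoded = viterbi_decode(emissions, transitions, labels)
```

Picking the best label at each position independently would give "O IZ O", which starts an entity with a middle label; the transition penalty steers the decoder to "BZ IZ O" instead.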
  • the drug sentence to be identified can be "Metformin is a white water-soluble powder. Metformin is often used as the main drug in the treatment of diabetes, and the patient should strictly control the intake of glucose during the medication.”
  • For example, the coding model is used to extract the vector sequence of the drug sentence to be recognized; the chemical substance entity set obtained by recognizing that vector sequence with the trained chemical substance recognition model is {metformin, metformin, glucose}, and the therapeutic substance entity set obtained by recognizing that vector sequence with the therapeutic substance recognition model is {metformin}.
  • If the chemical substance entity set is {metformin, metformin, glucose} and the therapeutic substance entity set is {metformin}, then the substance entity "metformin", which exists in both the chemical substance entity set and the therapeutic substance entity set, is identified as a drug.
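The final identification step is a set intersection, sketched here with the entity sets from the example. The function name `identify_drugs` is an assumption.

```python
def identify_drugs(chemical_entities, therapeutic_entities):
    """Return the entities recognized by both models, without duplicates."""
    return set(chemical_entities) & set(therapeutic_entities)


chemical_entities = ["metformin", "metformin", "glucose"]
therapeutic_entities = ["metformin"]
drugs = identify_drugs(chemical_entities, therapeutic_entities)
```

Glucose is dropped because only the chemical substance model reported it.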
  • The machine learning-based drug identification method of the first embodiment acquires a first drug sentence sample set, in which each first drug sentence sample contains a missing word and a missing word label; trains the coding model with the first drug sentence sample set; acquires a second drug sentence sample set and a third drug sentence sample set, in which each second drug sentence sample contains a chemical substance label and each third drug sentence sample contains a therapeutic substance label; uses the coding model to extract the vector sequence of the second drug sentence sample and, with that vector sequence as input, trains a chemical substance recognition model according to the chemical substance label of the second drug sentence sample; uses the coding model to extract the vector sequence of the third drug sentence sample and, with that vector sequence as input, trains a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample; acquires the drug sentence to be recognized; uses the coding model to extract the vector sequence of the drug sentence to be recognized, uses the chemical substance recognition model to obtain a chemical substance entity set by recognizing that vector sequence, and uses the therapeutic substance recognition model to obtain a therapeutic substance entity set by recognizing that vector sequence; and identifies the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  • Using the coding model to extract the vector sequence of the second drug sentence sample and the vector sequence of the third drug sentence sample improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively.
  • Recognizing chemical substance entities in the drug sentence to be recognized is more stable than recognizing therapeutic substance entities, and identifying only the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs reduces the risk of misidentifying drugs. Therefore, the first embodiment identifies the drugs in the drug sentence to be recognized and improves the efficiency and accuracy of drug identification.
  • the method further includes:
  • Outputting the substance entities that are in the chemical substance entity set but not in the therapeutic substance entity set, sending an identification reminder to the user, and receiving the user's determination result. This can help avoid misidentification.
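The entities to be flagged for the user are a set difference. A minimal sketch, with the hypothetical helper name `entities_needing_review`:

```python
def entities_needing_review(chemical_entities, therapeutic_entities):
    """Entities found only by the chemical substance model, sorted for display."""
    return sorted(set(chemical_entities) - set(therapeutic_entities))


to_review = entities_needing_review(
    ["metformin", "metformin", "glucose"], ["metformin"]
)
```

How the reminder is delivered and how the user's determination is stored are not described in the application and are left out here.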
  • the method further includes:
  • Constructing a drug knowledge graph with the recognized drugs. Two drugs appearing in the same drug sentence can be connected in the knowledge graph to reflect the connection between the drugs.
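Connecting drugs that appear in the same sentence can be sketched as building an undirected edge set. The function name `cooccurrence_edges` and the drug names other than metformin are illustrative assumptions.

```python
from itertools import combinations


def cooccurrence_edges(drugs_per_sentence):
    """Return undirected edges between drugs recognized in the same sentence."""
    edges = set()
    for drugs in drugs_per_sentence:
        # Sorting gives each pair one canonical orientation.
        for first, second in combinations(sorted(set(drugs)), 2):
            edges.add((first, second))
    return edges


# Hypothetical recognition results for three drug sentences.
drugs_per_sentence = [
    ["metformin", "insulin"],
    ["metformin"],
    ["insulin", "metformin", "glipizide"],
]
edges = cooccurrence_edges(drugs_per_sentence)
```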
  • Fig. 2 is a structural diagram of a medicine identification device based on machine learning provided in the second embodiment of the present application.
  • the medicine identification device 20 based on machine learning is applied to a computer device.
  • the medicine identification device 20 based on machine learning is used to identify the medicine in the medicine sentence to be identified.
  • the machine learning-based drug identification device 20 may include a first acquisition module 201, a first training module 202, a second acquisition module 203, a second training module 204, a third training module 205, a third acquisition module 206, a first identification module 207, and a second identification module 208.
  • the first obtaining module 201 is configured to obtain a first drug sentence sample set, and each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label.
  • the obtaining a sample set of a first medicine sentence includes:
  • an optical scanner or a digital camera can be used to obtain a book image of a paper version of a medical book; the book image can be binarized, i.e. converted into a black-and-white image by applying a preset binarization threshold; preprocessing such as noise removal and tilt correction can be applied; and text recognition based on a neural network or on distance can then be performed on the preprocessed black-and-white image.
  • a web crawler can be used to crawl electronic medical documents from Chinese journal literature databases (such as Wusing, HowNet) or Baidu Baike with keywords “ingredients”, “Chinese medicine” (or Chinese medicine names), etc.
  • the scanned paper version of the medical book and the grabbed electronic version of the medical document can be segmented, and the sentence can be deduplicated to obtain multiple drug sentences.
  • For example, a drug sentence is "Metformin is a white powder". A word in the drug sentence is randomly selected as the missing word, and the drug sentence with the missing word "<S>Metformin is a white <mask><E>" is obtained, where "<S>" indicates the head of the drug sentence, "<E>" indicates the end of the drug sentence, "<mask>" indicates the missing word of the drug sentence, and "powder" is the missing word label of the drug sentence.
  • the first training module 202 is used to train the coding model with the first sample set of medicine sentences.
  • the coding model may be a BERT model or a word embedding model.
  • training the coding model by using the first sample set of medicine sentences includes:
  • the BERT model and the preset fully connected layer are optimized according to the missing word vector and label of the first medical sentence sample.
  • said generating the input vector sequence of each first medicine sentence sample may include:
  • the words "metformin is a white powder" contained in the first drug sentence sample are obtained.
  • the Stanford word segmentation tool can be used to segment the first drug sentence sample, or the method based on statistics and string matching can be used to segment the first drug sentence sample.
  • the preset word encoding table may adopt one-hot, word2vec, etc. encoding methods, and the encoding vector of each word corresponds to the word one-to-one.
  • For example, the coding vector of a word of a first drug sentence sample is a 10-dimensional vector, the position vector of the word is a 2-dimensional vector, and the coding input vector of the word is the 12-dimensional vector obtained by concatenating the word's coding vector and the word's position vector.
  • the second acquisition module 203 is configured to acquire a second medicine sentence sample set and a third medicine sentence sample set, each second medicine sentence sample in the second medicine sentence sample set contains a chemical substance label, and the third medicine Each third drug sentence sample in the sentence sample set contains a therapeutic substance label.
  • a second drug sentence sample in the second drug sentence sample set is "Metformin is a white water-soluble powder", and the label of the second drug sentence sample is "BH IH IH IH O O O O O O O O O O BH O O O".
  • a third drug sentence sample in the third drug sentence sample set is "Metformin is often used as the main drug in the treatment of diabetes", and the label of the third drug sentence sample is "O O O O O O BZ IZ IZ IZ O O O O O O O O”.
  • Here "O" denotes a non-named entity, BH is the starting label of a chemical substance, IH is the middle label of a chemical substance, BZ is the starting label of a therapeutic substance, and IZ is the middle label of a therapeutic substance.
  • the second training module 204 is configured to use the coding model to extract the vector sequence of the second drug sentence sample, and to train the chemical substance recognition model with the vector sequence of the second drug sentence sample as input, according to the chemical substance label of the second drug sentence sample.
  • the chemical substance identification model includes:
  • the chemical substance recognition model is composed of a long short-term memory (LSTM) network and a conditional random field connected to the LSTM network;
  • the coding model can be used to extract the vector sequence of the second drug sentence sample, and the LSTM network can be used to extract the context and semantic features of the second drug sentence sample to obtain the intermediate vector sequence of the second drug sentence sample;
  • the intermediate vector sequence is used as input and the conditional random field is used to output the chemical substance prediction label of the second drug sentence sample, and the parameters of the LSTM network and the conditional random field are optimized according to the chemical substance label and the chemical substance prediction label.
  • the third training module 205 is configured to use the coding model to extract the vector sequence of the third drug sentence sample, and to train the therapeutic substance recognition model with the vector sequence of the third drug sentence sample as input, according to the therapeutic substance label of the third drug sentence sample.
  • the therapeutic substance identification model includes:
  • the therapeutic substance recognition model is composed of a bidirectional long short-term memory (LSTM) network and a conditional random field connected to the bidirectional LSTM network;
  • the coding model can be used to extract the vector sequence of the third drug sentence sample, and the bidirectional LSTM network can be used to extract the context and semantic features of the third drug sentence sample to obtain the intermediate vector sequence of the third drug sentence sample;
  • the intermediate vector sequence is used as input and the conditional random field is used to output the therapeutic substance prediction label of the third drug sentence sample, and the parameters of the bidirectional LSTM network and the conditional random field are optimized according to the therapeutic substance label and the therapeutic substance prediction label.
  • the third acquiring module 206 is used to acquire the medicine sentence to be identified.
  • the drug sentence to be identified can be "Metformin is a white water-soluble powder. Metformin is often used as the main drug in the treatment of diabetes, and the patient should strictly control the intake of glucose during the medication.”
  • the first recognition module 207 is used to extract the vector sequence of the drug sentence to be recognized using the coding model, to obtain the chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and to obtain the therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model.
  • For example, the coding model is used to extract the vector sequence of the drug sentence to be recognized; the chemical substance entity set obtained by recognizing that vector sequence with the trained chemical substance recognition model is {metformin, metformin, glucose}, and the therapeutic substance entity set obtained by recognizing that vector sequence with the therapeutic substance recognition model is {metformin}.
  • the second identification module 208 is configured to identify the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  • the chemical substance entity set is ⁇ metformin, metformin, glucose ⁇ and the therapeutic substance entity set is ⁇ metformin ⁇ , then the substance entity "metformin" that exists in both the chemical substance entity set and the therapeutic substance entity set is identified as a drug.
  • the machine learning-based medicine recognition device 20 of the second embodiment acquires a first medicine sentence sample set, and each first medicine sentence sample in the first medicine sentence sample set contains a missing word and a missing word label; using the first medicine sentence sample set; A sample set of drug sentences to train the coding model; a second sample set of drug sentences and a third sample set of drug sentences are acquired, each second sample of drug sentences in the second sample set of drug sentences contains a chemical substance label, and the third Each third drug sentence sample in the drug sentence sample set contains a therapeutic substance label; the coding model is used to extract the vector sequence of the second drug sentence sample, and the vector sequence of the second drug sample is used as the input, according to The chemical substance label of the second drug sample trains a chemical substance recognition model; the coding model is used to extract the vector sequence of the third drug sentence sample, and the vector sequence of the third drug sample is used as input to determine the The therapeutic substance label of the third drug sample trains the therapeutic substance recognition model; obtains the drug sentence to be recognized; extracts the vector sequence of the drug sentence to be recognized
  • Using the coding model to extract the vector sequence of the second drug sentence sample and the vector sequence of the third drug sentence sample improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively. Recognizing the chemical substance entity in the drug sentence to be recognized is more stable than recognizing the therapeutic substance entity in the drug sentence to be recognized, and recognizing the substance entity existing in both the chemical substance entity set and the therapeutic substance entity set as a drug , which reduces the risk of misidentifying drugs. Therefore, in the second embodiment, the medicine in the medicine sentence to be identified is identified, which improves the efficiency and accuracy of medicine identification.
  • the machine learning-based drug identification device 20 further includes: a sending module, configured to output the substance entities that exist in the chemical substance entity set but not in the therapeutic substance entity set, and to send an identification reminder.
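The sending module's selection can be illustrated with a set difference. This is a minimal sketch, assuming the two recognition models have already produced their entity sets; the function names `chemical_only` and `build_reminders`, the reminder wording, and the sample entity strings are hypothetical and do not come from the application.

```python
def chemical_only(chemical_entities, therapeutic_entities):
    """Entities in the chemical set that are absent from the therapeutic set."""
    return set(chemical_entities) - set(therapeutic_entities)


def build_reminders(entities):
    """Format one (hypothetical) identification reminder per entity."""
    return [f"Please confirm whether '{name}' is a drug." for name in sorted(entities)]


# Toy entity sets standing in for the outputs of the two recognition models.
pending = chemical_only({"aspirin", "ethanol"}, {"aspirin", "physiotherapy"})
reminders = build_reminders(pending)
```

Here `pending` holds the entities that were recognized as chemical substances only, which are the ones the sending module would output together with a reminder.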
  • the device 20 for recognizing drugs based on machine learning may further include a building module for constructing a drug knowledge graph with the recognized drugs.
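The text states only that a drug knowledge graph is constructed from the recognized drugs; it does not describe the graph's schema. The following is a sketch under stated assumptions: the class `DrugKnowledgeGraph`, the relation name `mentioned_in`, and the sentence identifiers are all hypothetical, chosen to show one simple way recognized drugs could be stored as triples.

```python
from collections import defaultdict


class DrugKnowledgeGraph:
    """Tiny in-memory graph storing (relation, tail) pairs per drug node."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_drug_mention(self, drug, sentence_id):
        # Hypothetical relation: link a recognized drug to its source sentence.
        self._edges[drug].add(("mentioned_in", sentence_id))

    def neighbours(self, drug):
        """Return the outgoing edges of a drug node in sorted order."""
        return sorted(self._edges.get(drug, set()))


graph = DrugKnowledgeGraph()
recognized = [("s1", {"aspirin"}), ("s2", {"aspirin", "ibuprofen"})]
for sentence_id, drugs in recognized:
    for drug in drugs:
        graph.add_drug_mention(drug, sentence_id)
```

A production graph would likely carry richer relations (indications, interactions); those are outside what the text specifies.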
  • This embodiment provides one or more readable storage media storing computer readable instructions.
  • the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media;
  • when the computer-readable instructions are executed by a processor, the steps in the above-mentioned embodiment of the machine learning-based medicine identification method are implemented, for example, steps 101-108 shown in FIG. 1:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
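The three kinds of training sample described above can be pictured as simple records. This is a minimal sketch, assuming whitespace tokenization, a `[MASK]` placeholder for the missing word, the removed word itself as the missing word label, and per-token 0/1 substance labels; none of these encoding details are stated in the text, and the names `MissingWordSample`, `LabeledSample`, and `make_missing_word_sample` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MASK = "[MASK]"  # assumed placeholder token; the source does not name one


@dataclass
class MissingWordSample:
    tokens: List[str]  # sentence with one token replaced by MASK
    position: int      # index of the missing word
    label: str         # the removed word, used as the missing word label


@dataclass
class LabeledSample:
    tokens: List[str]
    labels: List[int]  # 1 where the token is a labeled substance, else 0


def make_missing_word_sample(sentence: str, position: int) -> MissingWordSample:
    """Hide the token at `position` and keep it as the label to predict."""
    tokens = sentence.split()
    label = tokens[position]
    masked = tokens[:position] + [MASK] + tokens[position + 1:]
    return MissingWordSample(masked, position, label)


sentence = "aspirin relieves mild pain"
first = make_missing_word_sample(sentence, 0)         # first sample set
second = LabeledSample(sentence.split(), [1, 0, 0, 0])  # chemical substance labels
third = LabeledSample(sentence.split(), [1, 0, 0, 0])   # therapeutic substance labels
```

The first sample type needs no manual annotation, since its label is derived from the sentence itself, whereas the second and third types carry annotated labels.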
  • alternatively, the function of each module in the above-mentioned device embodiment is realized, for example, the modules 201-208 in FIG. 2:
  • the first obtaining module 201 is configured to obtain a first drug sentence sample set, where each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label;
  • the first training module 202 is configured to train a coding model using the first drug sentence sample set;
  • the second acquisition module 203 is configured to acquire a second drug sentence sample set and a third drug sentence sample set, where each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
  • the second training module 204 is configured to use the coding model to extract the vector sequence of the second drug sentence sample and, with that vector sequence as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
  • the third training module 205 is configured to use the coding model to extract the vector sequence of the third drug sentence sample and, with that vector sequence as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
  • the third obtaining module 206 is configured to obtain the drug sentence to be recognized;
  • the first recognition module 207 is configured to extract the vector sequence of the drug sentence to be recognized using the coding model, obtain the chemical substance entity set by having the chemical substance recognition model recognize that vector sequence, and obtain the therapeutic substance entity set by having the therapeutic substance recognition model recognize the same vector sequence;
  • the second identification module 208 is configured to identify the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
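The recognition-stage modules (206 to 208) can be sketched as a function that encodes the sentence once and passes the same vector sequence to both recognition models. This is an illustrative sketch only: the encoder and the two recognizers below are toy stand-ins (a length-based "encoding" and lexicon lookups) so that the example runs, and the names `recognize_drugs`, `toy_encode`, and `toy_lexicon_model` are hypothetical. The real models would be the ones trained by modules 202, 204, and 205.

```python
from typing import Callable, List, Sequence, Set

Vector = List[float]
Encoder = Callable[[str], List[Vector]]
Recognizer = Callable[[str, Sequence[Vector]], Set[str]]


def recognize_drugs(sentence: str, encode: Encoder,
                    chemical_model: Recognizer,
                    therapeutic_model: Recognizer) -> Set[str]:
    vectors = encode(sentence)                          # module 207: encode once
    chemical = chemical_model(sentence, vectors)        # chemical substance entity set
    therapeutic = therapeutic_model(sentence, vectors)  # therapeutic substance entity set
    return chemical & therapeutic                       # module 208: entities in both sets


def toy_encode(sentence: str) -> List[Vector]:
    """Stand-in encoder: one 1-dimensional vector per whitespace token."""
    return [[float(len(token))] for token in sentence.split()]


def toy_lexicon_model(lexicon: Set[str]) -> Recognizer:
    """Stand-in recognizer: looks tokens up in a lexicon and ignores vector values."""
    def model(sentence: str, vectors: Sequence[Vector]) -> Set[str]:
        tokens = sentence.split()
        assert len(tokens) == len(vectors)  # one vector per token
        return {token for token in tokens if token in lexicon}
    return model


result = recognize_drugs(
    "aspirin and ethanol after physiotherapy",
    toy_encode,
    toy_lexicon_model({"aspirin", "ethanol"}),
    toy_lexicon_model({"aspirin", "physiotherapy"}),
)
```

Passing the models in as callables keeps the wiring independent of how each model is implemented, which mirrors the way the text separates the training modules from the recognition modules.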
  • FIG. 3 is a schematic diagram of a computer device provided in Embodiment 3 of this application.
  • the computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 stored in the memory 301 and running on the processor 302, such as a medicine recognition program based on machine learning.
  • when the processor 302 executes the computer-readable instructions 303, the steps in the embodiment of the above-mentioned machine learning-based medicine identification method are implemented, for example steps 101-108 shown in FIG. 1, as listed above for the storage medium embodiment;
  • alternatively, the function of each module in the above-mentioned device embodiment is realized, for example the modules 201-208 in FIG. 2, whose functions are as listed above.
  • the computer-readable instruction 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method.
  • the one or more modules may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer-readable instructions 303 in the computer device 30.
  • the computer-readable instructions 303 can be divided into the first acquisition module 201, the first training module 202, the second acquisition module 203, the second training module 204, the third training module 205, the third acquisition module 206, the first identification module 207, and the second identification module 208 in FIG. 2; refer to the second embodiment for the specific functions of each module.
  • the schematic diagram in FIG. 3 is only an example of the computer device 30 and does not constitute a limitation on the computer device 30, which may include more or fewer components than those shown in the figure, combine certain components, or use different components.
  • the computer device 30 may also include input and output devices, network access devices, buses, and so on.
  • the processor 302 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor, etc.
  • the processor 302 is the control center of the computer device 30 and uses various interfaces and lines to connect the various parts of the entire computer device 30.
  • the memory 301 can be used to store the computer-readable instructions 303; the processor 302 implements the various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and calling data stored in the memory 301.
  • the memory 301 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the data storage area may store data (such as audio data) created according to the use of the computer device 30.
  • the memory 301 may include a non-volatile memory and/or a volatile memory.
  • the non-volatile memory may include, for example, a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the integrated module of the computer device 30 may be stored in a computer-readable storage medium.
  • all or part of the processes in the methods of the above-mentioned embodiments may also be completed by instructing relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a computer-readable storage medium.
  • when executed by the processor, the computer-readable instructions can implement the steps of the foregoing method embodiments.
  • the computer-readable instruction includes computer-readable instruction code
  • the computer-readable instruction code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
  • the above-mentioned integrated modules implemented in the form of software functional modules may be stored in a computer-readable storage medium.
  • the above-mentioned software functional module is stored in a readable storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute part of the method of the various embodiments of the present application.

Abstract

Disclosed are a machine learning-based drug recognition method and a related device. The method comprises: using the vector sequence of second drug samples as input, training a chemical substance recognition model according to the chemical substance labels of the second drug samples; extracting the vector sequence of third drug samples using a coding model, then using the vector sequence of the third drug samples as input to train a therapeutic substance recognition model according to the therapeutic substance labels of the third drug samples (105); using the coding model to extract the vector sequence of a drug sentence to be recognized, using the chemical substance recognition model to obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized, then using the therapeutic substance recognition model to obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized (107); and determining that a substance entity present in both the chemical substance entity set and the therapeutic substance entity set is a drug (108). The method improves the efficiency and accuracy of drug recognition.
PCT/CN2020/093319 2020-03-04 2020-05-29 Machine learning-based drug recognition method and related device WO2021174695A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010144271.X 2020-03-04
CN202010144271.XA CN111523316A (zh) 2020-03-04 2020-03-04 Machine learning-based drug recognition method and related device

Publications (1)

Publication Number Publication Date
WO2021174695A1 (fr) 2021-09-10

Family

ID=71901988

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093319 WO2021174695A1 (fr) 2020-03-04 2020-05-29 Machine learning-based drug recognition method and related device

Country Status (2)

Country Link
CN (1) CN111523316A (fr)
WO (1) WO2021174695A1 (fr)

Also Published As

Publication number Publication date
CN111523316A (zh) 2020-08-11

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20923181; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20923181; Country of ref document: EP; Kind code of ref document: A1)