WO2021174695A1 - Machine learning-based drug identification method and related device - Google Patents

Machine learning-based drug identification method and related device

Info

Publication number
WO2021174695A1
Authority
WO
WIPO (PCT)
Prior art keywords
drug
sentence
medicine
sample
vector sequence
Prior art date
Application number
PCT/CN2020/093319
Other languages
English (en)
French (fr)
Inventor
顾大中
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021174695A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Definitions

  • This application relates to the field of artificial intelligence entity recognition technology, and in particular to a method, device, computer equipment, and computer-readable storage medium for drug recognition based on machine learning.
  • the first aspect of the present application provides a method for drug identification based on machine learning, the method including:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
  • the coding model is used to extract the vector sequence of the drug sentence to be recognized, the chemical substance recognition model is used to recognize the vector sequence of the drug sentence to be recognized to obtain the chemical substance entity set, and the therapeutic substance recognition model is used to recognize the vector sequence of the drug sentence to be recognized to obtain the therapeutic substance entity set;
  • the substance entities existing in both the chemical substance entity set and the therapeutic substance entity set are identified as drugs.
  • a second aspect of the present application provides a medicine identification device based on machine learning, the device comprising:
  • the first acquisition module is configured to acquire a first drug sentence sample set, where each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
  • the first training module is used to train a coding model with the first sample set of medicine sentences
  • the second acquisition module is configured to acquire a second drug sentence sample set and a third drug sentence sample set, where each second drug sentence sample in the second drug sentence sample set contains a chemical substance label and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
  • the second training module is used to extract the vector sequence of the second drug sentence sample using the coding model and, with that vector sequence as input, to train the chemical substance identification model according to the chemical substance label of the second drug sentence sample;
  • the third training module is used to extract the vector sequence of the third drug sentence sample using the coding model and, with that vector sequence as input, to train the therapeutic substance identification model according to the therapeutic substance label of the third drug sentence sample;
  • the third acquisition module is used to acquire the drug sentence to be identified
  • the first recognition module is used to extract the vector sequence of the drug sentence to be recognized using the coding model, to obtain the chemical substance entity set by recognizing that vector sequence with the chemical substance recognition model, and to obtain the therapeutic substance entity set by recognizing that vector sequence with the therapeutic substance recognition model;
  • the second identification module is used to identify the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  • a third aspect of the present application provides a computer device.
  • the computer device includes a memory and a processor, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
  • the coding model is used to extract the vector sequence of the drug sentence to be recognized, the chemical substance recognition model is used to recognize the vector sequence of the drug sentence to be recognized to obtain the chemical substance entity set, and the therapeutic substance recognition model is used to recognize the vector sequence of the drug sentence to be recognized to obtain the therapeutic substance entity set;
  • the substance entities existing in both the chemical substance entity set and the therapeutic substance entity set are identified as drugs.
  • the fourth aspect of the present application provides one or more readable storage media storing computer readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
  • the coding model is used to extract the vector sequence of the drug sentence to be recognized, the chemical substance recognition model is used to recognize the vector sequence of the drug sentence to be recognized to obtain the chemical substance entity set, and the therapeutic substance recognition model is used to recognize the vector sequence of the drug sentence to be recognized to obtain the therapeutic substance entity set;
  • the substance entities existing in both the chemical substance entity set and the therapeutic substance entity set are identified as drugs.
  • This application uses the coding model to extract the vector sequence of the second drug sentence sample and the vector sequence of the third drug sentence sample, which improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively. Recognizing chemical substance entities in the drug sentence to be recognized is more stable than recognizing therapeutic substance entities, and identifying as a drug only a substance entity that exists in both the chemical substance entity set and the therapeutic substance entity set reduces the risk of misidentifying drugs. Therefore, the present application identifies the medicine in the medicine sentence to be identified and improves the efficiency and accuracy of medicine identification.
  • the details of one or more embodiments of the present application are presented in the following drawings and description, and other features and advantages of the present application will become apparent from the description, drawings and claims.
  • Fig. 1 is a flowchart of a method for medicine identification based on machine learning provided by an embodiment of the present application.
  • Fig. 2 is a structural diagram of a medicine identification device based on machine learning provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the machine learning-based drug identification method of the present application is applied in one or more computer devices.
  • the computer device is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • FIG. 1 is a flowchart of a method for medicine identification based on machine learning provided in Embodiment 1 of the present application.
  • the medicine identification method based on machine learning is applied to a computer device for identifying medicines in medicine sentences to be identified.
  • the machine learning-based drug identification method includes:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label.
  • the obtaining a sample set of a first medicine sentence includes:
  • an optical scanner or a digital camera can be used to obtain a book image of a paper version of a medical book; the book image can be binarized, that is, converted into a black-and-white image by applying a preset binarization threshold; preprocessing such as noise removal and tilt correction can then be applied to the black-and-white image; finally, neural-network-based or distance-based text recognition is performed on the preprocessed black-and-white image.
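The binarization step described above can be sketched as follows. This is a minimal illustration only: the image is assumed to be a nested list of grayscale values in the range 0 to 255, and the threshold value 128 and the function name `binarize` are illustrative assumptions, not values given in the application.

```python
def binarize(gray_image, threshold=128):
    """Map each grayscale pixel to white (255) if it reaches the threshold, else black (0)."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in gray_image]


# A tiny made-up 2x3 "book image" used only to exercise the function.
page = [[12, 200, 130], [127, 128, 255]]
binary = binarize(page, threshold=128)
print(binary)
```

A real pipeline would normally read the image with an imaging library and follow binarization with denoising and deskewing before text recognition.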
  • a web crawler can be used to crawl electronic medical documents from Chinese journal literature databases (such as Wanfang and CNKI) or from Baidu Baike, using keywords such as "ingredients" and "Chinese medicine" (or the names of Chinese medicines).
  • the scanned paper version of the medical book and the grabbed electronic version of the medical document can be segmented, and the sentence can be deduplicated to obtain multiple drug sentences.
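As a rough sketch of the segmentation and deduplication step, the text can be split at sentence-ending punctuation and repeated sentences dropped while preserving order. The punctuation set, the function names, and the sample corpus below are assumptions made for illustration; the application does not specify how sentences are split.

```python
import re


def split_sentences(text):
    """Split text after Chinese or Western sentence-ending punctuation."""
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [part.strip() for part in parts if part.strip()]


def deduplicate(sentences):
    """Drop repeated sentences while keeping the first-seen order."""
    return list(dict.fromkeys(sentences))


corpus = (
    "Metformin is a white powder. Metformin treats diabetes. "
    "Metformin is a white powder."
)
drug_sentences = deduplicate(split_sentences(corpus))
print(drug_sentences)
```

Splitting on every period is naive (it would also split decimal numbers), so a production system would likely need a more careful sentence splitter.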
  • for example, a drug sentence is "Metformin is a white powder"; a word in the drug sentence is randomly selected as the missing word, giving the drug sentence with the missing word "<S>Metformin is a white <mask><E>", where "<S>" indicates the head of the drug sentence, "<E>" indicates the end of the drug sentence, "<mask>" marks the position of the missing word, and "powder" is the missing-word label of the drug sentence.
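The construction of a sample with a missing word can be sketched as below. The function name `make_masked_sample` and the word-level tokenization of the English example are assumptions for illustration; the `<S>`, `<mask>`, and `<E>` markers follow the example in the text.

```python
import random


def make_masked_sample(words, rng):
    """Replace one randomly chosen word with <mask>; return (tokens, missing-word label)."""
    index = rng.randrange(len(words))
    tokens = ["<S>"] + words[:index] + ["<mask>"] + words[index + 1:] + ["<E>"]
    return tokens, words[index]


words = ["Metformin", "is", "a", "white", "powder"]
tokens, label = make_masked_sample(words, random.Random(0))
print(tokens, label)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when rebuilding the same training set.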
  • the coding model may be a BERT model or a word embedding model.
  • training the coding model by using the first sample set of medicine sentences includes:
  • the BERT model and the preset fully connected layer are optimized according to the missing word vector and label of the first medical sentence sample.
  • said generating the input vector sequence of each first medicine sentence sample may include:
  • the words "metformin is a white powder" contained in the first drug sentence sample are obtained.
  • the Stanford word segmentation tool can be used to segment the first drug sentence sample, or the method based on statistics and string matching can be used to segment the first drug sentence sample.
  • the preset word encoding table may adopt one-hot, word2vec, etc. encoding methods, and the encoding vector of each word corresponds to the word one-to-one.
  • for example, if the coding vector of a word of a first medicine sentence sample is a 10-dimensional vector and its position vector is a 2-dimensional vector, then the coding input vector of the word is the 12-dimensional vector formed by concatenating the word's coding vector and the word's position vector.
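The 10 + 2 = 12 dimensional concatenation in the example can be written out directly. The concrete vector values below are made up for illustration, and `coding_input_vector` is a hypothetical helper name.

```python
def coding_input_vector(coding_vector, position_vector):
    """Concatenate a word's coding vector with its position vector."""
    return list(coding_vector) + list(position_vector)


# A 10-dimensional one-hot coding vector (illustrative values).
coding_vector = [0.0] * 10
coding_vector[3] = 1.0
# A 2-dimensional position vector (illustrative values).
position_vector = [0.0, 1.0]

input_vector = coding_input_vector(coding_vector, position_vector)
print(len(input_vector))
```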
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label.
  • a second drug sentence sample in the second drug sentence sample set is "Metformin is a white water-soluble powder", and the label of the second drug sentence sample is "BH IH IH IH O O O O O O O O O O BH O O O".
  • a third drug sentence sample in the third drug sentence sample set is "Metformin is often used as the main drug in the treatment of diabetes", and the label of the third drug sentence sample is "O O O O O O BZ IZ IZ IZ O O O O O O O O".
  • here "O" denotes a non-named entity, "BH" is the starting label of a chemical substance, "IH" is the middle label of a chemical substance, "BZ" is the starting label of a therapeutic substance, and "IZ" is the middle label of a therapeutic substance.
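To illustrate how such label sequences map back to entities, the sketch below collects each run that starts with a begin label and continues with middle labels. The function name and the sub-word tokenization of "Metformin" are assumptions for illustration; the label names come from the text.

```python
def extract_entities(tokens, labels, begin, inside):
    """Collect entities marked by a begin label followed by zero or more inside labels."""
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label == begin:
            if current:
                entities.append("".join(current))
            current = [token]
        elif label == inside and current:
            current.append(token)
        else:
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities


# Hypothetical tokenization: the entity is split into three pieces.
tokens = ["Met", "for", "min", "is", "a", "white", "powder"]
labels = ["BH", "IH", "IH", "O", "O", "O", "O"]
print(extract_entities(tokens, labels, begin="BH", inside="IH"))
```

The same function handles therapeutic substances by passing `begin="BZ"` and `inside="IZ"`.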
  • the chemical substance identification model includes:
  • the chemical substance recognition model is composed of a long short-term memory (LSTM) network and a conditional random field (CRF) connected to the LSTM network;
  • the coding model is used to extract the vector sequence of the second drug sentence sample, and the LSTM network is used to extract the context and semantic features of the second drug sentence sample to obtain the intermediate vector sequence of the second drug sentence sample;
  • with the intermediate vector sequence as input, the conditional random field outputs the chemical substance prediction label of the second drug sentence sample, and the parameters of the LSTM network and the conditional random field are optimized according to the chemical substance label and the chemical substance prediction label.
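The application states only that a conditional random field outputs the prediction labels. One common way a linear-chain CRF produces its output is Viterbi decoding over per-position label scores and label-to-label transition scores; the sketch below shows that technique under this assumption. All scores are made up, and in a real model the emission scores would come from the LSTM network.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions: list of {label: score}, one dict per position.
    transitions: {(previous_label, current_label): score}.
    """
    best = {label: emissions[0][label] for label in labels}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[(p, cur)])
            new_best[cur] = best[prev] + transitions[(prev, cur)] + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    path = [max(labels, key=lambda label: best[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "BH", "IH"]
# Neutral transitions, except that a middle label may not follow "O".
transitions = {(p, c): 0.0 for p in labels for c in labels}
transitions[("O", "IH")] = -10.0
emissions = [
    {"O": 1.0, "BH": 2.0, "IH": 0.0},
    {"O": 1.0, "BH": 0.0, "IH": 2.0},
    {"O": 2.0, "BH": 0.0, "IH": 1.5},
]
print(viterbi(emissions, transitions, labels))
```

The transition penalty is what lets the CRF reject label sequences that are locally plausible but structurally invalid, such as a middle label with no preceding start label.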
  • the therapeutic substance identification model includes:
  • the therapeutic substance identification model is composed of a bidirectional long short-term memory (BiLSTM) network and a conditional random field connected to the BiLSTM network;
  • the coding model is used to extract the vector sequence of the third drug sentence sample, and the BiLSTM network is used to extract the context and semantic features of the third drug sentence sample to obtain the intermediate vector sequence of the third drug sentence sample;
  • with the intermediate vector sequence as input, the conditional random field outputs the therapeutic substance prediction label of the third drug sentence sample, and the parameters of the BiLSTM network and the conditional random field are optimized according to the therapeutic substance label and the therapeutic substance prediction label.
  • the drug sentence to be identified can be "Metformin is a white water-soluble powder. Metformin is often used as the main drug in the treatment of diabetes, and the patient should strictly control the intake of glucose during the medication.”
  • the coding model is used to extract the vector sequence of the drug sentence to be recognized; the chemical substance entity set obtained by the trained chemical substance recognition model from that vector sequence is {metformin, metformin, glucose}, and the therapeutic substance entity set obtained by the therapeutic substance recognition model from the same vector sequence is {metformin}.
  • if the chemical substance entity set is {metformin, metformin, glucose} and the therapeutic substance entity set is {metformin}, then the substance entity "metformin", which exists in both the chemical substance entity set and the therapeutic substance entity set, is identified as a drug.
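The final identification step amounts to a set intersection. A minimal sketch using the example entity sets from the text (the function name `identify_drugs` is an assumption):

```python
def identify_drugs(chemical_entities, therapeutic_entities):
    """Return the entities found by both recognition models, without duplicates."""
    return sorted(set(chemical_entities) & set(therapeutic_entities))


chemical_entities = ["metformin", "metformin", "glucose"]
therapeutic_entities = ["metformin"]
drugs = identify_drugs(chemical_entities, therapeutic_entities)
print(drugs)
```

Here "glucose" is dropped because only the chemical substance model reported it, which is how the method avoids labelling every chemical as a drug.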
  • the machine learning-based medicine identification method of the first embodiment acquires a first drug sentence sample set, where each first drug sentence sample contains a missing word and a missing-word label; trains the coding model with the first drug sentence sample set; acquires a second drug sentence sample set and a third drug sentence sample set, where each second drug sentence sample contains a chemical substance label and each third drug sentence sample contains a therapeutic substance label; uses the coding model to extract the vector sequence of the second drug sentence sample and, with that vector sequence as input, trains the chemical substance recognition model according to the chemical substance label of the second drug sentence sample; uses the coding model to extract the vector sequence of the third drug sentence sample and, with that vector sequence as input, trains the therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample; acquires the drug sentence to be recognized; uses the coding model to extract the vector sequence of the drug sentence to be recognized, uses the chemical substance recognition model to recognize that vector sequence to obtain a chemical substance entity set, and uses the therapeutic substance recognition model to recognize that vector sequence to obtain a therapeutic substance entity set; and identifies the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  • Using the coding model to extract the vector sequence of the second drug sentence sample and the vector sequence of the third drug sentence sample improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively.
  • Recognizing chemical substance entities in the drug sentence to be recognized is more stable than recognizing therapeutic substance entities, and identifying as a drug only a substance entity that exists in both the chemical substance entity set and the therapeutic substance entity set reduces the risk of misidentifying drugs. Therefore, the first embodiment identifies the medicine in the medicine sentence to be identified and improves the efficiency and accuracy of medicine identification.
  • the method further includes:
  • outputting the substance entities that are in the chemical substance entity set but not in the therapeutic substance entity set, sending an identification reminder to the user, and receiving the user's determination result, which helps avoid misidentification.
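The entities flagged for user review are those reported only by the chemical substance model, that is, a set difference. In the sketch below, the `confirm` callback stands in for the user's determination and is a hypothetical interface, not something the application defines.

```python
def entities_needing_review(chemical_entities, therapeutic_entities):
    """Entities found only by the chemical substance model."""
    return sorted(set(chemical_entities) - set(therapeutic_entities))


def review(candidates, confirm):
    """Ask the user (via a callback) whether each candidate is actually a drug."""
    return [entity for entity in candidates if confirm(entity)]


candidates = entities_needing_review(["metformin", "metformin", "glucose"], ["metformin"])
# A stand-in user decision: nothing in the review list is accepted as a drug.
confirmed = review(candidates, confirm=lambda entity: False)
print(candidates, confirmed)
```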
  • the method further includes:
  • Two drugs appearing in a drug sentence can be connected in the knowledge graph to reflect the connection between drugs.
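One simple way to represent such connections is an undirected co-occurrence graph in which two drugs are linked whenever they appear in the same sentence. The sketch below assumes this representation; the function name and the drug lists are made-up illustration data, not results from the application.

```python
from collections import defaultdict
from itertools import combinations


def build_drug_graph(drugs_per_sentence):
    """Link every pair of drugs that appear together in one sentence."""
    graph = defaultdict(set)
    for drugs in drugs_per_sentence:
        for first, second in combinations(sorted(set(drugs)), 2):
            graph[first].add(second)
            graph[second].add(first)
    return graph


# Hypothetical recognized drugs, one list per drug sentence.
drugs_per_sentence = [
    ["metformin", "insulin"],
    ["metformin"],
    ["insulin", "glipizide", "metformin"],
]
graph = build_drug_graph(drugs_per_sentence)
print(sorted(graph["metformin"]))
```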
  • Fig. 2 is a structural diagram of a medicine identification device based on machine learning provided in the second embodiment of the present application.
  • the medicine identification device 20 based on machine learning is applied to a computer device.
  • the medicine identification device 20 based on machine learning is used to identify the medicine in the medicine sentence to be identified.
  • the machine learning-based drug identification device 20 may include a first acquisition module 201, a first training module 202, a second acquisition module 203, a second training module 204, a third training module 205, a third acquisition module 206, a first identification module 207, and a second identification module 208.
  • the first obtaining module 201 is configured to obtain a first drug sentence sample set, and each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label.
  • the obtaining a sample set of a first medicine sentence includes:
  • an optical scanner or a digital camera can be used to obtain a book image of a paper version of a medical book; the book image can be binarized, that is, converted into a black-and-white image by applying a preset binarization threshold; preprocessing such as noise removal and tilt correction can then be applied to the black-and-white image; finally, neural-network-based or distance-based text recognition is performed on the preprocessed black-and-white image.
  • a web crawler can be used to crawl electronic medical documents from Chinese journal literature databases (such as Wanfang and CNKI) or from Baidu Baike, using keywords such as "ingredients" and "Chinese medicine" (or the names of Chinese medicines).
  • the scanned paper version of the medical book and the grabbed electronic version of the medical document can be segmented, and the sentence can be deduplicated to obtain multiple drug sentences.
  • for example, a drug sentence is "Metformin is a white powder"; a word in the drug sentence is randomly selected as the missing word, giving the drug sentence with the missing word "<S>Metformin is a white <mask><E>", where "<S>" indicates the head of the drug sentence, "<E>" indicates the end of the drug sentence, "<mask>" marks the position of the missing word, and "powder" is the missing-word label of the drug sentence.
  • the first training module 202 is used to train the coding model with the first sample set of medicine sentences.
  • the coding model may be a BERT model or a word embedding model.
  • training the coding model by using the first sample set of medicine sentences includes:
  • the BERT model and the preset fully connected layer are optimized according to the missing word vector and label of the first medical sentence sample.
  • said generating the input vector sequence of each first medicine sentence sample may include:
  • the words "metformin is a white powder" contained in the first drug sentence sample are obtained.
  • the Stanford word segmentation tool can be used to segment the first drug sentence sample, or the method based on statistics and string matching can be used to segment the first drug sentence sample.
  • the preset word encoding table may adopt one-hot, word2vec, etc. encoding methods, and the encoding vector of each word corresponds to the word one-to-one.
  • for example, if the coding vector of a word of a first medicine sentence sample is a 10-dimensional vector and its position vector is a 2-dimensional vector, then the coding input vector of the word is the 12-dimensional vector formed by concatenating the word's coding vector and the word's position vector.
  • the second acquisition module 203 is configured to acquire a second medicine sentence sample set and a third medicine sentence sample set, each second medicine sentence sample in the second medicine sentence sample set contains a chemical substance label, and the third medicine Each third drug sentence sample in the sentence sample set contains a therapeutic substance label.
  • a second drug sentence sample in the second drug sentence sample set is "Metformin is a white water-soluble powder", and the label of the second drug sentence sample is "BH IH IH IH O O O O O O O O O O BH O O O".
  • a third drug sentence sample in the third drug sentence sample set is "Metformin is often used as the main drug in the treatment of diabetes", and the label of the third drug sentence sample is "O O O O O O BZ IZ IZ IZ O O O O O O O O".
  • here "O" denotes a non-named entity, "BH" is the starting label of a chemical substance, "IH" is the middle label of a chemical substance, "BZ" is the starting label of a therapeutic substance, and "IZ" is the middle label of a therapeutic substance.
  • the second training module 204 is configured to use the coding model to extract the vector sequence of the second drug sentence sample and, with that vector sequence as input, to train the chemical substance recognition model according to the chemical substance label of the second drug sentence sample.
  • the chemical substance identification model includes:
  • the chemical substance recognition model is composed of a long short-term memory (LSTM) network and a conditional random field (CRF) connected to the LSTM network;
  • the coding model is used to extract the vector sequence of the second drug sentence sample, and the LSTM network is used to extract the context and semantic features of the second drug sentence sample to obtain the intermediate vector sequence of the second drug sentence sample;
  • with the intermediate vector sequence as input, the conditional random field outputs the chemical substance prediction label of the second drug sentence sample, and the parameters of the LSTM network and the conditional random field are optimized according to the chemical substance label and the chemical substance prediction label.
  • the third training module 205 is configured to use the coding model to extract the vector sequence of the third drug sentence sample and, with that vector sequence as input, to train the therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample.
  • the therapeutic substance identification model includes:
  • the therapeutic substance identification model is composed of a bidirectional long short-term memory (BiLSTM) network and a conditional random field connected to the BiLSTM network;
  • the coding model is used to extract the vector sequence of the third drug sentence sample, and the BiLSTM network is used to extract the context and semantic features of the third drug sentence sample to obtain the intermediate vector sequence of the third drug sentence sample;
  • with the intermediate vector sequence as input, the conditional random field outputs the therapeutic substance prediction label of the third drug sentence sample, and the parameters of the BiLSTM network and the conditional random field are optimized according to the therapeutic substance label and the therapeutic substance prediction label.
  • the third acquiring module 206 is used to acquire the medicine sentence to be identified.
  • the drug sentence to be identified can be "Metformin is a white water-soluble powder. Metformin is often used as the main drug in the treatment of diabetes, and the patient should strictly control the intake of glucose during the medication.”
  • the first recognition module 207 is used to extract the vector sequence of the drug sentence to be recognized using the coding model, to obtain the chemical substance entity set by recognizing that vector sequence with the chemical substance recognition model, and to obtain the therapeutic substance entity set by recognizing that vector sequence with the therapeutic substance recognition model.
  • for example, the coding model is used to extract the vector sequence of the drug sentence to be recognized; the chemical substance entity set obtained by the trained chemical substance recognition model from that vector sequence is {metformin, metformin, glucose}, and the therapeutic substance entity set obtained by the therapeutic substance recognition model from the same vector sequence is {metformin}.
  • the second identification module 208 is configured to identify the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  • if the chemical substance entity set is {metformin, metformin, glucose} and the therapeutic substance entity set is {metformin}, then the substance entity "metformin", which exists in both the chemical substance entity set and the therapeutic substance entity set, is identified as a drug.
  • the machine learning-based medicine recognition device 20 of the second embodiment acquires a first drug sentence sample set, where each first drug sentence sample contains a missing word and a missing-word label; trains the coding model with the first drug sentence sample set; acquires a second drug sentence sample set and a third drug sentence sample set, where each second drug sentence sample contains a chemical substance label and each third drug sentence sample contains a therapeutic substance label; uses the coding model to extract the vector sequence of the second drug sentence sample and, with that vector sequence as input, trains the chemical substance recognition model according to the chemical substance label of the second drug sentence sample; uses the coding model to extract the vector sequence of the third drug sentence sample and, with that vector sequence as input, trains the therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample; obtains the drug sentence to be recognized; extracts the vector sequence of the drug sentence to be recognized
  • Using the coding model to extract the vector sequence of the second drug sentence sample and the vector sequence of the third drug sentence sample improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively. Recognizing chemical substance entities in the drug sentence to be recognized is more stable than recognizing therapeutic substance entities, and identifying as a drug only a substance entity that exists in both the chemical substance entity set and the therapeutic substance entity set reduces the risk of misidentifying drugs. Therefore, the second embodiment identifies the medicine in the medicine sentence to be identified and improves the efficiency and accuracy of medicine identification.
  • the machine learning-based drug identification device 20 further includes: a sending module, configured to output the substance entities that are concentrated in the chemical substance entities and that do not exist in the therapeutic substance entities; and send an identification reminder.
  • the device 20 for recognizing drugs based on machine learning may further include a building module for constructing a drug knowledge graph with the recognized drugs.
  • This embodiment provides one or more readable storage media storing computer readable instructions.
  • the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media;
  • When the computer-readable instructions are executed by one or more processors, the steps in the above-mentioned embodiment of the machine learning-based drug recognition method are implemented, for example, steps 101-108 shown in FIG. 1:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
  • Alternatively, when the computer-readable instructions are executed by a processor, the functions of each module in the above-mentioned device embodiment are realized, for example, the modules 201-208 in FIG. 2:
  • the first obtaining module 201 is configured to obtain a first drug sentence sample set, where each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label;
  • the first training module 202 is configured to train a coding model by using the first sample set of medicine sentences
  • the second acquisition module 203 is configured to acquire a second medicine sentence sample set and a third medicine sentence sample set, each second medicine sentence sample in the second medicine sentence sample set contains a chemical substance label, and the third medicine Each third drug sentence sample in the sentence sample set contains a therapeutic substance label;
  • the second training module 204 is configured to use the coding model to extract the vector sequence of the second drug sentence sample and, taking that vector sequence as input, to train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
  • the third training module 205 is configured to use the coding model to extract the vector sequence of the third drug sentence sample and, taking that vector sequence as input, to train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
  • the third obtaining module 206 is used to obtain the drug sentence to be identified
  • the first recognition module 207 is used to extract the vector sequence of the drug sentence to be recognized using the coding model, to obtain the chemical substance entity set by recognizing that vector sequence with the chemical substance recognition model, and to obtain the therapeutic substance entity set by recognizing that vector sequence with the therapeutic substance recognition model;
  • the second identification module 208 is configured to identify the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  • FIG. 3 is a schematic diagram of a computer device provided in Embodiment 3 of this application.
  • the computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 stored in the memory 301 and running on the processor 302, such as a medicine recognition program based on machine learning.
  • When the processor 302 executes the computer-readable instructions 303, the steps in the embodiment of the above-mentioned machine learning-based drug recognition method are implemented, for example, steps 101-108 shown in FIG. 1:
  • each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label
  • each second drug sentence sample in the second drug sentence sample set contains a chemical substance label
  • each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label
  • Alternatively, when the computer-readable instructions are executed by the processor, the functions of each module in the above-mentioned device embodiment are realized, for example, the modules 201-208 in FIG. 2:
  • the first obtaining module 201 is configured to obtain a first drug sentence sample set, where each first drug sentence sample in the first drug sentence sample set includes a missing word and a missing word label;
  • the first training module 202 is configured to train a coding model by using the first sample set of medicine sentences
  • the second acquisition module 203 is configured to acquire a second medicine sentence sample set and a third medicine sentence sample set, each second medicine sentence sample in the second medicine sentence sample set contains a chemical substance label, and the third medicine Each third drug sentence sample in the sentence sample set contains a therapeutic substance label;
  • the second training module 204 is configured to use the coding model to extract the vector sequence of the second drug sentence sample and, taking that vector sequence as input, to train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
  • the third training module 205 is configured to use the coding model to extract the vector sequence of the third drug sentence sample and, taking that vector sequence as input, to train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
  • the third obtaining module 206 is used to obtain the drug sentence to be identified
  • the first recognition module 207 is used to extract the vector sequence of the drug sentence to be recognized using the coding model, to obtain the chemical substance entity set by recognizing that vector sequence with the chemical substance recognition model, and to obtain the therapeutic substance entity set by recognizing that vector sequence with the therapeutic substance recognition model;
  • the second identification module 208 is configured to identify the substance entities that exist in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  • The computer-readable instructions 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method.
  • The one or more modules may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer-readable instructions 303 in the computer device 30.
  • For example, the computer-readable instructions 303 can be divided into the first acquisition module 201, the first training module 202, the second acquisition module 203, the second training module 204, the third training module 205, the third acquisition module 206, the first identification module 207, and the second identification module 208 in FIG. 2; refer to the second embodiment for the specific functions of each module.
  • FIG. 3 is only an example of the computer device 30 and does not constitute a limitation on the computer device 30, which may include more or fewer components than those shown in the figure, combine certain components, or use different components.
  • For example, the computer device 30 may also include input and output devices, network access devices, buses, and so on.
  • The processor 302 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • The general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor.
  • The processor 302 is the control center of the computer device 30 and connects the various parts of the entire computer device 30 through various interfaces and lines.
  • The memory 301 can be used to store the computer-readable instructions 303; the processor 302 implements the various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and calling data stored in the memory 301.
  • The memory 301 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data (such as audio data) created through use of the computer device 30.
  • The memory 301 may include a non-volatile memory and/or a volatile memory.
  • The non-volatile memory may include, for example, a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the integrated module of the computer device 30 may be stored in a computer-readable storage medium.
  • Based on this understanding, all or part of the processes in the above method embodiments may also be completed by instructing the relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium.
  • When executed by the processor, the computer-readable instructions can implement the steps of the foregoing method embodiments.
  • the computer-readable instruction includes computer-readable instruction code
  • the computer-readable instruction code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory).
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
  • the above-mentioned integrated modules implemented in the form of software functional modules may be stored in a computer-readable storage medium.
  • The above-mentioned software functional module is stored in a readable storage medium and includes several instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods described in the embodiments of the present application.

Abstract

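A machine learning-based drug recognition method and related device. The method takes the vector sequence of a second drug sample as input to train a chemical substance recognition model according to the chemical substance label of the second drug sample; extracts the vector sequence of a third drug sentence sample with a coding model and, taking the vector sequence of the third drug sample as input, trains a therapeutic substance recognition model according to the therapeutic substance label of the third drug sample (105); extracts the vector sequence of a drug sentence to be recognized with the coding model, the chemical substance recognition model obtaining a chemical substance entity set by recognizing that vector sequence and the therapeutic substance recognition model obtaining a therapeutic substance entity set by recognizing that vector sequence (107); and determines the substance entities present in both the chemical substance entity set and the therapeutic substance entity set to be drugs (108). The method improves the efficiency and accuracy of drug recognition.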

Description

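Machine learning-based drug recognition method and related device
This application claims priority to Chinese patent application No. 202010144271.X, entitled "Machine learning-based drug recognition method and related device", filed with the China Patent Office on March 4, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of entity recognition in artificial intelligence, and in particular to a machine learning-based drug recognition method, apparatus, computer device, and computer-readable storage medium.
Background
For many medical texts, extracting the drug names they contain is of great help in understanding the content. The inventor realized that, to help practitioners and researchers obtain drug names from medical texts quickly and efficiently, there is an urgent need to recognize drug named entities and to extract them effectively from large volumes of medical text.
In practice, named entity recognition has not yet been applied to drug named entities. Drugs are currently still catalogued manually, which is inefficient and not very accurate.
Summary
In view of the above, it is necessary to provide a machine learning-based drug recognition method, apparatus, computer device, and computer-readable storage medium that can recognize the drugs in a drug sentence to be recognized.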
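A first aspect of this application provides a machine learning-based drug recognition method, the method comprising:
acquiring a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
training a coding model with the first drug sentence sample set;
acquiring a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
extracting a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, training a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
extracting a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, training a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
acquiring a drug sentence to be recognized;
extracting a vector sequence of the drug sentence to be recognized with the coding model, obtaining a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtaining a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
recognizing the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.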
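A second aspect of this application provides a machine learning-based drug recognition device, the device comprising:
a first acquisition module, configured to acquire a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
a first training module, configured to train a coding model with the first drug sentence sample set;
a second acquisition module, configured to acquire a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
a second training module, configured to extract a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
a third training module, configured to extract a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
a third acquisition module, configured to acquire a drug sentence to be recognized;
a first recognition module, configured to extract a vector sequence of the drug sentence to be recognized with the coding model, obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
a second recognition module, configured to recognize the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.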
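A third aspect of this application provides a computer device comprising a memory and a processor, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:
acquiring a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
training a coding model with the first drug sentence sample set;
acquiring a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
extracting a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, training a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
extracting a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, training a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
acquiring a drug sentence to be recognized;
extracting a vector sequence of the drug sentence to be recognized with the coding model, obtaining a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtaining a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
recognizing the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.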
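A fourth aspect of this application provides one or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
training a coding model with the first drug sentence sample set;
acquiring a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
extracting a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, training a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
extracting a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, training a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
acquiring a drug sentence to be recognized;
extracting a vector sequence of the drug sentence to be recognized with the coding model, obtaining a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtaining a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
recognizing the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.
In this application, extracting the vector sequences of the second drug sentence sample and the third drug sentence sample with the coding model improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively. Recognizing chemical substance entities in the drug sentence to be recognized is more stable than recognizing therapeutic substance entities, and recognizing only the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs reduces the risk of misrecognizing drugs. This application therefore recognizes the drugs in the drug sentence to be recognized and improves the efficiency and accuracy of drug recognition. The details of one or more embodiments of this application are set forth in the accompanying drawings and the description below; other features and advantages of this application will become apparent from the description, the drawings, and the claims.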
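Brief Description of the Drawings
FIG. 1 is a flowchart of the machine learning-based drug recognition method provided by an embodiment of this application.
FIG. 2 is a structural diagram of the machine learning-based drug recognition device provided by an embodiment of this application.
FIG. 3 is a schematic diagram of the computer device provided by an embodiment of this application.
Detailed Description
To make the above objects, features, and advantages of this application easier to understand, this application is described in detail below with reference to the drawings and specific embodiments. It should be noted that, where there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to allow a full understanding of this application. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the technical field of this application. The terms used in the description of this application are only for the purpose of describing specific embodiments and are not intended to limit this application.
This application relates to the field of artificial intelligence. Preferably, the machine learning-based drug recognition method of this application is applied in one or more computer devices. A computer device is a device that can automatically perform numerical computation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touchpad, a voice-control device, or the like.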
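Embodiment 1
FIG. 1 is a flowchart of the machine learning-based drug recognition method provided by Embodiment 1 of this application. The method is applied to a computer device and is used to recognize the drugs in a drug sentence to be recognized.
As shown in FIG. 1, the machine learning-based drug recognition method includes:
101: Acquire a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label.
In a specific embodiment, acquiring the first drug sentence sample set includes:
(1) Scan and recognize paper medical books by optical character recognition (OCR).
For example, a book image of a paper medical book can be acquired with an optical scanner or a digital camera; the book image is binarized, using a preset binarization threshold, into a black-and-white image; the black-and-white image is preprocessed by denoising, skew correction, and the like; and text recognition based on a neural network or on distance is performed on the preprocessed black-and-white image.
(2) Fetch electronic medical documents from the web with a web crawler.
For example, a web crawler can fetch electronic medical documents from Chinese journal literature databases (such as Wanfang and CNKI) or from Baidu Baike using keywords such as "成分" (ingredient) and "中药" (traditional Chinese medicine), or the names of traditional Chinese medicines.
(3) Extract multiple drug sentences from the scanned paper medical books and the fetched electronic medical documents.
For example, the scanned paper medical books and the fetched electronic medical documents can be split into sentences and the sentences deduplicated to obtain multiple drug sentences.
(4) Clean and preprocess the multiple drug sentences.
For example, the extracted drug sentences can be cleaned by correcting typos, filtering out irrelevant sentences, and the like.
(5) Determine the missing word and the missing word label of each of the multiple drug sentences.
For example, given the drug sentence "二甲双胍是一种白色粉末" ("Metformin is a white powder"), one word in the sentence (say, "粉末", powder) is chosen at random as the missing word, giving the masked drug sentence "<S>二甲双胍是一种白色<mask><E>", where "<S>" marks the head of the sentence, "<E>" marks its tail, "<mask>" stands for the missing word, and "粉末" is the missing word label of the sentence.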
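Step (5) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `make_masked_sample`, the optional `index` argument, and the assumption that the sentence has already been segmented into words are not part of the application.

```python
import random

def make_masked_sample(words, index=None, rng=random):
    """Replace one word with <mask> and return (masked_sentence, missing_word_label).

    `words` is an already segmented sentence. If `index` is None, the missing
    word is chosen at random, as in the description; passing `index` makes the
    result reproducible (an assumption added for illustration).
    """
    if index is None:
        index = rng.randrange(len(words))
    label = words[index]
    masked = list(words)
    masked[index] = "<mask>"
    # <S> and <E> mark the head and tail of the sentence.
    return "<S>" + "".join(masked) + "<E>", label

words = ["二甲双胍", "是", "一种", "白色", "粉末"]
sentence, label = make_masked_sample(words, index=4)
print(sentence, label)  # <S>二甲双胍是一种白色<mask><E> 粉末
```

Choosing the index explicitly reproduces the example from the description; omitting it gives the random selection the description calls for.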
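102: Train a coding model with the first drug sentence sample set.
In a specific embodiment, the coding model may be a BERT model or a word embedding model.
If the coding model is a BERT model, training the coding model with the first drug sentence sample set includes:
generating an input vector sequence for each first drug sentence sample;
taking the input vector sequence of the first drug sentence sample as input, computing the output vector sequence of the first drug sentence sample with the BERT model;
taking the output vector sequence of the first drug sentence sample as input, computing the missing word vector of the first drug sentence sample with a preset fully connected layer;
optimizing the BERT model and the preset fully connected layer according to the missing word vector and the label of the first drug sentence sample.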
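The last two steps can be illustrated numerically. The sketch below assumes that the fully connected layer maps the output vector at the masked position to one score per vocabulary word and that the optimization target is a softmax cross-entropy loss; the application does not name the loss, and the weights, the two-dimensional output vector, and the three-word vocabulary are made-up values. A real implementation would compute gradients of this loss with a deep learning framework.

```python
import math

def fully_connected(vector, weights, bias):
    """One score (logit) per vocabulary word: weights @ vector + bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_word_loss(output_vector, weights, bias, label_index):
    """Cross-entropy between the predicted word distribution and the label."""
    probs = softmax(fully_connected(output_vector, weights, bias))
    return -math.log(probs[label_index]), probs

# Hypothetical values: a 2-dimensional output vector at the <mask> position
# and a 3-word vocabulary whose word 0 is the missing word label.
output_vector = [1.0, 0.0]
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
loss, probs = masked_word_loss(output_vector, weights, bias, label_index=0)
```

The loss is small when the layer assigns high probability to the labelled missing word, which is what optimizing the BERT model and the fully connected layer aims for.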
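In another embodiment, generating the input vector sequence of each first drug sentence sample may include:
(1) Segment each first drug sentence sample into words to obtain the words it contains.
For example, segmenting the first drug sentence sample "二甲双胍是一种白色粉末" yields the words it contains, such as "二甲双胍" and "粉末". The Stanford word segmenter may be used to segment the first drug sentence samples, as may statistics-based or string-matching-based methods.
(2) Obtain the encoding vector of each word of each first drug sentence sample from a preset word encoding table.
The preset word encoding table may use an encoding scheme such as one-hot or word2vec; each word corresponds one-to-one to its encoding vector.
(3) Generate the position vector of each word of each first drug sentence sample from the position number of that word.
For example, in the first drug sentence sample "二甲双胍是一种白色粉末", the position number of "二甲双胍" is 1, so the position vector of that word is (0,1).
(4) Concatenate the encoding vector and the position vector of each word of each first drug sentence sample to obtain the encoded input vector of that word.
For example, if the encoding vector of a word in a first drug sentence sample is a 10-dimensional vector and its position vector is a 2-dimensional vector, the encoded input vector of that word is the 12-dimensional vector formed by concatenating the two.
(5) Combine the encoded input vectors of the words of each first drug sentence sample in word order to obtain the input vector sequence of that first drug sentence sample.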
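Steps (2) to (5) can be sketched as follows. The sketch assumes a one-hot encoding table and a binary position vector, which is one reading consistent with the example in which position number 1 maps to (0,1); the vocabulary and all function names are hypothetical.

```python
def one_hot(index, size):
    """One-hot encoding vector for a word index."""
    return [1 if i == index else 0 for i in range(size)]

def position_vector(position, bits):
    """Binary position vector, most significant bit first (an assumption)."""
    return [(position >> shift) & 1 for shift in range(bits - 1, -1, -1)]

def build_input_sequence(words, vocab, position_bits):
    """Concatenate encoding and position vectors, then combine in word order."""
    sequence = []
    for position, word in enumerate(words, start=1):
        encoding = one_hot(vocab[word], len(vocab))
        sequence.append(encoding + position_vector(position, position_bits))
    return sequence

words = ["二甲双胍", "是", "一种", "白色", "粉末"]
vocab = {word: i for i, word in enumerate(words)}  # hypothetical encoding table
sequence = build_input_sequence(words, vocab, position_bits=3)
print(position_vector(1, 2))  # [0, 1]
print(sequence[0])            # [1, 0, 0, 0, 0, 0, 0, 1]
```

Here each word gets a 5-dimensional encoding vector and a 3-dimensional position vector, so every encoded input vector has 8 dimensions, mirroring the 10 + 2 = 12 example above.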
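103: Acquire a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label.
For example, one second drug sentence sample in the second drug sentence sample set is "二甲双胍是一种白色可溶于水的粉末" ("Metformin is a white, water-soluble powder"), and its label is "B-H I-H I-H I-H O O O O O O O O B-H O O O". One third drug sentence sample in the third drug sentence sample set is "在糖尿病治疗中经常使用二甲双胍作为主要药物" ("Metformin is often used as the main drug in the treatment of diabetes"), and its label is "O O O O O O O O O O O B-Z I-Z I-Z I-Z O O O O O O". The labels are assigned one per character: "O" marks a non-entity, "B-H" is the beginning tag of a chemical substance, "I-H" is the inside tag of a chemical substance, "B-Z" is the beginning tag of a therapeutic substance, and "I-Z" is the inside tag of a therapeutic substance.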
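To make the labelling scheme concrete, the following sketch turns a per-character tag sequence back into entities. The function name `decode_entities` is an assumption for illustration; the sentence and tags are the second drug sentence sample given above.

```python
def decode_entities(chars, tags):
    """Collect (entity_text, entity_type) pairs from per-character B-/I-/O tags."""
    entities = []
    current, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(ch)
        else:
            # "O" (or an inconsistent I- tag) ends the open entity.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities

sentence = "二甲双胍是一种白色可溶于水的粉末"
tags = "B-H I-H I-H I-H O O O O O O O O B-H O O O".split()
print(decode_entities(sentence, tags))  # [('二甲双胍', 'H'), ('水', 'H')]
```

With this label, both "二甲双胍" (metformin) and "水" (water) are marked as chemical substances.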
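104: Extract a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample.
In a specific embodiment, the chemical substance recognition model includes:
a model based on a long short-term memory (LSTM) network and a conditional random field (CRF); or
a model based on a bidirectional long short-term memory (BiLSTM) network and a conditional random field; or
a model based on a BiGRU and a conditional random field.
For example, the chemical substance recognition model consists of an LSTM network followed by a CRF. The coding model extracts the vector sequence of the second drug sentence sample; the LSTM network extracts the contextual semantic features of the sample, giving an intermediate vector sequence; the CRF takes the intermediate vector sequence as input and outputs the predicted chemical substance labels of the sample; and the parameters of the LSTM network and the CRF are optimized according to the chemical substance labels and the predicted chemical substance labels.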
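The role of the conditional random field at prediction time can be illustrated with Viterbi decoding, the standard way to find the best-scoring tag sequence in a linear-chain CRF. The sketch below is pure Python with made-up scores: the per-position emission scores stand in for what the LSTM layer would produce, and the single transition penalty is a hypothetical learned parameter, not a value from the application.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions: list of {tag: score}, one dict per position.
    transitions: {(previous_tag, tag): score}; missing pairs score 0.
    """
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emission[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

tags = ["O", "B-H", "I-H"]
transitions = {("O", "I-H"): -10.0}  # an inside tag should not follow "O"
emissions = [
    {"O": 1.0, "B-H": 0.8, "I-H": 0.0},
    {"O": 0.0, "B-H": 0.0, "I-H": 1.0},
]
print(viterbi(emissions, transitions, tags))  # ['B-H', 'I-H']
```

Picking the best tag at each position independently would give the inconsistent sequence "O I-H"; the transition score steers the decoder to "B-H I-H" instead, which is the benefit of placing a CRF after the recurrent network.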
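105: Extract a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample.
In a specific embodiment, the therapeutic substance recognition model includes:
a model based on a long short-term memory (LSTM) network and a conditional random field (CRF); or
a model based on a bidirectional long short-term memory (BiLSTM) network and a conditional random field; or
a model based on a BiGRU and a conditional random field.
For example, the therapeutic substance recognition model consists of a BiLSTM network followed by a CRF. The coding model extracts the vector sequence of the third drug sentence sample; the BiLSTM network extracts the contextual semantic features of the sample, giving an intermediate vector sequence; the CRF takes the intermediate vector sequence as input and outputs the predicted therapeutic substance labels of the sample; and the parameters of the BiLSTM network and the CRF are optimized according to the therapeutic substance labels and the predicted therapeutic substance labels.
106: Acquire a drug sentence to be recognized.
For example, the drug sentence to be recognized may be "二甲双胍是一种白色可溶于水的粉末,在糖尿病治疗中经常使用二甲双胍作为主要药物,在用药期间患者应严格控制葡萄糖的摄入" ("Metformin is a white, water-soluble powder; metformin is often used as the main drug in the treatment of diabetes; during medication, patients should strictly control their glucose intake").
107: Extract a vector sequence of the drug sentence to be recognized with the coding model, obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model.
For example, the coding model extracts the vector sequence of the above drug sentence to be recognized; the trained chemical substance recognition model recognizes that vector sequence and yields the chemical substance entity set {二甲双胍, 二甲双胍, 葡萄糖} (metformin, metformin, glucose); and the therapeutic substance recognition model recognizes the same vector sequence and yields the therapeutic substance entity set {二甲双胍} (metformin).
108: Recognize the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.
For example, if the chemical substance entity set is {二甲双胍, 二甲双胍, 葡萄糖} and the therapeutic substance entity set is {二甲双胍}, the substance entity "二甲双胍" (metformin), which is present in both sets, is recognized as a drug.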
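Step 108 amounts to a set intersection. A minimal sketch, with the hypothetical function name `identify_drugs` and the entity lists from the example above (the recognizers may report the same entity more than once, so the lists are converted to sets first):

```python
def identify_drugs(chemical_entities, therapeutic_entities):
    """Entities found by both recognition models are recognized as drugs."""
    return set(chemical_entities) & set(therapeutic_entities)

chemical = ["二甲双胍", "二甲双胍", "葡萄糖"]
therapeutic = ["二甲双胍"]
print(identify_drugs(chemical, therapeutic))  # {'二甲双胍'}
```

Glucose is dropped because only the chemical substance recognition model reported it.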
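The machine learning-based drug recognition method of Embodiment 1 acquires a first drug sentence sample set, in which each first drug sentence sample contains a missing word and a missing word label; trains a coding model with the first drug sentence sample set; acquires a second drug sentence sample set and a third drug sentence sample set, in which each second drug sentence sample contains a chemical substance label and each third drug sentence sample contains a therapeutic substance label; extracts the vector sequence of the second drug sentence sample with the coding model and, taking that vector sequence as input, trains a chemical substance recognition model according to the chemical substance label of the second drug sentence sample; extracts the vector sequence of the third drug sentence sample with the coding model and, taking that vector sequence as input, trains a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample; acquires a drug sentence to be recognized; extracts the vector sequence of the drug sentence to be recognized with the coding model, obtains a chemical substance entity set by recognizing that vector sequence with the chemical substance recognition model, and obtains a therapeutic substance entity set by recognizing that vector sequence with the therapeutic substance recognition model; and recognizes the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs. Extracting the vector sequences of the second and third drug sentence samples with the coding model improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively. Recognizing chemical substance entities in the drug sentence to be recognized is more stable than recognizing therapeutic substance entities, and recognizing only the substance entities present in both sets as drugs reduces the risk of misrecognizing drugs. Embodiment 1 therefore recognizes the drugs in the drug sentence to be recognized and improves the efficiency and accuracy of drug recognition.
In another embodiment, the method further includes:
outputting the substance entities that are present in the chemical substance entity set but not in the therapeutic substance entity set;
sending a recognition reminder.
Outputting the substance entities that are present in the chemical substance entity set but not in the therapeutic substance entity set helps avoid misrecognition; a recognition reminder is sent to the user, and the user's judgment is received.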
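The entities to be output for review are the set difference of the two entity sets. In the sketch below, the function names and the wording of the reminder message are assumptions; how the reminder is delivered and how the user's judgment is received is left open by the description.

```python
def entities_needing_review(chemical_entities, therapeutic_entities):
    """Entities reported only by the chemical substance recognition model."""
    therapeutic = set(therapeutic_entities)
    pending = []
    for entity in chemical_entities:
        if entity not in therapeutic and entity not in pending:
            pending.append(entity)  # keep first-seen order, drop duplicates
    return pending

def build_reminder(pending):
    """Text of a recognition reminder, or None when nothing needs review."""
    if not pending:
        return None
    return "Please confirm whether these substances are drugs: " + ", ".join(pending)

pending = entities_needing_review(["二甲双胍", "二甲双胍", "葡萄糖"], ["二甲双胍"])
print(build_reminder(pending))  # Please confirm whether these substances are drugs: 葡萄糖
```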
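In another embodiment, the method further includes:
constructing a drug knowledge graph from the recognized drugs.
Two drugs that appear in the same drug sentence can be connected in the knowledge graph to reflect the relationship between them.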
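A minimal sketch of this co-occurrence rule, representing the knowledge graph as a set of undirected edges. The drug names `DrugA`, `DrugB`, `DrugC` and the sentences are placeholders, and matching drugs by substring is a simplification; in the described method the drugs in each sentence would come from the recognition step.

```python
from itertools import combinations

def build_drug_graph(sentences, drugs):
    """Connect every pair of recognized drugs that appear in the same sentence."""
    edges = set()
    for sentence in sentences:
        present = sorted(d for d in drugs if d in sentence)
        for a, b in combinations(present, 2):
            edges.add((a, b))  # sorted pair, so each undirected edge is stored once
    return edges

sentences = [
    "DrugA is often combined with DrugB.",
    "DrugC is a white powder.",
]
edges = build_drug_graph(sentences, {"DrugA", "DrugB", "DrugC"})
print(edges)  # {('DrugA', 'DrugB')}
```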
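Embodiment 2
FIG. 2 is a structural diagram of the machine learning-based drug recognition device provided by Embodiment 2 of this application. The machine learning-based drug recognition device 20 is applied to a computer device and is used to recognize the drugs in a drug sentence to be recognized.
As shown in FIG. 2, the machine learning-based drug recognition device 20 may include a first acquisition module 201, a first training module 202, a second acquisition module 203, a second training module 204, a third training module 205, a third acquisition module 206, a first recognition module 207, and a second recognition module 208.
The first acquisition module 201 is configured to acquire a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label.
In a specific embodiment, acquiring the first drug sentence sample set includes:
(1) Scan and recognize paper medical books by optical character recognition (OCR).
For example, a book image of a paper medical book can be acquired with an optical scanner or a digital camera; the book image is binarized, using a preset binarization threshold, into a black-and-white image; the black-and-white image is preprocessed by denoising, skew correction, and the like; and text recognition based on a neural network or on distance is performed on the preprocessed black-and-white image.
(2) Fetch electronic medical documents from the web with a web crawler.
For example, a web crawler can fetch electronic medical documents from Chinese journal literature databases (such as Wanfang and CNKI) or from Baidu Baike using keywords such as "成分" (ingredient) and "中药" (traditional Chinese medicine), or the names of traditional Chinese medicines.
(3) Extract multiple drug sentences from the scanned paper medical books and the fetched electronic medical documents.
For example, the scanned paper medical books and the fetched electronic medical documents can be split into sentences and the sentences deduplicated to obtain multiple drug sentences.
(4) Clean and preprocess the multiple drug sentences.
For example, the extracted drug sentences can be cleaned by correcting typos, filtering out irrelevant sentences, and the like.
(5) Determine the missing word and the missing word label of each of the multiple drug sentences.
For example, given the drug sentence "二甲双胍是一种白色粉末" ("Metformin is a white powder"), one word in the sentence (say, "粉末", powder) is chosen at random as the missing word, giving the masked drug sentence "<S>二甲双胍是一种白色<mask><E>", where "<S>" marks the head of the sentence, "<E>" marks its tail, "<mask>" stands for the missing word, and "粉末" is the missing word label of the sentence.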
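The sentence splitting and deduplication in step (3) might look like the following sketch. Splitting on Chinese and Western sentence-ending punctuation is an assumption, as is the function name `extract_sentences`; the description does not prescribe a particular splitting rule.

```python
import re

# A sentence is a run of non-terminator characters plus an optional terminator.
SENTENCE_PATTERN = re.compile(r"[^。！？!?]+[。！？!?]?")

def extract_sentences(documents):
    """Split documents into sentences and drop duplicates, keeping first-seen order."""
    seen, sentences = set(), []
    for document in documents:
        for match in SENTENCE_PATTERN.findall(document):
            sentence = match.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                sentences.append(sentence)
    return sentences

documents = [
    "二甲双胍是一种白色粉末。二甲双胍可溶于水。",
    "二甲双胍是一种白色粉末。",
]
print(extract_sentences(documents))
# ['二甲双胍是一种白色粉末。', '二甲双胍可溶于水。']
```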
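The first training module 202 is configured to train a coding model with the first drug sentence sample set.
In a specific embodiment, the coding model may be a BERT model or a word embedding model.
If the coding model is a BERT model, training the coding model with the first drug sentence sample set includes:
generating an input vector sequence for each first drug sentence sample;
taking the input vector sequence of the first drug sentence sample as input, computing the output vector sequence of the first drug sentence sample with the BERT model;
taking the output vector sequence of the first drug sentence sample as input, computing the missing word vector of the first drug sentence sample with a preset fully connected layer;
optimizing the BERT model and the preset fully connected layer according to the missing word vector and the label of the first drug sentence sample.
In another embodiment, generating the input vector sequence of each first drug sentence sample may include:
(1) Segment each first drug sentence sample into words to obtain the words it contains.
For example, segmenting the first drug sentence sample "二甲双胍是一种白色粉末" yields the words it contains, such as "二甲双胍" and "粉末". The Stanford word segmenter may be used to segment the first drug sentence samples, as may statistics-based or string-matching-based methods.
(2) Obtain the encoding vector of each word of each first drug sentence sample from a preset word encoding table.
The preset word encoding table may use an encoding scheme such as one-hot or word2vec; each word corresponds one-to-one to its encoding vector.
(3) Generate the position vector of each word of each first drug sentence sample from the position number of that word.
For example, in the first drug sentence sample "二甲双胍是一种白色粉末", the position number of "二甲双胍" is 1, so the position vector of that word is (0,1).
(4) Concatenate the encoding vector and the position vector of each word of each first drug sentence sample to obtain the encoded input vector of that word.
For example, if the encoding vector of a word in a first drug sentence sample is a 10-dimensional vector and its position vector is a 2-dimensional vector, the encoded input vector of that word is the 12-dimensional vector formed by concatenating the two.
(5) Combine the encoded input vectors of the words of each first drug sentence sample in word order to obtain the input vector sequence of that first drug sentence sample.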
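The second acquisition module 203 is configured to acquire a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label.
For example, one second drug sentence sample in the second drug sentence sample set is "二甲双胍是一种白色可溶于水的粉末" ("Metformin is a white, water-soluble powder"), and its label is "B-H I-H I-H I-H O O O O O O O O B-H O O O". One third drug sentence sample in the third drug sentence sample set is "在糖尿病治疗中经常使用二甲双胍作为主要药物" ("Metformin is often used as the main drug in the treatment of diabetes"), and its label is "O O O O O O O O O O O B-Z I-Z I-Z I-Z O O O O O O". The labels are assigned one per character: "O" marks a non-entity, "B-H" is the beginning tag of a chemical substance, "I-H" is the inside tag of a chemical substance, "B-Z" is the beginning tag of a therapeutic substance, and "I-Z" is the inside tag of a therapeutic substance.
The second training module 204 is configured to extract a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample.
In a specific embodiment, the chemical substance recognition model includes:
a model based on a long short-term memory (LSTM) network and a conditional random field (CRF); or
a model based on a bidirectional long short-term memory (BiLSTM) network and a conditional random field; or
a model based on a BiGRU and a conditional random field.
For example, the chemical substance recognition model consists of an LSTM network followed by a CRF. The coding model extracts the vector sequence of the second drug sentence sample; the LSTM network extracts the contextual semantic features of the sample, giving an intermediate vector sequence; the CRF takes the intermediate vector sequence as input and outputs the predicted chemical substance labels of the sample; and the parameters of the LSTM network and the CRF are optimized according to the chemical substance labels and the predicted chemical substance labels.
The third training module 205 is configured to extract a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample.
In a specific embodiment, the therapeutic substance recognition model includes:
a model based on a long short-term memory (LSTM) network and a conditional random field (CRF); or
a model based on a bidirectional long short-term memory (BiLSTM) network and a conditional random field; or
a model based on a BiGRU and a conditional random field.
For example, the therapeutic substance recognition model consists of a BiLSTM network followed by a CRF. The coding model extracts the vector sequence of the third drug sentence sample; the BiLSTM network extracts the contextual semantic features of the sample, giving an intermediate vector sequence; the CRF takes the intermediate vector sequence as input and outputs the predicted therapeutic substance labels of the sample; and the parameters of the BiLSTM network and the CRF are optimized according to the therapeutic substance labels and the predicted therapeutic substance labels.
The third acquisition module 206 is configured to acquire a drug sentence to be recognized.
For example, the drug sentence to be recognized may be "二甲双胍是一种白色可溶于水的粉末,在糖尿病治疗中经常使用二甲双胍作为主要药物,在用药期间患者应严格控制葡萄糖的摄入" ("Metformin is a white, water-soluble powder; metformin is often used as the main drug in the treatment of diabetes; during medication, patients should strictly control their glucose intake").
The first recognition module 207 is configured to extract a vector sequence of the drug sentence to be recognized with the coding model, obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model.
For example, the coding model extracts the vector sequence of the above drug sentence to be recognized; the trained chemical substance recognition model recognizes that vector sequence and yields the chemical substance entity set {二甲双胍, 二甲双胍, 葡萄糖} (metformin, metformin, glucose); and the therapeutic substance recognition model recognizes the same vector sequence and yields the therapeutic substance entity set {二甲双胍} (metformin).
The second recognition module 208 is configured to recognize the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.
For example, if the chemical substance entity set is {二甲双胍, 二甲双胍, 葡萄糖} and the therapeutic substance entity set is {二甲双胍}, the substance entity "二甲双胍" (metformin), which is present in both sets, is recognized as a drug.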
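How modules 206 to 208 fit together at recognition time can be sketched with a small class that accepts the coding model and the two recognition models as callables. Everything here is a stand-in: the class name `DrugRecognizer`, the character-code "encoder", and the lexicon-lookup "recognizers" replace the trained models purely so that the wiring can be run.

```python
class DrugRecognizer:
    """Wires an encoder and two recognition models together (modules 206-208)."""

    def __init__(self, encode, recognize_chemicals, recognize_therapeutics):
        self.encode = encode
        self.recognize_chemicals = recognize_chemicals
        self.recognize_therapeutics = recognize_therapeutics

    def recognize(self, sentence):
        vectors = self.encode(sentence)                       # shared vector sequence
        chemicals = set(self.recognize_chemicals(vectors))
        therapeutics = set(self.recognize_therapeutics(vectors))
        return chemicals & therapeutics                       # present in both sets

def make_lexicon_recognizer(lexicon):
    """Stand-in for a trained recognition model: looks names up in the decoded text."""
    def recognize(vectors):
        text = "".join(chr(v) for v in vectors)
        return [name for name in lexicon if name in text]
    return recognize

recognizer = DrugRecognizer(
    encode=lambda sentence: [ord(ch) for ch in sentence],     # stand-in coding model
    recognize_chemicals=make_lexicon_recognizer(["二甲双胍", "葡萄糖"]),
    recognize_therapeutics=make_lexicon_recognizer(["二甲双胍"]),
)
sentence = "在糖尿病治疗中经常使用二甲双胍作为主要药物,在用药期间患者应严格控制葡萄糖的摄入"
print(recognizer.recognize(sentence))  # {'二甲双胍'}
```

Both recognizers consume the same vector sequence, so the sentence is encoded only once.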
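The machine learning-based drug recognition device 20 of Embodiment 2 acquires a first drug sentence sample set, in which each first drug sentence sample contains a missing word and a missing word label; trains a coding model with the first drug sentence sample set; acquires a second drug sentence sample set and a third drug sentence sample set, in which each second drug sentence sample contains a chemical substance label and each third drug sentence sample contains a therapeutic substance label; extracts the vector sequence of the second drug sentence sample with the coding model and, taking that vector sequence as input, trains a chemical substance recognition model according to the chemical substance label of the second drug sentence sample; extracts the vector sequence of the third drug sentence sample with the coding model and, taking that vector sequence as input, trains a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample; acquires a drug sentence to be recognized; extracts the vector sequence of the drug sentence to be recognized with the coding model, obtains a chemical substance entity set by recognizing that vector sequence with the chemical substance recognition model, and obtains a therapeutic substance entity set by recognizing that vector sequence with the therapeutic substance recognition model; and recognizes the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs. Extracting the vector sequences of the second and third drug sentence samples with the coding model improves the efficiency of training the chemical substance recognition model and the therapeutic substance recognition model, respectively. Recognizing chemical substance entities in the drug sentence to be recognized is more stable than recognizing therapeutic substance entities, and recognizing only the substance entities present in both sets as drugs reduces the risk of misrecognizing drugs. Embodiment 2 therefore recognizes the drugs in the drug sentence to be recognized and improves the efficiency and accuracy of drug recognition.
In another embodiment, the machine learning-based drug recognition device 20 further includes a sending module, configured to output the substance entities that are present in the chemical substance entity set but not in the therapeutic substance entity set, and to send a recognition reminder.
In another embodiment, the machine learning-based drug recognition device 20 may further include a construction module, configured to construct a drug knowledge graph from the recognized drugs.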
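Embodiment 3
This embodiment provides one or more readable storage media storing computer-readable instructions; the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. When executed by one or more processors, the computer-readable instructions implement the steps in the above embodiment of the machine learning-based drug recognition method, for example steps 101-108 shown in FIG. 1:
101: acquire a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
102: train a coding model with the first drug sentence sample set;
103: acquire a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
104: extract a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
105: extract a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
106: acquire a drug sentence to be recognized;
107: extract a vector sequence of the drug sentence to be recognized with the coding model, obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
108: recognize the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.
Alternatively, when executed by a processor, the computer-readable instructions implement the functions of the modules in the above device embodiment, for example modules 201-208 in FIG. 2:
the first acquisition module 201, configured to acquire a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
the first training module 202, configured to train a coding model with the first drug sentence sample set;
the second acquisition module 203, configured to acquire a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
the second training module 204, configured to extract a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
the third training module 205, configured to extract a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
the third acquisition module 206, configured to acquire a drug sentence to be recognized;
the first recognition module 207, configured to extract a vector sequence of the drug sentence to be recognized with the coding model, obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
the second recognition module 208, configured to recognize the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.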
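Embodiment 4
FIG. 3 is a schematic diagram of the computer device provided in Embodiment 3 of this application. The computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 stored in the memory 301 and executable on the processor 302, such as a machine learning-based drug recognition program. When the processor 302 executes the computer-readable instructions 303, the steps in the above embodiment of the machine learning-based drug recognition method are implemented, for example steps 101-108 shown in FIG. 1:
101: acquire a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
102: train a coding model with the first drug sentence sample set;
103: acquire a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
104: extract a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
105: extract a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
106: acquire a drug sentence to be recognized;
107: extract a vector sequence of the drug sentence to be recognized with the coding model, obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
108: recognize the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.
Alternatively, when executed by the processor, the computer-readable instructions implement the functions of the modules in the above device embodiment, for example modules 201-208 in FIG. 2:
the first acquisition module 201, configured to acquire a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
the first training module 202, configured to train a coding model with the first drug sentence sample set;
the second acquisition module 203, configured to acquire a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
the second training module 204, configured to extract a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
the third training module 205, configured to extract a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
the third acquisition module 206, configured to acquire a drug sentence to be recognized;
the first recognition module 207, configured to extract a vector sequence of the drug sentence to be recognized with the coding model, obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
the second recognition module 208, configured to recognize the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.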
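By way of example, the computer-readable instructions 303 may be divided into one or more modules, which are stored in the memory 301 and executed by the processor 302 to carry out the method. The one or more modules may be a series of computer-readable instruction segments capable of performing specific functions; the instruction segments describe the execution of the computer-readable instructions 303 in the computer device 30. For example, the computer-readable instructions 303 may be divided into the first acquisition module 201, the first training module 202, the second acquisition module 203, the second training module 204, the third training module 205, the third acquisition module 206, the first recognition module 207, and the second recognition module 208 in FIG. 2; see Embodiment 2 for the specific functions of each module.
Those skilled in the art will understand that FIG. 3 is only an example of the computer device 30 and does not limit it; the computer device 30 may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device 30 may also include input and output devices, network access devices, buses, and the like.
The processor 302 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30 and connects the various parts of the entire computer device 30 through various interfaces and lines.
The memory 301 can be used to store the computer-readable instructions 303. The processor 302 implements the various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and calling data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data (such as audio data) created through use of the computer device 30. In addition, the memory 301 may include non-volatile memory and/or volatile memory. The non-volatile memory may include, for example, a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. The volatile memory may include random access memory (RAM) or external cache memory.
If the integrated modules of the computer device 30 are implemented as software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above method embodiments of this application may also be completed by instructing the relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium and which, when executed by a processor, implement the steps of the above method embodiments. The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, or read-only memory (ROM).
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of hardware plus software functional modules.
An integrated module implemented in the form of software functional modules may be stored in a computer-readable storage medium. The software functional modules are stored in a readable storage medium and include several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of this application.
It is evident to those skilled in the art that this application is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by this application. Any reference signs in the claims shall not be construed as limiting the claims concerned. Furthermore, the word "comprising" clearly does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices recited in a system claim may also be implemented by a single module or device through software or hardware. Words such as "first" and "second" are used as names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of this application. Although this application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from their spirit and scope.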

Claims (20)

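  1. A machine learning-based drug recognition method, wherein the method comprises:
    acquiring a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
    training a coding model with the first drug sentence sample set;
    acquiring a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
    extracting a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, training a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
    extracting a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, training a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
    acquiring a drug sentence to be recognized;
    extracting a vector sequence of the drug sentence to be recognized with the coding model, obtaining a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtaining a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
    recognizing the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.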
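  2. The method according to claim 1, wherein acquiring the first drug sentence sample set comprises:
    scanning and recognizing paper medical books by optical character recognition;
    fetching electronic medical documents from the web with a web crawler;
    extracting multiple drug sentences from the scanned paper medical books and the fetched electronic medical documents;
    cleaning and preprocessing the multiple drug sentences;
    determining the missing word and the missing word label of each of the multiple drug sentences.
  3. The method according to claim 1, wherein the coding model is a BERT model, and training the coding model with the first drug sentence sample set comprises:
    generating an input vector sequence for each first drug sentence sample;
    taking the input vector sequence of the first drug sentence sample as input, computing the output vector sequence of the first drug sentence sample with the BERT model;
    taking the output vector sequence of the first drug sentence sample as input, computing the missing word vector of the first drug sentence sample with a preset fully connected layer;
    optimizing the BERT model and the preset fully connected layer according to the missing word vector and the label of the first drug sentence sample.
  4. The method according to claim 3, wherein generating the input vector sequence of each first drug sentence sample comprises:
    segmenting each first drug sentence sample into words to obtain the words contained in each first drug sentence sample;
    obtaining the encoding vector of each word of each first drug sentence sample from a preset word encoding table;
    generating the position vector of each word of each first drug sentence sample from the position number of that word;
    concatenating the encoding vector and the position vector of each word of each first drug sentence sample to obtain the encoded input vector of that word;
    combining the encoded input vectors of the words of each first drug sentence sample in word order to obtain the input vector sequence of that first drug sentence sample.
  5. The method according to claim 1, wherein the chemical substance recognition model comprises:
    a model based on a long short-term memory network and a conditional random field; or
    a model based on a bidirectional long short-term memory network and a conditional random field; or
    a model based on a BiGRU and a conditional random field.
  6. The method according to claim 1, wherein the method further comprises:
    outputting the substance entities that are present in the chemical substance entity set but not in the therapeutic substance entity set;
    sending a recognition reminder.
  7. The method according to claim 1, wherein the method further comprises:
    constructing a drug knowledge graph from the recognized drugs.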
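  8. A machine learning-based drug recognition device, wherein the device comprises:
    a first acquisition module, configured to acquire a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
    a first training module, configured to train a coding model with the first drug sentence sample set;
    a second acquisition module, configured to acquire a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
    a second training module, configured to extract a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, train a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
    a third training module, configured to extract a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, train a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
    a third acquisition module, configured to acquire a drug sentence to be recognized;
    a first recognition module, configured to extract a vector sequence of the drug sentence to be recognized with the coding model, obtain a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtain a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
    a second recognition module, configured to recognize the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.
  9. The device according to claim 8, wherein the first acquisition module is further configured to:
    scan and recognize paper medical books by optical character recognition;
    fetch electronic medical documents from the web with a web crawler;
    extract multiple drug sentences from the scanned paper medical books and the fetched electronic medical documents;
    clean and preprocess the multiple drug sentences;
    determine the missing word and the missing word label of each of the multiple drug sentences.
  10. The device according to claim 8, wherein the coding model is a BERT model, and the first training module is further configured to:
    generate an input vector sequence for each first drug sentence sample;
    taking the input vector sequence of the first drug sentence sample as input, compute the output vector sequence of the first drug sentence sample with the BERT model;
    taking the output vector sequence of the first drug sentence sample as input, compute the missing word vector of the first drug sentence sample with a preset fully connected layer;
    optimize the BERT model and the preset fully connected layer according to the missing word vector and the label of the first drug sentence sample.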
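  11. A computer device, wherein the computer device comprises a memory and a processor, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:
    acquiring a first drug sentence sample set, wherein each first drug sentence sample in the first drug sentence sample set contains a missing word and a missing word label;
    training a coding model with the first drug sentence sample set;
    acquiring a second drug sentence sample set and a third drug sentence sample set, wherein each second drug sentence sample in the second drug sentence sample set contains a chemical substance label, and each third drug sentence sample in the third drug sentence sample set contains a therapeutic substance label;
    extracting a vector sequence of the second drug sentence sample with the coding model and, taking the vector sequence of the second drug sentence sample as input, training a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
    extracting a vector sequence of the third drug sentence sample with the coding model and, taking the vector sequence of the third drug sentence sample as input, training a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
    acquiring a drug sentence to be recognized;
    extracting a vector sequence of the drug sentence to be recognized with the coding model, obtaining a chemical substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the chemical substance recognition model, and obtaining a therapeutic substance entity set by recognizing the vector sequence of the drug sentence to be recognized with the therapeutic substance recognition model;
    recognizing the substance entities present in both the chemical substance entity set and the therapeutic substance entity set as drugs.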
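  12. The computer device of claim 11, wherein acquiring the first drug sentence sample set comprises:
    scanning and recognizing paper medical books by optical character recognition;
    crawling electronic medical documents from the network with a web crawler;
    extracting a plurality of drug sentences from the scanned paper medical books and the crawled electronic medical documents;
    cleaning and preprocessing the plurality of drug sentences;
    determining a missing word and a missing-word label for each of the plurality of drug sentences.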
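  13. The computer device of claim 11, wherein the encoding model is a BERT model, and training the encoding model with the first drug sentence sample set comprises:
    generating an input vector sequence for each first drug sentence sample;
    computing an output vector sequence of the first drug sentence sample with the BERT model, taking the input vector sequence of the first drug sentence sample as input;
    computing a missing-word vector of the first drug sentence sample with a preset fully connected layer, taking the output vector sequence of the first drug sentence sample as input;
    optimizing the BERT model and the preset fully connected layer according to the missing-word vector and the label of the first drug sentence sample.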
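  14. The computer device of claim 13, wherein generating the input vector sequence for each first drug sentence sample comprises:
    segmenting each first drug sentence sample into words to obtain the words contained in each first drug sentence sample;
    obtaining an encoding vector for each word of each first drug sentence sample from a preset word encoding table;
    generating a position vector for each word of each first drug sentence sample from the position number of that word;
    concatenating the encoding vector and the position vector of each word of each first drug sentence sample to obtain an encoded input vector for each word of the first drug sentence sample;
    combining the encoded input vectors of the words of each first drug sentence sample in word order to obtain the input vector sequence of the first drug sentence sample.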
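  15. The computer device of claim 11, wherein the chemical substance recognition model comprises:
    a model based on a long short-term memory (LSTM) network and a conditional random field (CRF); or
    a model based on a bidirectional long short-term memory (BiLSTM) network and a conditional random field; or
    a model based on a BiGRU and a conditional random field.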
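  16. The computer device of claim 11, wherein the processor is further configured to execute computer-readable instructions stored in the memory to implement the following steps:
    outputting substance entities that are present in the chemical substance entity set but absent from the therapeutic substance entity set;
    sending a recognition reminder.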
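  17. The computer device of claim 11, wherein the processor is further configured to execute computer-readable instructions stored in the memory to implement the following step:
    constructing a drug knowledge graph from the recognized drugs.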
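  18. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring a first drug sentence sample set, each first drug sentence sample in the first drug sentence sample set containing a missing word and a missing-word label;
    training an encoding model with the first drug sentence sample set;
    acquiring a second drug sentence sample set and a third drug sentence sample set, each second drug sentence sample in the second drug sentence sample set containing a chemical substance label, and each third drug sentence sample in the third drug sentence sample set containing a therapeutic substance label;
    extracting a vector sequence of the second drug sentence sample with the encoding model and, taking the vector sequence of the second drug sentence sample as input, training a chemical substance recognition model according to the chemical substance label of the second drug sentence sample;
    extracting a vector sequence of the third drug sentence sample with the encoding model and, taking the vector sequence of the third drug sentence sample as input, training a therapeutic substance recognition model according to the therapeutic substance label of the third drug sentence sample;
    acquiring a drug sentence to be recognized;
    extracting a vector sequence of the drug sentence to be recognized with the encoding model, obtaining a chemical substance entity set by recognizing that vector sequence with the chemical substance recognition model, and obtaining a therapeutic substance entity set by recognizing that vector sequence with the therapeutic substance recognition model;
    recognizing as drugs the substance entities present in both the chemical substance entity set and the therapeutic substance entity set.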
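  19. The readable storage media of claim 18, wherein acquiring the first drug sentence sample set comprises:
    scanning and recognizing paper medical books by optical character recognition;
    crawling electronic medical documents from the network with a web crawler;
    extracting a plurality of drug sentences from the scanned paper medical books and the crawled electronic medical documents;
    cleaning and preprocessing the plurality of drug sentences;
    determining a missing word and a missing-word label for each of the plurality of drug sentences.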
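  20. The readable storage media of claim 18, wherein the encoding model is a BERT model, and training the encoding model with the first drug sentence sample set comprises:
    generating an input vector sequence for each first drug sentence sample;
    computing an output vector sequence of the first drug sentence sample with the BERT model, taking the input vector sequence of the first drug sentence sample as input;
    computing a missing-word vector of the first drug sentence sample with a preset fully connected layer, taking the output vector sequence of the first drug sentence sample as input;
    optimizing the BERT model and the preset fully connected layer according to the missing-word vector and the label of the first drug sentence sample.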
PCT/CN2020/093319 2020-03-04 2020-05-29 Machine learning-based drug recognition method and related device WO2021174695A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010144271.X 2020-03-04
CN202010144271.XA CN111523316A (zh) 2020-03-04 2020-03-04 Machine learning-based drug recognition method and related device

Publications (1)

Publication Number Publication Date
WO2021174695A1 (zh) 2021-09-10

Family

ID=71901988

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093319 WO2021174695A1 (zh) 2020-03-04 2020-05-29 Machine learning-based drug recognition method and related device

Country Status (2)

Country Link
CN (1) CN111523316A (zh)
WO (1) WO2021174695A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016309B (zh) * 2020-09-04 2024-03-08 平安科技(深圳)有限公司 Method, device, apparatus and storage medium for extracting drug combinations
US11893776B2 (en) 2020-10-30 2024-02-06 Boe Technology Group Co., Ltd. Image recognition method and apparatus, training method, electronic device, and storage medium
WO2022246691A1 (zh) * 2021-05-26 2022-12-01 深圳晶泰科技有限公司 Method and system for constructing a knowledge graph of small-molecule drug crystal forms

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130066870A1 (en) * 2011-09-12 2013-03-14 Siemens Corporation System for Generating a Medical Knowledge Base
US20140278554A1 (en) * 2013-03-14 2014-09-18 Koninklijke Philips N.V. Using image references in radiology reports to support report-to-image navigation
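CN106708959A (zh) * 2016-11-30 2017-05-24 重庆大学 Method for identifying and ranking combination drugs based on a medical literature database
CN106919794A (zh) * 2017-02-24 2017-07-04 黑龙江特士信息技术有限公司 Method and apparatus for recognizing drug entities across multiple data sources
CN108932342A (zh) * 2018-07-18 2018-12-04 腾讯科技(深圳)有限公司 Semantic matching method, model learning method, and server
CN109829156A (zh) * 2019-01-18 2019-05-31 北京惠每云科技有限公司 Medical text recognition method and apparatus
CN109871545A (zh) * 2019-04-22 2019-06-11 京东方科技集团股份有限公司 Named entity recognition method and apparatus
CN110263167A (zh) * 2019-06-20 2019-09-20 北京百度网讯科技有限公司 Method, apparatus, device and readable storage medium for generating a medical entity classification model
CN110598695A (zh) * 2019-08-15 2019-12-20 北京搜狗科技发展有限公司 Drug identification method and apparatus, and device for drug identification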

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
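CN114420309A (zh) * 2021-09-13 2022-04-29 北京百度网讯科技有限公司 Method for building a drug synergy prediction model, prediction method, and corresponding apparatus
CN114420309B (zh) * 2021-09-13 2023-11-21 北京百度网讯科技有限公司 Method for building a drug synergy prediction model, prediction method, and corresponding apparatus
CN114048148A (zh) * 2022-01-13 2022-02-15 广东拓思软件科学园有限公司 Crowdsourced test report recommendation method, apparatus, and electronic device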

Also Published As

Publication number Publication date
CN111523316A (zh) 2020-08-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20923181

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20923181

Country of ref document: EP

Kind code of ref document: A1