WO2020211250A1 - Entity recognition method, device, equipment, and storage medium for Chinese medical records - Google Patents

Entity recognition method, device, equipment, and storage medium for Chinese medical records

Info

Publication number
WO2020211250A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
vector
phrase
character
chinese medical
Prior art date
Application number
PCT/CN2019/103379
Other languages
English (en)
French (fr)
Inventor
丁佳佳
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Priority to SG11202008377SA
Publication of WO2020211250A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • The invention relates to the field of natural language processing, and in particular to an entity recognition method, device, equipment, and storage medium for Chinese medical records.
  • The inventor recognizes that the performance of existing deep-learning-based Chinese named entity recognition is difficult to improve, since such methods were previously applied to other languages, such as English. Because of the limitations of deep learning models and the differences in language characteristics, the application of named entity recognition to Chinese is restricted. Because the medical field also differs from the general field and from other fields, its application to medical case records is limited as well.
  • The technical problem to be solved by the present invention is to overcome the low accuracy of deep-learning-based Chinese named entity recognition in the prior art, and to propose an entity recognition method, device, equipment, and storage medium for Chinese medical records.
  • Corresponding features are extracted from the text content of the medical record and converted into feature vectors, and the feature vectors are then used as the input of the model to improve the accuracy of entity recognition.
  • A method for entity recognition of Chinese medical records includes the following steps:
  • using a word segmentation tool to segment the Chinese medical record and, taking each phrase obtained after segmentation as a unit, outputting for each character, according to the second correspondence rule, a second feature vector representing the position of that character in its phrase;
  • splicing the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector correspondingly after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
  • inputting the vector set characterizing the Chinese medical record into the trained model to extract the entities therein.
  • The invention also discloses an entity recognition device for Chinese medical records, which includes:
  • a first feature vector generating module, used to identify the personal information contained in the Chinese medical record and output a first feature vector corresponding to the personal information according to the first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
  • a second feature vector generating module, used to segment the Chinese medical record with a word segmentation tool and, taking each phrase obtained after segmentation as a unit, output for each character, according to the second correspondence rule, a second feature vector representing the position of that character in its phrase;
  • a third feature vector generating module, used to identify the radical of each character in the Chinese medical record and output a third feature vector corresponding to the radical of each character according to the third correspondence rule;
  • a fourth feature vector generating module, used to perform n-gram traversal on the Chinese medical record, match each phrase obtained after traversal against the preset original medical dictionary, prefix dictionary, and suffix dictionary, and output a corresponding fourth feature vector for each character according to the matching result and the fourth correspondence rule;
  • a fifth feature vector generating module, used to convert each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool and output, for each character, a fifth feature vector corresponding to its pinyin according to the fifth correspondence rule;
  • a vector set generating module, used to splice the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector correspondingly after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
  • an entity recognition model, used to take the vector set characterizing the Chinese medical record as input to the trained model and extract the entities therein.
  • The present invention also discloses a computer device including a memory and a processor. A computer program is stored on the memory, and the following steps are implemented when the computer program is executed by the processor:
  • using a word segmentation tool to segment the Chinese medical record and, taking each phrase obtained after segmentation as a unit, outputting for each character, according to the second correspondence rule, a second feature vector representing the position of that character in its phrase;
  • splicing the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector correspondingly after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
  • inputting the vector set characterizing the Chinese medical record into the trained model to extract the entities therein.
  • the present invention also discloses a computer-readable storage medium in which a computer program is stored, and the computer program can be executed by at least one processor to implement the following steps:
  • using a word segmentation tool to segment the Chinese medical record and, taking each phrase obtained after segmentation as a unit, outputting for each character, according to the second correspondence rule, a second feature vector representing the position of that character in its phrase;
  • splicing the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector correspondingly after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
  • inputting the vector set characterizing the Chinese medical record into the trained model to extract the entities therein.
  • The present invention first recognizes the entity information in the Chinese medical record and converts it into feature vectors, and then uses the vector set converted from the Chinese medical record as the input of the model, which improves the accuracy of the model's entity extraction.
  • Fig. 1 shows a flowchart of an embodiment of a method for entity recognition of Chinese medical records according to the present invention.
  • Fig. 2 shows a structural diagram of an embodiment of an entity recognition device for Chinese medical records of the present invention.
  • Fig. 3 shows a schematic diagram of the hardware architecture of an embodiment of the computer device of the present invention.
  • the present invention proposes an entity recognition method for Chinese medical records.
  • the entity recognition method for Chinese medical records includes the following steps:
  • Step 01 Identify the personal information contained in the Chinese medical record, and output a first feature vector corresponding to the personal information according to a first correspondence rule; each character in the Chinese medical record corresponds to the same first feature vector.
  • To identify the personal information, regular expression matching can be used.
  • A regular expression is a kind of logical formula for string manipulation. It combines predefined special characters into a "rule string", and this rule string expresses a filtering logic for strings.
  • The personal information mentioned here mainly refers to the patient type and the patient age.
  • These items are chosen because almost every medical record reflects the patient type (male, female, young, elderly, child, infant, etc.) and the patient age. Doctors may adopt different treatment and examination methods for different patient types and ages, so identifying the patient type and patient age benefits the analysis of medical records.
  • From the basic patient information, two features are identified: patient type and patient age.
  • The correspondence between patient type and feature vector can follow either of two rules. Under the first rule, the length of the feature vector equals the number of patient types; each dimension of the feature vector corresponds to one patient type; and the feature vector characterizes a patient type by changing the vector value of the dimension corresponding to that type.
  • Under the second rule, the length of the feature vector is 1, and the feature vector represents different patient types through different vector values.
  • As an example of the first rule, suppose the length of the feature vector is 5 and the initial feature vector is [0,0,0,0,0]. Each dimension corresponds to one patient type; assume the dimensions, from front to back, correspond to "male, female, elderly, infant, child".
  • If the patient type is identified as "male", the value of the first dimension of the initial feature vector is changed from 0 to 1, that is, the feature vector [1,0,0,0,0] represents the patient type "male"; if the patient type is identified as "elderly", it is represented by the feature vector [0,0,1,0,0].
  • As an example of the second rule, the length of the feature vector is 1, that is, the initial feature vector is [0], and different numbers correspond to the five patient types.
  • Assume the numbers 1, 2, 3, 4, and 5 correspond in turn to the patient types "male, female, elderly, infant, child".
  • If the patient type is identified as "male", the value of the initial feature vector is changed from 0 to 1, that is, the feature vector [1] represents the patient type "male";
  • if the patient type is "elderly", it is represented by the feature vector [3].
  • The correspondence rule between patient age and feature vector is as follows: the length of the feature vector is 1, and the vector value equals the patient's age, so different ages correspond to different vector values.
  • The following takes a patient age of "78 years" as an example to illustrate the correspondence rule between patient age and feature vector.
  • The length of the feature vector is 1, that is, the initial feature vector is [0].
  • The vector value of the initial feature vector is changed from 0 to 78, that is, the feature vector [78] indicates that the patient is 78 years old.
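  • The rules of Step 01 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the patterns `TYPE_PATTERN` and `AGE_PATTERN`, the English field names "gender" and "age", and all function names are hypothetical; only the order of patient types and the vector layouts come from the text.

```python
import re

# Order of patient types as assumed in the example in the text.
PATIENT_TYPES = ["male", "female", "elderly", "infant", "child"]

# Hypothetical patterns for fields such as "gender: male" and "age: 78".
TYPE_PATTERN = re.compile(r"gender\s*[:：]\s*(male|female)")
AGE_PATTERN = re.compile(r"age\s*[:：]\s*(\d{1,3})")


def extract_personal_info(record):
    """Return (patient_type, age); a missing field is returned as None."""
    type_match = TYPE_PATTERN.search(record)
    age_match = AGE_PATTERN.search(record)
    patient_type = type_match.group(1) if type_match else None
    age = int(age_match.group(1)) if age_match else None
    return patient_type, age


def type_vector_one_hot(patient_type):
    """First rule: one dimension per patient type."""
    vec = [0] * len(PATIENT_TYPES)
    if patient_type in PATIENT_TYPES:
        vec[PATIENT_TYPES.index(patient_type)] = 1
    return vec


def type_vector_single(patient_type):
    """Second rule: a length-1 vector whose value encodes the type (1..5)."""
    if patient_type in PATIENT_TYPES:
        return [PATIENT_TYPES.index(patient_type) + 1]
    return [0]


def age_vector(age):
    """Length-1 vector whose value equals the age; [0] if the age is unknown."""
    return [age if age is not None else 0]
```

  • Because every character of the record shares the same first feature vector, these functions would be evaluated once per record rather than once per character.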
  • Step 02 Use a word segmentation tool to segment the Chinese medical record, and use the phrase obtained after word segmentation as a unit, and output a second feature vector for representing the position of each character in the phrase corresponding to each character according to the second correspondence rule.
  • Because the medical record is in Chinese, a Chinese word segmentation tool is used.
  • Such word segmentation tools already exist; common ones include jieba, SnowNLP, THULAC, and NLPIR, and they are not described in detail here.
  • The word segmentation tool is used to segment the medical record. Taking the medical record content "no mass was reached above the rectal peritoneal reflection; combined with the preoperative colonoscopy and pathological diagnosis of rectal and anal cancer, it was decided to perform a Miles operation" as an example, word segmentation yields "rectal peritoneum/return/upper/unreached/mass/,/combination/preoperative/colonoscopy/and/pathology/intraoperative diagnosis/is/rectal and anal cancer/,/decision/line/Miles operation/".
  • The second correspondence rule is as follows: the length of the feature vector is 4. The first three dimensions are used for phrases of two or more characters: a change in the value of the first dimension marks the character at the beginning of the phrase, a change in the value of the second dimension marks a character in the middle of the phrase, and a change in the value of the third dimension marks the character at the end of the phrase. The fourth dimension is used for single-character phrases: a change in its value marks the character of a single-character phrase.
  • Each character corresponds to an initial feature vector.
  • The length of the feature vector is 4, so the initial feature vector of each character is [0,0,0,0]. Since "rectal peritoneum" is a four-character phrase, only the first three dimensions of the feature vector are used for its characters.
  • The initial feature vector of the single-character phrase "and" is also [0,0,0,0]. Since it is a single-character phrase, only the fourth dimension is used: the value of the fourth dimension is changed from 0 to 1, so the feature vector of "and" is [0,0,0,1].
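  • The second correspondence rule can be sketched as below. The function name `position_vectors` is hypothetical, and the Chinese strings "直肠腹膜" (for "rectal peritoneum") and "及" (for "and") are assumed reconstructions of the translated example. The sketch takes already-segmented phrases as input, so any segmentation tool could supply them, and it assumes that a two-character phrase uses the first and third dimensions.

```python
def position_vectors(phrases):
    """Return one (character, 4-dim vector) pair per character.

    Dimensions: [first in phrase, middle of phrase, last in phrase, single-character phrase].
    """
    result = []
    for phrase in phrases:
        if len(phrase) == 1:
            # Single-character phrase: only the fourth dimension is set.
            result.append((phrase, [0, 0, 0, 1]))
            continue
        last = len(phrase) - 1
        for i, ch in enumerate(phrase):
            vec = [0, 0, 0, 0]
            if i == 0:
                vec[0] = 1      # beginning of the phrase
            elif i == last:
                vec[2] = 1      # end of the phrase
            else:
                vec[1] = 1      # middle of the phrase
            result.append((ch, vec))
    return result


# Assumed segmentation output for part of the example sentence.
example = position_vectors(["直肠腹膜", "及"])
```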
  • Step 03 Identify the radical of each character in the Chinese medical record, and output a third feature vector corresponding to the radical of each character according to the third correspondence rule.
  • The third correspondence rule takes one of two forms. In the first, the length of the feature vector equals the preset number of entity radicals; each dimension of the feature vector corresponds to one entity radical; and a change in the value of the dimension corresponding to an entity radical marks a character that contains that radical. In the second, the length of the feature vector is 1, and different vector values mark characters containing different entity radicals.
  • The entity radicals are preset according to specific needs.
  • For example, the two most effective entity radicals are preset as the sickness radical ("疒") and the moon radical ("月").
  • Under the first form, the length of the feature vector is 2, and the corresponding initial feature vector is [0,0].
  • A change in the value of the first dimension marks the sickness radical ("疒"), and a change in the value of the second dimension marks the moon radical ("月").
  • Under the second form, the corresponding initial feature vector is [0],
  • and the vector values 1 and 2 represent the sickness radical ("疒") and the moon radical ("月"), respectively.
  • The last three characters of "rectal peritoneum" all contain the moon radical ("月"), so their third feature vectors are all [2].
  • The first character of "rectal peritoneum" ("直", zhi) contains neither the moon radical ("月") nor the sickness radical ("疒"),
  • so the third feature vector of that character is the initial feature vector [0].
  • For a character containing the sickness radical ("疒"), the corresponding third feature vector is [1].
  • If the preset entity radicals also include the bamboo head ("⺮") and the bone radical ("骨"), then under the first form the length of the feature vector is 4 and the initial feature vector is [0,0,0,0]; under the second form the initial feature vector is [0], and the vector values 1, 2, 3, and 4 represent the sickness radical ("疒"), the moon radical ("月"), the bamboo head ("⺮"), and the bone radical ("骨"), respectively.
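  • Both forms of the third correspondence rule can be sketched as follows. The lookup table `RADICAL_OF` is a hypothetical stand-in for a full character-to-radical dictionary, and the characters in it are assumed reconstructions of the translated example; the function names are likewise illustrative.

```python
# Preset entity radicals, in the order used by the example in the text.
ENTITY_RADICALS = ["疒", "月"]

# Hypothetical character-to-radical table; a real system would need a
# complete radical dictionary.
RADICAL_OF = {"直": "目", "肠": "月", "腹": "月", "膜": "月", "癌": "疒"}


def radical_vector_one_hot(ch):
    """First form: one dimension per preset entity radical."""
    vec = [0] * len(ENTITY_RADICALS)
    radical = RADICAL_OF.get(ch)
    if radical in ENTITY_RADICALS:
        vec[ENTITY_RADICALS.index(radical)] = 1
    return vec


def radical_vector_single(ch):
    """Second form: length-1 vector; 0 means no preset entity radical."""
    radical = RADICAL_OF.get(ch)
    if radical in ENTITY_RADICALS:
        return [ENTITY_RADICALS.index(radical) + 1]
    return [0]


third_vectors = [radical_vector_single(ch) for ch in "直肠腹膜"]
```

  • Extending `ENTITY_RADICALS` with further radicals lengthens the one-hot vector and adds further values to the single-value form without other changes.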
  • Step 04 Perform n-gram traversal on the Chinese medical record, and match each phrase obtained after traversal with the preset original medical dictionary, prefix dictionary, and suffix dictionary, and output each word according to the matching result and the fourth corresponding rule The corresponding fourth feature vector.
  • n takes a non-zero natural number equal to or less than the length of the Chinese medical record.
  • n-gram traversal is a common method in natural language processing; in effect, it is a form of word segmentation,
  • where n is the length of each phrase obtained.
  • The original medical dictionary can be any existing dictionary of medical terms.
  • In the original medical dictionary, each phrase corresponds to an entity category.
  • The phrases that the prefix dictionary and the suffix dictionary derive from a given phrase inherit the entity category of that phrase in the original medical dictionary.
  • The prefix dictionary is constructed as follows: identify the phrases with more than two characters in the original medical dictionary, and store the first i characters of each identified phrase in the prefix dictionary, for each natural number i that is less than the length of the phrase and not less than half the length of the phrase, where half the length of the phrase is taken as an integer.
  • For example, the prefix dictionary includes "left intertrochanteric bone", "left intertrochanteric bone", and "left intertrochanteric bone".
  • The suffix dictionary is constructed as follows: identify the phrases with more than two characters in the original medical dictionary, and store the last i characters of each identified phrase in the suffix dictionary, for each natural number i that is less than the length of the phrase and not less than half the length of the phrase, where half the length of the phrase is taken as an integer.
  • For example, the suffix dictionary constructed from "intertrochanteric fracture" includes "lateral intertrochanteric fracture", "intertrochanteric fracture", "intertrochanteric fracture", and "intertrochanteric fracture".
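  • The construction of the two derived dictionaries can be sketched as follows. The translated text is ambiguous about the bounds on i; this sketch assumes that i runs from half the phrase length, rounded up, to one less than the phrase length. The function name and the placeholder term "ABCDEFGHI" (nine letters standing in for a nine-character term) are hypothetical.

```python
import math


def build_affix_dictionaries(original):
    """Derive prefix and suffix dictionaries from {term: entity_category}."""
    prefixes, suffixes = {}, {}
    for term, category in original.items():
        if len(term) <= 2:
            continue  # only phrases with more than two characters are used
        half = math.ceil(len(term) / 2)  # assumption: half the length, rounded up
        for i in range(half, len(term)):
            # Derived entries inherit the category of the original term;
            # the first term to produce an entry keeps it.
            prefixes.setdefault(term[:i], category)
            suffixes.setdefault(term[-i:], category)
    return prefixes, suffixes


prefixes, suffixes = build_affix_dictionaries({"ABCDEFGHI": "disease and diagnosis"})
```

  • With a nine-character term this yields four prefixes and four suffixes (lengths 5 to 8), which agrees with the four suffix entries listed in the example.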
  • The three dictionaries (original medical dictionary, prefix dictionary, and suffix dictionary) can be matched simultaneously, or a matching order can be set. Whichever approach is used, once a phrase matches one of the dictionaries, matching against the other two stops.
  • The matching required here is an exact match.
  • The matching result is one of two types: match or no match. For a match, the result includes the name of the matched dictionary, the matched medical term, and the entity category corresponding to that term.
  • The fourth feature vector is used to distinguish entity categories, and its length equals the number of entity categories. For example, suppose there are six entity categories: diseases and diagnoses, symptoms and signs, body parts, examinations and tests, surgery, and drugs.
  • The length of the corresponding feature vector is then 6, the initial feature vector is [0,0,0,0,0,0], and each dimension corresponds to one entity category.
  • When the matching result is no match, the initial feature vector is output for each character.
  • When the phrase matches the original medical dictionary, the applicable correspondence rule is: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one entity category; and the feature vector marks the first, middle, or last position of a character in the phrase by changing the initial vector value to the first, second, or third vector value, respectively.
  • When the phrase matches the prefix dictionary, the applicable correspondence rule is: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one entity category; and the feature vector marks the first or a non-first position of a character in the phrase by changing the initial vector value to the first or the second vector value, respectively.
  • Take "blood cell" as an example, with the first and second vector values taken as 1 and 2, respectively.
  • The phrase appears in the prefix dictionary and is associated with the entity category "examinations and tests", so all three of its characters have the value of the fourth dimension changed.
  • When the phrase matches the suffix dictionary, the applicable correspondence rule is: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one entity category; and the feature vector marks a non-final or the final position of a character in the phrase by changing the initial vector value to the second or the third vector value, respectively.
  • In n-gram traversal, a sentence is divided multiple times with different phrase lengths, so each character obtains n feature vectors. Each of these feature vectors can only be one of two things: the initial feature vector, output when there is no match,
  • or the corresponding feature vector output on a match (the feature vector output on a match is the same under every traversal).
  • As long as a character is matched in at least one traversal, its final output is the corresponding matched feature vector; only if it is never matched is its final output the initial feature vector.
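  • Step 04 as a whole can be sketched as follows, assuming the vector values 1, 2, and 3 for the first, middle, and last positions and the matching order original, prefix, suffix. The function names, the category labels, and the Chinese strings "直肠腹膜" ("rectal peritoneum") and "血细胞" ("blood cell") are assumptions for illustration. In this sketch a later match simply overwrites an earlier one for the same character, which relies on the text's statement that matched outputs agree across traversals.

```python
CATEGORIES = ["disease and diagnosis", "symptom and sign", "body part",
              "examination and test", "surgery", "drug"]


def position_value(kind, offset, n):
    """Vector value for the character at `offset` in a matched phrase of length n."""
    first, last = offset == 0, offset == n - 1
    if kind == "original":
        return 1 if first else 3 if last else 2   # first / middle / last
    if kind == "prefix":
        return 1 if first else 2                  # first / non-first
    return 3 if last else 2                       # suffix: last / non-final


def fourth_feature_vectors(text, original, prefixes, suffixes, max_n=None):
    """n-gram traversal with exact dictionary matching; one vector per character."""
    vectors = [[0] * len(CATEGORIES) for _ in text]
    max_n = max_n or len(text)
    dictionaries = (("original", original), ("prefix", prefixes), ("suffix", suffixes))
    for n in range(1, max_n + 1):
        for start in range(len(text) - n + 1):
            phrase = text[start:start + n]
            for kind, dictionary in dictionaries:
                if phrase in dictionary:
                    dim = CATEGORIES.index(dictionary[phrase])
                    for offset in range(n):
                        vectors[start + offset][dim] = position_value(kind, offset, n)
                    break  # stop at the first dictionary that matches
    return vectors


body = fourth_feature_vectors("直肠腹膜", {"直肠腹膜": "body part"}, {}, {})
blood = fourth_feature_vectors("血细胞", {}, {"血细胞": "examination and test"}, {})
```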
  • Step 05 Use a Chinese pinyin conversion tool to convert each character in the Chinese medical record into pinyin, and output a fifth feature vector corresponding to the pinyin of each character according to the fifth correspondence rule.
  • Chinese pinyin conversion tools already exist, and a Python package can be used as the conversion tool.
  • The converted pinyin may omit tones, or it may indicate tones with the digits 1, 2, 3, and 4. Taking "pi" as an example, the converted pinyin can be "pi" or "pi1".
  • The fifth correspondence rule is as follows: the length of the feature vector is 1, and the feature vector represents different pinyin through different vector values.
  • The length of the feature vector is defined as 1, and the initial feature vector is [0].
  • Each pinyin is preset with a corresponding number. Assuming the number corresponding to "pi" is 20, the initial vector value 0 in the feature vector is replaced with 20, that is, the feature vector corresponding to "pi" is [20].
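  • The fifth correspondence rule can be sketched as follows. The sketch takes pinyin strings as given, so any conversion package (pypinyin is one possible choice) could supply them. The table `PINYIN_IDS` is hypothetical apart from the pair "pi" and 20 taken from the text, and treating toned and untoned forms as the same pinyin is an assumption exposed through the `ignore_tone` parameter.

```python
# Hypothetical preset table; only "pi" -> 20 comes from the text.
PINYIN_IDS = {"pi": 20}


def pinyin_vector(pinyin, table=PINYIN_IDS, ignore_tone=True):
    """Return the length-1 feature vector for one pinyin string.

    Unknown pinyin keep the initial value 0.
    """
    key = pinyin.rstrip("1234") if ignore_tone else pinyin
    return [table.get(key, 0)]
```

  • With `ignore_tone=False`, toned forms such as "pi1" would need their own entries in the table.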
  • Step 06 According to the splicing rule, splice the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector correspondingly after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record.
  • Suppose the initial vector of a character is [0] and its first, second, third, fourth, and fifth feature vectors are [1a], [2b], [3c], [4d], and [5e], respectively.
  • The final feature vector of the character is then [0,1a,2b,3c,4d,5e].
  • In step 01, each character in the Chinese medical record has two first feature vectors: [1], representing the gender as male (using the second correspondence rule between patient type and feature vector), and [78], representing the age as 78 years.
  • In step 02, "rectal peritoneum" is a four-character phrase, and the second feature vectors of its four characters are [1,0,0,0], [0,1,0,0], [0,1,0,0], and [0,0,1,0].
  • In step 03, the last three characters of "rectal peritoneum" all contain the moon radical ("月"), so the third feature vectors of these three characters are all [2].
  • In step 04, "rectal peritoneum" appears in the original medical dictionary and is associated with the entity category "body parts", so the fourth feature vectors of its four characters are [0,0,1,0,0,0], [0,0,2,0,0,0], [0,0,2,0,0,0], and [0,0,3,0,0,0].
  • In step 05, the content of the Chinese medical record is converted to pinyin, and the fifth feature vector is obtained from the number preset for each pinyin.
  • Suppose the content of the Chinese medical record is "gender: male...age: 78...rectal peritoneum"
  • and the numbers corresponding to its characters are 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, and 17, respectively; the fifth feature vectors of the characters are then [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], and [17].
  • Suppose the content of the Chinese medical record is "gender: male...rectal peritoneum".
  • The resulting vector set is then [0,1,78,1,0,0,0,0,0,0,0,0,0,0,7] [0,1,78,0,0,1,0,0,0,0,0,0,0,0,8] [0,1,78,0,0,0,1,0,0,0,0,0,0,0,9]...
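  • The splicing rule amounts to concatenating the per-character feature vectors after the initial vector. A minimal sketch, with the hypothetical helper name `splice`, reproduces the first vector of the example: initial [0], first feature vectors [1] and [78], a four-dimensional second vector, a one-dimensional third vector, a six-dimensional fourth vector, and a one-dimensional fifth vector, for 15 values in total.

```python
def splice(initial, *features):
    """Concatenate the feature vectors, in order, after the initial vector."""
    out = list(initial)
    for feature in features:
        out.extend(feature)
    return out


first_char = splice(
    [0],                    # initial vector
    [1], [78],              # first feature vectors: patient type and age
    [1, 0, 0, 0],           # second: first character of a multi-character phrase
    [0],                    # third: no preset entity radical
    [0, 0, 0, 0, 0, 0],     # fourth: no dictionary match
    [7],                    # fifth: pinyin number
)
```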
  • Step 07 Input the vector set used to characterize the Chinese medical records into the trained model to extract entities therein.
  • The model mentioned here may be a deep neural network model, such as a bidirectional LSTM + CRF, or a traditional machine learning model.
  • When the model is trained, the input vectors and the corresponding output values are defined for the model.
  • After training, the model can recognize the specific entity features.
  • The present invention provides an entity recognition device for Chinese medical records.
  • The device 20 can be divided into one or more modules.
  • FIG. 2 shows a structural diagram of an embodiment of the entity recognition device 20 for Chinese medical records.
  • In this embodiment, the device 20 can be divided into a first feature vector generating module 201, a second feature vector generating module 202, a third feature vector generating module 203, a fourth feature vector generating module 204, a fifth feature vector generating module 205, a vector set generating module 206, and an entity recognition model 207.
  • The following description introduces the specific functions of the modules 201-207.
  • The first feature vector generating module 201 is configured to identify the personal information contained in the Chinese medical record and output a first feature vector corresponding to the personal information according to a first correspondence rule; each character in the Chinese medical record corresponds to the same first feature vector.
  • The second feature vector generating module 202 is configured to segment the Chinese medical record with a word segmentation tool and, taking each phrase obtained after segmentation as a unit, output for each character, according to the second correspondence rule, a second feature vector representing the position of that character in its phrase.
  • The third feature vector generating module 203 is configured to identify the radical of each character in the Chinese medical record and output a third feature vector corresponding to the radical of each character according to the third correspondence rule.
  • The fourth feature vector generating module 204 is configured to perform n-gram traversal on the Chinese medical record, match each phrase obtained after traversal against the preset original medical dictionary, prefix dictionary, and suffix dictionary, and output a corresponding fourth feature vector for each character according to the matching result and the fourth correspondence rule.
  • The fifth feature vector generating module 205 is configured to convert each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool and output a fifth feature vector corresponding to the pinyin of each character according to the fifth correspondence rule.
  • The vector set generating module 206 is configured to splice the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector correspondingly after the initial vector of each character according to the splicing rule, to obtain a vector set characterizing the Chinese medical record.
  • The entity recognition model 207 is configured to take the vector set characterizing the Chinese medical record as input to the trained model and extract the entities therein.
  • the present invention also proposes a computer device.
  • the computer device 2 is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions.
  • it can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers).
  • The computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23, which can communicate with each other through a system bus. Specifically:
  • The memory 21 includes at least one type of computer-readable storage medium.
  • The readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • The memory 21 may be an internal storage unit of the computer device 2, for example, a hard disk or a memory of the computer device 2.
  • The memory 21 may also be an external storage device of the computer device 2, for example, a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 2.
  • the memory 21 may also include both the internal storage unit of the computer device 2 and its external storage device.
  • the memory 21 is generally used to store an operating system and various application software installed in the computer device 2, for example, a computer program used to implement the entity recognition method of the Chinese medical record.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 22 is generally used to control the overall operation of the computer device 2, for example, perform data interaction or communication-related control and processing with the computer device 2.
  • the processor 22 is used to run the program code or processing data stored in the memory 21, for example, to run a computer program for realizing the entity recognition method of the Chinese medical record.
  • the network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used to establish a communication connection between the computer device 2 and other computer devices.
  • the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and establish a data transmission channel and a communication connection between the computer device 2 and the external terminal.
  • the network may be Intranet, Internet, Global System of Mobile Communication (GSM), Wideband Code Division Multiple Access (WCDMA), 4G network, 5G Network, Bluetooth (Bluetooth), Wi-Fi and other wireless or wired networks.
  • FIG. 3 only shows the computer device 2 with components 21-23, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • The computer program stored in the memory 21 for implementing the entity recognition method for Chinese medical records can be executed by one or more processors (in this embodiment, the processor 22) to perform the following steps:
  • Step 01: Identify the personal information contained in the Chinese medical record, and output a first feature vector corresponding to the personal information according to a first correspondence rule; each character in the Chinese medical record corresponds to the same first feature vector.
  • Step 02: Segment the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, output for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase.
  • Step 03: Identify the radical of each character in the Chinese medical record, and output for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character.
  • Step 04: Perform n-gram traversal on the Chinese medical record, match each phrase obtained by the traversal against the preset original medical dictionary, prefix dictionary, and suffix dictionary, and output for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule.
  • Step 05: Convert each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and output for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character.
  • Step 06: According to the splicing rule, splice the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record.
  • Step 07: Input the vector set characterizing the Chinese medical record into the trained model to extract the entities therein.
  • In addition, the present invention provides a computer-readable storage medium, which is a non-volatile readable storage medium storing a computer program that can be executed by at least one processor to implement the operations of the above entity recognition method or apparatus for Chinese medical records.
  • The computer-readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • In some embodiments, the computer-readable storage medium may be an internal storage unit of a computer device, such as a hard disk or memory of the computer device.
  • In other embodiments, the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device.
  • The computer-readable storage medium may also include both the internal storage unit and the external storage device of the computer device.
  • In this embodiment, the computer-readable storage medium is generally used to store the operating system and various application software installed on a computer device, such as the aforementioned computer program implementing the entity recognition method for Chinese medical records.
  • The computer-readable storage medium can also be used to temporarily store various types of data that have been output or are to be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

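An entity recognition method for Chinese medical records, belonging to the field of natural language processing. The method comprises the following steps: outputting various feature vectors according to different correspondence rules, including a first feature vector corresponding to the personal information, a second feature vector characterizing the position of each character within its phrase, a third feature vector corresponding to the radical of each character, a fourth feature vector output for each character, and a fifth feature vector corresponding to the pinyin of each character; then splicing the feature vectors after the initial vector of each character according to a splicing rule, to obtain a vector set characterizing the Chinese medical record; and finally inputting the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.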

Description

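Entity recognition method, apparatus, device, and storage medium for Chinese medical records
This application claims priority to Chinese patent application No. 201910316061.1, filed on April 19, 2019 and entitled "中文病历的实体识别方法、装置、设备及存储介质" (Entity recognition method, apparatus, device, and storage medium for Chinese medical records), the entire content of which is incorporated herein by reference.
Technical Field
The present invention relates to the field of natural language processing, and in particular to an entity recognition method, apparatus, device, and storage medium for Chinese medical records.
Background
There is currently strong demand for applying named entity recognition to medical records, for example for querying, searching, and organizing records.
The inventor realized that the performance of existing deep-learning-based Chinese named entity recognition is difficult to improve, and that such methods were previously applied mainly to other languages, such as English. The limitations of deep learning models and the differing characteristics of individual languages restrict the application of named entity recognition to Chinese. Differences between the general domain, other domains, and the medical domain further restrict its application to medical records.
Summary of the Invention
The technical problem to be solved by the present invention is the low accuracy of deep-learning-based Chinese named entity recognition in the prior art. To overcome it, an entity recognition method, apparatus, device, and storage medium for Chinese medical records are proposed: features are extracted from the text of a Chinese medical record and converted into feature vectors, which are then used as the model input to improve the accuracy of entity recognition.
The present invention solves the above technical problem through the following technical solutions: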
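An entity recognition method for Chinese medical records comprises the following steps:
identifying the personal information contained in the Chinese medical record, and outputting a first feature vector corresponding to the personal information according to a first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
segmenting the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, outputting for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase;
identifying the radical of each character in the Chinese medical record, and outputting for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character;
performing n-gram traversal on the Chinese medical record, matching each phrase obtained by the traversal against a preset original medical dictionary, prefix dictionary, and suffix dictionary, and outputting for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule;
converting each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and outputting for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character;
splicing, according to a splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
inputting the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.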
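The present invention also discloses an entity recognition apparatus for Chinese medical records, comprising:
a first feature vector generation module, configured to identify the personal information contained in the Chinese medical record and to output a first feature vector corresponding to the personal information according to a first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
a second feature vector generation module, configured to segment the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, to output for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase;
a third feature vector generation module, configured to identify the radical of each character in the Chinese medical record and to output for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character;
a fourth feature vector generation module, configured to perform n-gram traversal on the Chinese medical record, to match each phrase obtained by the traversal against a preset original medical dictionary, prefix dictionary, and suffix dictionary, and to output for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule;
a fifth feature vector generation module, configured to convert each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool and to output for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character;
a vector set generation module, configured to splice, according to a splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
an entity recognition model, configured to input the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.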
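The present invention also discloses a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the following steps:
identifying the personal information contained in the Chinese medical record, and outputting a first feature vector corresponding to the personal information according to a first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
segmenting the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, outputting for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase;
identifying the radical of each character in the Chinese medical record, and outputting for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character;
performing n-gram traversal on the Chinese medical record, matching each phrase obtained by the traversal against a preset original medical dictionary, prefix dictionary, and suffix dictionary, and outputting for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule;
converting each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and outputting for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character;
splicing, according to a splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
inputting the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.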
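The present invention also discloses a computer-readable storage medium storing a computer program that can be executed by at least one processor to implement the following steps:
identifying the personal information contained in the Chinese medical record, and outputting a first feature vector corresponding to the personal information according to a first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
segmenting the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, outputting for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase;
identifying the radical of each character in the Chinese medical record, and outputting for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character;
performing n-gram traversal on the Chinese medical record, matching each phrase obtained by the traversal against a preset original medical dictionary, prefix dictionary, and suffix dictionary, and outputting for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule;
converting each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and outputting for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character;
splicing, according to a splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
inputting the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.
The present invention first identifies entity information in the Chinese medical record and converts it into feature vectors, then uses the vector set into which the whole record is converted as the model input, thereby improving the accuracy of entity extraction by the model.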
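Brief Description of the Drawings
FIG. 1 shows a flowchart of an embodiment of the entity recognition method for Chinese medical records of the present invention;
FIG. 2 shows a structural diagram of an embodiment of the entity recognition apparatus for Chinese medical records of the present invention;
FIG. 3 shows a schematic diagram of the hardware architecture of an embodiment of the computer device of the present invention.
Detailed Description
The present invention is further described below by way of embodiments, but is not thereby limited to the scope of the described embodiments.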
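First, the present invention proposes an entity recognition method for Chinese medical records.
In one embodiment, as shown in FIG. 1, the entity recognition method for Chinese medical records includes the following steps:
Step 01: Identify the personal information contained in the Chinese medical record, and output a first feature vector corresponding to the personal information according to the first correspondence rule; each character in the Chinese medical record corresponds to the same first feature vector.
Personal information can be identified by regular expression matching. A regular expression is a logical formula for operating on strings: predefined special characters and combinations of them form a "rule string" that expresses filtering logic for strings.
To identify personal information, a regular expression that matches personal information is first created; its exact form depends on the programming language used, since each language defines its own syntax. The regular expression is then matched against the patient's basic information to identify the personal information contained in it.
Personal information here mainly refers to patient type and patient age. These are chosen because nearly every medical record states the patient type (male, female, youth, elderly, child, infant, etc.) and the patient age, and physicians may adopt different treatments and examinations for different patient types and ages. Recognizing patient type and age therefore helps the analysis of the record, so these two features are identified from the patient's basic information.
Because the patient's basic information includes both patient type and patient age, there are two first correspondence rules: one between patient type and feature vector, and one between patient age and feature vector.
Specifically, there are two alternative rules for mapping patient type to a feature vector. Under the first, the length of the feature vector equals the number of patient types; each dimension of the feature vector corresponds to one patient type; and the feature vector characterizes a patient type by a change in the value of the dimension corresponding to that type. Under the second, the length of the feature vector is 1, and different vector values characterize different patient types.
The two rules are illustrated below using five patient types: 男 (male), 女 (female), 老人 (elderly), 婴儿 (infant), and 儿童 (child).
Under the first rule the feature vector has length 5. Suppose the initial feature vector is [0,0,0,0,0] and its dimensions correspond, in order, to the patient types 男, 女, 老人, 婴儿, 儿童. If the patient type identified from the patient's basic information is 男, the value of the first dimension, which corresponds to 男, changes from 0 to 1, so the feature vector [1,0,0,0,0] represents patient type 男; if the identified patient type is 老人, it is represented by the feature vector [0,0,1,0,0].
Under the second rule the feature vector has length 1, so the initial feature vector is [0], and different numbers correspond to the five patient types; suppose the numbers 1, 2, 3, 4, 5 correspond in turn to 男, 女, 老人, 婴儿, 儿童. If the identified patient type is 男, the value changes from 0 to 1, so the feature vector [1] represents patient type 男; if the identified patient type is 老人, it is represented by the feature vector [3].
The rule mapping patient age to a feature vector is: the length of the feature vector is 1, and the vector value equals the patient's age, so that different values characterize different ages.
Taking a patient age of 78 as an example: under this rule the feature vector has length 1 and the initial feature vector is [0]. Once the age 78 is identified, the vector value is changed from 0 to 78, so the feature vector [78] represents a patient aged 78.
That each character in the Chinese medical record corresponds to the same first feature vector means the following: if the patient in a given record is identified as male and 78 years old, the feature vector of every character in that record contains the first feature vector [1,0,0,0,0] or [1] characterizing the male patient type and the first feature vector [78] characterizing the age of 78.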
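Step 01 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the two regular expressions, the "性别:"/"年龄:" field labels, and all function names are assumptions made for the example, and only the 男/女 types are matched here.

```python
import re

# Category order taken from the five-type example in the text.
PATIENT_TYPES = ["男", "女", "老人", "婴儿", "儿童"]

# Hypothetical patterns; real records would need patterns fitted to their layout.
TYPE_RE = re.compile(r"性别[::]\s*(男|女)")
AGE_RE = re.compile(r"年龄[::]\s*(\d{1,3})")


def patient_type_vector(record, one_hot=True):
    """Encode the patient type under the first (one-hot) or second (length-1) rule."""
    match = TYPE_RE.search(record)
    index = PATIENT_TYPES.index(match.group(1)) if match else None
    if one_hot:
        vec = [0] * len(PATIENT_TYPES)
        if index is not None:
            vec[index] = 1
        return vec
    # Second rule: one number per type, 0 meaning "not identified".
    return [0 if index is None else index + 1]


def age_vector(record):
    """Encode the patient age as a length-1 vector whose value is the age."""
    match = AGE_RE.search(record)
    return [int(match.group(1)) if match else 0]


def first_feature_vectors(record, one_hot=True):
    """Give every character of the record the same first feature vector."""
    shared = patient_type_vector(record, one_hot) + age_vector(record)
    return [list(shared) for _ in record]
```

For a record containing "性别:男" and "年龄:78", every character receives [1,0,0,0,0,78] under the first rule or [1,78] under the second.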
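Step 02: Segment the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, output for each character, according to the second correspondence rule, a second feature vector characterizing the position of that character within its phrase.
Because the records are in Chinese, a Chinese word segmentation tool is used. Such tools already exist; common ones include jieba, SnowNLP, THULAC, and NLPIR, and they are not described further here.
Segmentation separates the individual characters and words in a sentence, and can also separate punctuation, to facilitate subsequent entity recognition.
The medical record is segmented with the segmentation tool. Taking the record content "直肠腹膜返折上方未及肿块,结合术前肠镜及病理术中诊断为直肠肛管癌,决定行Miles术" as an example, segmentation yields "直肠腹膜/返折/上方/未及/肿块/,/结合/术前/肠镜/及/病理/术中诊断/为/直肠肛管癌/,/决定/行/Miles术/".
The second correspondence rule is: the length of the feature vector is 4. The first three dimensions characterize phrases of two or more characters: a change in the value of the first dimension marks the first character of the phrase, a change in the second dimension marks a character in the middle of the phrase, and a change in the third dimension marks the last character of the phrase. The fourth dimension characterizes single-character phrases: a change in its value marks the character of a single-character phrase.
Taking the four-character phrase "直肠腹膜" (rectal peritoneum) as an example: every character has an initial feature vector, and under the second correspondence rule its length is 4, so each character starts from [0,0,0,0]. Because "直肠腹膜" has four characters, only the first three dimensions are used. "直" is the first character, so the first dimension is changed from 0 to 1 and the feature vector of "直" is [1,0,0,0]. "肠" and "腹" are both in the middle, so both get the same feature vector, with the second dimension changed from 0 to 1: [0,1,0,0]. "膜" is the last character, so the third dimension is changed from 0 to 1 and its feature vector is [0,0,1,0].
Taking the single-character phrase "及" as another example: its initial feature vector is also [0,0,0,0]. Because it is a single-character phrase, only the fourth dimension is used; changing its value from 0 to 1 gives the feature vector [0,0,0,1] for "及".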
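The second correspondence rule can be sketched as below. To keep the example self-contained it takes phrases that have already been segmented (by any of the tools named above) rather than calling a segmenter; the function name is an assumption for illustration.

```python
def position_vectors(phrases):
    """Return one 4-dimensional position vector per character.

    Dimensions: [first, middle, last, single-character phrase].
    `phrases` is the output of a word segmentation tool, in order.
    """
    vectors = []
    for phrase in phrases:
        if len(phrase) == 1:
            # Single-character phrase: only the fourth dimension is set.
            vectors.append([0, 0, 0, 1])
            continue
        for i in range(len(phrase)):
            if i == 0:
                vectors.append([1, 0, 0, 0])
            elif i == len(phrase) - 1:
                vectors.append([0, 0, 1, 0])
            else:
                vectors.append([0, 1, 0, 0])
    return vectors
```

Calling `position_vectors(["直肠腹膜", "及"])` reproduces the five vectors worked out in the two examples above.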
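Step 03: Identify the radical of each character in the Chinese medical record, and output for each character, according to the third correspondence rule, a third feature vector corresponding to the radical of that character.
First, each character in the Chinese medical record is matched against a preset radical dictionary, which contains the association between every Chinese character and its radical, and the matched radical is output; in this way the radical of each character is identified.
There are two alternative third correspondence rules. Under the first, the length of the feature vector equals the number of preset entity radicals; each dimension corresponds to one entity radical; and a change in the value of the dimension corresponding to an entity radical marks a character containing that radical. Under the second, the length of the feature vector is 1, and different vector values mark characters containing different entity radicals.
Entity radicals are preset as needed. For example, the two most effective entity radicals may be preset as the sickness radical ("疒") and the 月 radical ("月"); other radicals, such as the bamboo radical ("⺮") or the bone radical ("骨"), can also be added as entity radicals.
The two rules are illustrated below with the preset entity radicals being the sickness radical ("疒") and the 月 radical ("月").
Under the first rule the feature vector has length 2 and the initial feature vector is [0,0]; a change in the first dimension marks the sickness radical ("疒") and a change in the second dimension marks the 月 radical ("月"). The last three characters of "直肠腹膜" all have the 月 radical, which is a preset entity radical, so their third feature vectors are all [0,1]. The first character, "直", has neither the 月 radical nor the sickness radical, so its third feature vector is the initial feature vector [0,0]. The character "病" in "病理" has the sickness radical, so its third feature vector is [1,0].
Under the second rule the feature vector has a fixed length of 1 and the initial feature vector is [0]; the values 1 and 2 mark the sickness radical ("疒") and the 月 radical ("月") respectively. The last three characters of "直肠腹膜" all have the 月 radical, so their third feature vectors are all [2]. The first character, "直", has neither radical, so its third feature vector is the initial feature vector [0]. The character "病" in "病理" has the sickness radical, so its third feature vector is [1].
If the preset entity radicals also include the bamboo radical ("⺮") and the bone radical ("骨"), then under the first rule the feature vector has length 4 and the initial feature vector is [0,0,0,0]; under the second rule the initial feature vector is [0] and the values 1, 2, 3, 4 mark the sickness radical ("疒"), the 月 radical ("月"), the bamboo radical ("⺮"), and the bone radical ("骨") respectively. More entity radicals can be preset in the same way and are not described further.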
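A short sketch of the radical lookup and both encodings follows. The `RADICAL_DICT` excerpt is a hypothetical stand-in for the full radical dictionary the text assumes; characters missing from it simply keep the initial vector.

```python
# Hypothetical excerpt of a radical dictionary; the method assumes one
# that covers all Chinese characters.
RADICAL_DICT = {"肠": "月", "腹": "月", "膜": "月", "病": "疒"}

# Preset entity radicals, in the order used by the example in the text.
ENTITY_RADICALS = ["疒", "月"]


def radical_vector(char, one_hot=True):
    """Encode the radical of `char` under the first or second rule."""
    radical = RADICAL_DICT.get(char)
    index = ENTITY_RADICALS.index(radical) if radical in ENTITY_RADICALS else None
    if one_hot:
        # First rule: one dimension per entity radical.
        vec = [0] * len(ENTITY_RADICALS)
        if index is not None:
            vec[index] = 1
        return vec
    # Second rule: a single value, 0 meaning "no entity radical".
    return [0 if index is None else index + 1]
```

Adding the bamboo and bone radicals only requires extending `ENTITY_RADICALS`; the one-hot length and the value range of the length-1 encoding follow automatically.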
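Step 04: Perform n-gram traversal on the Chinese medical record, match each phrase obtained by the traversal against the preset original medical dictionary, prefix dictionary, and suffix dictionary, and output for each character a corresponding fourth feature vector according to the matching result and the fourth correspondence rule.
Before traversal the Chinese medical record must be preprocessed, which usually means removing punctuation.
n is a nonzero natural number less than or equal to the length of the Chinese medical record. Preferably, n is the length of the longest phrase in the original medical dictionary. The record is traversed once for every nonzero natural number not greater than n. Taking n=5 as an example, 5-gram, 4-gram, 3-gram, 2-gram, and 1-gram traversals are performed in turn.
N-gram traversal is a common method in natural language processing and is in effect a form of segmentation, where n is the length of each resulting phrase. Taking n=5: the first phrase consists of the 5 characters starting from the first character, the second of the 5 characters starting from the second character, and so on. In general, the first character of each phrase is the i-th character of the record and its last character is the (i+n-1)-th character, where 1 ≤ i ≤ (length of the Chinese medical record - n + 1).
Taking the record content "直肠腹膜返折上方未及肿块" as an example, 3-gram traversal yields "直肠腹", "肠腹膜", "腹膜返", "膜返折", "返折上", "折上方", "上方未", "方未及", "未及肿", and "及肿块".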
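The traversal itself is a sliding window. The sketch below, with assumed function names, produces the phrases for a single n and for every n from a maximum down to 1.

```python
def ngrams(text, n):
    """Return every run of n consecutive characters in `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def all_ngrams(text, max_n):
    """Run the traversal for n = max_n, max_n - 1, ..., 1."""
    return {n: ngrams(text, n) for n in range(max_n, 0, -1)}
```

A text of length L yields L - n + 1 phrases for each n, matching the index bound on i given above; for n larger than L the result is an empty list.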
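The original medical dictionary can be any existing dictionary of medical terms. In the original medical dictionary each phrase corresponds to one entity category; the sub-phrases derived from a phrase inherit, in the prefix and suffix dictionaries, the entity category of that phrase in the original medical dictionary.
The prefix dictionary is constructed as follows: identify the phrases of more than two characters in the original medical dictionary, and store the first i characters of each identified phrase in the prefix dictionary, for every natural number i that is less than the phrase length and greater than half the phrase length, where half the phrase length is taken as an integer (rounded down).
Taking the phrase "左侧粗隆间骨折" (left intertrochanteric fracture) from the original medical dictionary as an example: its length is 7 and half its length is 3.5, which is taken as the integer 3, so i takes the values 4, 5, and 6. i=6 gives "左侧粗隆间骨", i=5 gives "左侧粗隆间", and i=4 gives "左侧粗隆"; the prefix dictionary built from "左侧粗隆间骨折" therefore contains "左侧粗隆间骨", "左侧粗隆间", and "左侧粗隆".
The suffix dictionary is constructed as follows: identify the phrases of more than two characters in the original medical dictionary, and store the last i characters of each identified phrase in the suffix dictionary, for every natural number i that is less than the phrase length and greater than or equal to half the phrase length, where half the phrase length is taken as an integer (rounded down).
Taking the phrase "左侧粗隆间骨折" again: its length is 7 and half its length is 3.5, taken as the integer 3, so i takes the values 3, 4, 5, and 6. i=6 gives "侧粗隆间骨折", i=5 gives "粗隆间骨折", i=4 gives "隆间骨折", and i=3 gives "间骨折"; the suffix dictionary built from "左侧粗隆间骨折" therefore contains "侧粗隆间骨折", "粗隆间骨折", "隆间骨折", and "间骨折".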
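The two constructions can be sketched together. The dictionary is assumed here to be a mapping from term to entity category; the function name, the category assigned in the usage example, and the choice to keep the first category when two terms produce the same prefix or suffix are all assumptions, since the text does not address such collisions.

```python
def build_affix_dictionaries(medical_dict):
    """Derive prefix and suffix dictionaries from {term: entity category}.

    Each derived entry inherits the category of the term it came from.
    """
    prefix_dict, suffix_dict = {}, {}
    for term, category in medical_dict.items():
        if len(term) <= 2:
            continue  # only terms of more than two characters are used
        half = len(term) // 2  # half the length, rounded down
        # Prefixes: half < i < len(term)
        for i in range(half + 1, len(term)):
            prefix_dict.setdefault(term[:i], category)
        # Suffixes: half <= i < len(term)
        for i in range(half, len(term)):
            suffix_dict.setdefault(term[-i:], category)
    return prefix_dict, suffix_dict
```

For the seven-character example term this yields three prefixes (lengths 4 to 6) and four suffixes (lengths 3 to 6), as listed above.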
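During matching, the three dictionaries (original medical dictionary, prefix dictionary, and suffix dictionary) can be matched simultaneously or in a set order. In either case, once a phrase matches one dictionary, matching against the other two stops. Matching here means exact matching. A matching result is either a match or no match. When it is a match, the result contains the name of the matched dictionary, the matched medical term, and the entity category of that term.
The fourth feature vector distinguishes entity categories, and its length equals the number of entity categories. For example, with six entity categories (diseases and diagnoses, symptoms and signs, body parts, examinations and tests, surgeries, and drugs), the length of the feature vector is 6, the initial feature vector is [0,0,0,0,0,0], and each dimension corresponds to one entity category.
When the result is no match, the initial feature vector is output for each character.
When the result is a match, the value of the dimension corresponding to the entity category of the matched phrase is changed. The change rule differs depending on which dictionary the phrase matched, so the applicable correspondence rule is selected according to the dictionary name contained in the matching result, as follows:
When the match is with the original medical dictionary, the applicable rule is: the length of the feature vector equals the number of entity categories; each dimension corresponds to one entity category; and the initial vector value is changed to a first, second, or third vector value to mark a character in the first, middle, or last position of the phrase, respectively.
This is illustrated with the phrase "直肠腹膜", taking the first, second, and third vector values as 1, 2, and 3. The phrase appears in the original medical dictionary and is associated with the entity category body parts, so the third dimension is changed for all four characters. According to the position of each character: "直" is first, so the third dimension is changed from 0 to 1 and its feature vector is [0,0,1,0,0,0]; "肠" and "腹" are both in the middle, so both get the same feature vector with the third dimension changed from 0 to 2, namely [0,0,2,0,0,0]; "膜" is last, so the third dimension is changed from 0 to 3 and its feature vector is [0,0,3,0,0,0].
When the match is with the prefix dictionary, the applicable rule is: the length of the feature vector equals the number of entity categories; each dimension corresponds to one entity category; and the initial vector value is changed to the first or second vector value to mark a character in the first or a non-first position of the phrase, respectively.
This is illustrated with the phrase "血细胞" (blood cell), taking the first and second vector values as 1 and 2. The phrase appears in the prefix dictionary and is associated with the entity category examinations and tests, so the fourth dimension is changed for all three characters. "血" is first, so the fourth dimension is changed from 0 to 1 and its feature vector is [0,0,0,1,0,0]; "细" and "胞" are both in non-first positions, so the fourth dimension is changed from 0 to 2 and their feature vectors are both [0,0,0,2,0,0].
When the match is with the suffix dictionary, the applicable rule is: the length of the feature vector equals the number of entity categories; each dimension corresponds to one entity category; and the initial vector value is changed to the second or third vector value to mark a character in a non-last or the last position of the phrase, respectively.
This is illustrated with the phrase "彩超" (color Doppler ultrasound), taking the second and third vector values as 2 and 3. The phrase appears in the suffix dictionary and is associated with the entity category examinations and tests, so the fourth dimension is changed for both characters. "彩" is in a non-last position, so the fourth dimension is changed from 0 to 2 and its feature vector is [0,0,0,2,0,0]; "超" is last, so the fourth dimension is changed from 0 to 3 and its feature vector is [0,0,0,3,0,0].
Note that n-gram traversal divides a passage several times with different phrase lengths, so each character obtains n feature vectors. There are only two possibilities for these vectors: the initial feature vector output on a non-match, or the corresponding feature vector output on a match (the vectors output on a match are the same across traversals). If a character is matched in at least one traversal, its final output is the corresponding matched feature vector; only if it is never matched is its final output the initial feature vector.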
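The matching and encoding of step 04 might be combined as in the sketch below. It is illustrative only: dictionaries are assumed to map phrase to entity category, the matching order (original, prefix, suffix) and the longest-n-first traversal are arbitrary choices, and a character keeps the vector from its first match rather than relying on all matches agreeing.

```python
CATEGORIES = ["疾病和诊断", "症状和体征", "身体部位", "检查和检验", "手术", "药物"]


def position_value(kind, offset, n):
    """Vector value (1, 2, or 3) for the character at `offset` in an n-character match."""
    first, last = offset == 0, offset == n - 1
    if kind == "prefix":
        return 1 if first else 2
    if kind == "suffix":
        return 3 if last else 2
    # Original medical dictionary: first / middle / last.
    if first:
        return 1
    return 3 if last else 2


def fourth_feature_vectors(text, medical, prefixes, suffixes, max_n):
    """Return one category vector per character of `text` (punctuation already removed)."""
    vectors = [[0] * len(CATEGORIES) for _ in text]
    sources = (("full", medical), ("prefix", prefixes), ("suffix", suffixes))
    for n in range(min(max_n, len(text)), 0, -1):
        for start in range(len(text) - n + 1):
            phrase = text[start:start + n]
            for kind, dictionary in sources:
                if phrase not in dictionary:
                    continue
                dim = CATEGORIES.index(dictionary[phrase])
                for offset in range(n):
                    pos = start + offset
                    if not any(vectors[pos]):  # keep the first match per character
                        vectors[pos][dim] = position_value(kind, offset, n)
                break  # stop after the first dictionary that matches
    return vectors
```

With "直肠腹膜" in the original dictionary under 身体部位, the four characters receive the values 1, 2, 2, 3 in the third dimension, as in the worked example.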
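Step 05: Convert each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and output for each character, according to the fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character.
The Chinese pinyin conversion tool is prior art; a Python package can be used as the conversion tool. The converted pinyin may omit the tone or indicate it with 1, 2, 3, or 4. Taking "匹" as an example, the pinyin may be "pi" or "pi1".
The fifth correspondence rule is: the length of the feature vector is 1, and different vector values characterize different pinyin.
Taking the pinyin "pi" of "匹" as an example: under this rule the feature vector has length 1 and its initial value is [0]. Every pinyin has a preset numeric code; if the code for "pi" is 20, that value replaces the initial value 0, so the feature vector for "pi" is [20].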
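The fifth rule reduces to two lookups, sketched below. Both tables are hypothetical: `CHAR_TO_PINYIN` stands in for whatever pinyin conversion tool is used, and the numeric codes in `PINYIN_IDS` are arbitrary apart from the value 20 for "pi" taken from the example.

```python
# Hypothetical stand-in for a pinyin conversion tool (toneless pinyin).
CHAR_TO_PINYIN = {"匹": "pi", "皮": "pi", "直": "zhi"}

# Hypothetical preset numeric codes, one per pinyin syllable.
PINYIN_IDS = {"pi": 20, "zhi": 21}


def pinyin_vector(char):
    """Return the length-1 fifth feature vector; [0] when the pinyin is unknown."""
    return [PINYIN_IDS.get(CHAR_TO_PINYIN.get(char), 0)]
```

Characters that share a pronunciation, such as 匹 and 皮 in this toy table, receive the same vector.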
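Step 06: According to the splicing rule, splice the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record.
Suppose the initial vector of a character is [0] and its first to fifth feature vectors are [1a], [2b], [3c], [4d], and [5e] respectively; the final feature vector of that character is then [0,1a,2b,3c,4d,5e].
Any further feature vectors can be appended in the same way, without limit on their number; moreover, the splicing order is not limited to the one shown above.
The splicing of the feature vectors is further illustrated with a Chinese medical record containing "性别:男……年龄:78……直肠腹膜……".
According to step 01, each character in the record has two first feature vectors: [1], characterizing the male patient type (using the second patient-type rule), and [78], characterizing the age of 78.
According to step 02, "直肠腹膜" is a four-character phrase, and the second feature vectors of its four characters are [1,0,0,0], [0,1,0,0], [0,1,0,0], and [0,0,1,0] respectively.
According to step 03, the last three characters of "直肠腹膜" all have the 月 radical, so the third feature vector of each of these three characters is [2].
According to step 04, "直肠腹膜" appears in the original medical dictionary and is associated with the entity category body parts, so the fourth feature vectors of its four characters are [0,0,1,0,0,0], [0,0,2,0,0,0], [0,0,2,0,0,0], and [0,0,3,0,0,0] respectively.
According to step 05, the content of the record is converted into pinyin, and the fifth feature vector is obtained from the numeric code preset for each pinyin. Suppose the numeric codes of the characters in "性别:男……年龄:78……直肠腹膜……" are 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, and 17 respectively; the fifth feature vectors of those characters are then [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], and [17].
Suppose the first to fifth feature vectors are spliced in that order and the initial vector of every character is [0]. The vector set obtained for the characters "性别:男……直肠腹膜" of the record is then [0,1,78,1,0,0,0,0,0,0,0,0,0,0,7][0,1,78,0,0,1,0,0,0,0,0,0,0,0,8][0,1,78,0,0,0,1,0,0,0,0,0,0,0,9]……[0,1,78,1,0,0,0,0,0,0,1,0,0,0,14][0,1,78,0,1,0,0,2,0,0,2,0,0,0,15][0,1,78,0,1,0,0,2,0,0,2,0,0,0,16][0,1,78,0,0,1,0,2,0,0,3,0,0,0,17].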
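Splicing is plain list concatenation per character. The helper names below are assumptions for illustration; the order of the arguments determines the splicing order.

```python
def splice(initial, *features):
    """Append each feature vector, in order, after the character's initial vector."""
    out = list(initial)
    for feature in features:
        out.extend(feature)
    return out


def build_vector_set(initials, *feature_lists):
    """Splice per-character feature lists into one vector per character.

    Each argument is a list with one entry per character of the record.
    """
    return [splice(init, *feats) for init, *feats in zip(initials, *feature_lists)]
```

For the character 直 in the example, `splice([0], [1, 78], [1, 0, 0, 0], [0], [0, 0, 1, 0, 0, 0], [14])` gives the 15-dimensional vector [0,1,78,1,0,0,0,0,0,0,1,0,0,0,14] shown above.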
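Step 07: Input the vector set characterizing the Chinese medical record into the trained model to extract the entities therein.
The model here is a deep neural network model, for example a bidirectional LSTM + CRF; a traditional machine learning model can also be used. During training, input vectors and the corresponding output values are defined for the model. After training, once the model detects that the input vector set contains a particular pattern of vector values, it can recognize the corresponding entity feature. For example, given the input vector set [0,1,78,1,0,0,0,0,0,0,0,0,0,0,7][0,1,78,0,0,1,0,0,0,0,0,0,0,0,8][0,1,78,0,0,0,1,0,0,0,0,0,0,0,9]……[0,1,78,1,0,0,0,0,0,0,1,0,0,0,14][0,1,78,0,1,0,0,2,0,0,2,0,0,0,15][0,1,78,0,1,0,0,2,0,0,2,0,0,0,16][0,1,78,0,0,1,0,2,0,0,3,0,0,0,17], the values in the different dimensions show that this vector set characterizes the Chinese medical record of a 78-year-old male and that the last four feature vectors characterize a four-character phrase for a body part.
By first converting the entity-related features identified in the Chinese medical record into feature vectors and finally generating a vector set as the model input, this embodiment can effectively improve the accuracy of entity extraction by the model.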
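Second, the present invention proposes an entity recognition apparatus for Chinese medical records; the apparatus 20 can be divided into one or more modules.
For example, FIG. 2 shows a structural diagram of an embodiment of the entity recognition apparatus 20 for Chinese medical records. In this embodiment, the apparatus 20 can be divided into a first feature vector generation module 201, a second feature vector generation module 202, a third feature vector generation module 203, a fourth feature vector generation module 204, a fifth feature vector generation module 205, a vector set generation module 206, and an entity recognition model 207. The specific functions of modules 201-207 are described below.
The first feature vector generation module 201 is configured to identify the personal information contained in the Chinese medical record and to output a first feature vector corresponding to the personal information according to the first correspondence rule; each character in the Chinese medical record corresponds to the same first feature vector.
The second feature vector generation module 202 is configured to segment the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, to output for each character, according to the second correspondence rule, a second feature vector characterizing the position of that character within its phrase.
The third feature vector generation module 203 is configured to identify the radical of each character in the Chinese medical record and to output for each character, according to the third correspondence rule, a third feature vector corresponding to the radical of that character.
The fourth feature vector generation module 204 is configured to perform n-gram traversal on the Chinese medical record, to match each phrase obtained by the traversal against the preset original medical dictionary, prefix dictionary, and suffix dictionary, and to output for each character a corresponding fourth feature vector according to the matching result and the fourth correspondence rule.
The fifth feature vector generation module 205 is configured to convert each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool and to output for each character, according to the fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character.
The vector set generation module 206 is configured to splice, according to the splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record.
The entity recognition model 207 is configured to input the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.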
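Third, the present invention also proposes a computer device.
FIG. 3 is a schematic diagram of the hardware architecture of an embodiment of the computer device of the present invention. In this embodiment, the computer device 2 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. For example, it can be a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including a standalone server or a server cluster composed of multiple servers). As shown in the figure, the computer device 2 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23 that can communicate with each other through a system bus. Specifically:
The memory 21 includes at least one type of computer-readable storage medium; the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, for example a hard disk or memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 2. The memory 21 may of course also include both the internal storage unit of the computer device 2 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and various application software installed on the computer device 2, for example the computer program implementing the entity recognition method for Chinese medical records. In addition, the memory 21 can be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2, for example to perform control and processing related to data interaction or communication with the computer device 2. In this embodiment, the processor 22 is used to run program code or process data stored in the memory 21, for example to run the computer program implementing the entity recognition method for Chinese medical records.
The network interface 23 may include a wireless network interface or a wired network interface and is generally used to establish a communication connection between the computer device 2 and other computer devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It should be noted that FIG. 3 only shows the computer device 2 with components 21-23; not all of the illustrated components are required, and more or fewer components may be implemented instead.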
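In this embodiment, the computer program stored in the memory 21 for implementing the entity recognition method for Chinese medical records can be executed by one or more processors (in this embodiment, the processor 22) to perform the following steps:
Step 01: Identify the personal information contained in the Chinese medical record, and output a first feature vector corresponding to the personal information according to the first correspondence rule; each character in the Chinese medical record corresponds to the same first feature vector.
Step 02: Segment the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, output for each character, according to the second correspondence rule, a second feature vector characterizing the position of that character within its phrase.
Step 03: Identify the radical of each character in the Chinese medical record, and output for each character, according to the third correspondence rule, a third feature vector corresponding to the radical of that character.
Step 04: Perform n-gram traversal on the Chinese medical record, match each phrase obtained by the traversal against the preset original medical dictionary, prefix dictionary, and suffix dictionary, and output for each character a corresponding fourth feature vector according to the matching result and the fourth correspondence rule.
Step 05: Convert each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and output for each character, according to the fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character.
Step 06: According to the splicing rule, splice the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record.
Step 07: Input the vector set characterizing the Chinese medical record into the trained model to extract the entities therein.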
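In addition, the present invention provides a computer-readable storage medium, which is a non-volatile readable storage medium storing a computer program that can be executed by at least one processor to implement the operations of the above entity recognition method or apparatus for Chinese medical records.
The computer-readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the computer-readable storage medium may be an internal storage unit of a computer device, for example a hard disk or memory of the computer device. In other embodiments, the computer-readable storage medium may also be an external storage device of the computer device, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device. The computer-readable storage medium may of course also include both the internal storage unit of the computer device and its external storage device. In this embodiment, the computer-readable storage medium is generally used to store the operating system and various application software installed on the computer device, for example the aforementioned computer program implementing the entity recognition method for Chinese medical records. In addition, the computer-readable storage medium can be used to temporarily store various types of data that have been output or are to be output.
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that they are only examples and that the scope of protection of the present invention is defined by the appended claims. Those skilled in the art can make various changes or modifications to these embodiments without departing from the principle and essence of the present invention, and all such changes and modifications fall within the scope of protection of the present invention.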

Claims (20)

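  1. An entity recognition method for Chinese medical records, comprising the following steps:
    identifying the personal information contained in the Chinese medical record, and outputting a first feature vector corresponding to the personal information according to a first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
    segmenting the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, outputting for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase;
    identifying the radical of each character in the Chinese medical record, and outputting for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character;
    performing n-gram traversal on the Chinese medical record, matching each phrase obtained by the traversal against a preset original medical dictionary, prefix dictionary, and suffix dictionary, and outputting for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule;
    converting each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and outputting for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character;
    splicing, according to a splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
    inputting the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.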
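  2. The entity recognition method for Chinese medical records according to claim 1, wherein identifying the personal information contained in the Chinese medical record comprises the following steps: creating a regular expression for matching personal information; and matching the regular expression against the patient's basic information to identify the personal information contained therein; the first correspondence rule comprises a correspondence rule between patient type and feature vector and a correspondence rule between patient age and feature vector;
    the correspondence rule between patient type and feature vector comprises: the length of the feature vector equals the number of patient types; each dimension of the feature vector corresponds to one of the patient types; and the feature vector characterizes the corresponding patient type through a change in the vector value of the dimension corresponding to that patient type;
    or, the length of the feature vector is 1, and the feature vector characterizes different patient types through different vector values;
    the correspondence rule between patient age and feature vector comprises: the length of the feature vector is 1; the feature vector characterizes different patient ages through different vector values, the vector value being equal to the patient age.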
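  3. The entity recognition method for Chinese medical records according to claim 1, wherein the second correspondence rule comprises: the length of the feature vector is 4; the first three dimensions of the feature vector characterize phrases of two or more characters, wherein a change in the vector value of the first dimension characterizes the first character of the phrase, a change in the vector value of the second dimension characterizes a character in the middle of the phrase, and a change in the vector value of the third dimension characterizes the last character of the phrase; the fourth dimension of the feature vector characterizes single-character phrases, a change in the vector value of the fourth dimension characterizing the character of a single-character phrase.
  4. The entity recognition method for Chinese medical records according to claim 1, wherein identifying the radical of each character in the Chinese medical record comprises the following steps: matching each character in the Chinese medical record against a preset radical dictionary and outputting the matched radical, the radical dictionary containing the association between every Chinese character and its radical;
    the third correspondence rule comprises: the length of the feature vector equals the number of preset entity radicals; each dimension of the feature vector corresponds to one of the entity radicals; and the feature vector characterizes a character containing an entity radical through a change in the vector value of the dimension corresponding to that entity radical;
    or, the length of the feature vector is 1, and the feature vector characterizes characters containing different entity radicals through different vector values.
  5. The entity recognition method for Chinese medical records according to claim 1, wherein constructing the prefix dictionary comprises the following steps: identifying the phrases of more than two characters in the original medical dictionary; and storing the first i characters of each identified phrase in the prefix dictionary, i being a natural number less than the length of the phrase and greater than half the length of the phrase, half the length of the phrase being taken as an integer;
    constructing the suffix dictionary comprises the following steps: identifying the phrases of more than two characters in the original medical dictionary; and storing the last i characters of each identified phrase in the suffix dictionary, i being a natural number less than the length of the phrase and greater than or equal to half the length of the phrase, half the length of the phrase being taken as an integer.
  6. The entity recognition method for Chinese medical records according to claim 1, wherein the fourth correspondence rule comprises a correspondence rule applicable when the match is with the original medical dictionary, a correspondence rule applicable when the match is with the prefix dictionary, and a correspondence rule applicable when the match is with the suffix dictionary;
    the correspondence rule applicable when the match is with the original medical dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in the first, middle, or last position of the phrase by changing the initial vector value to a first vector value, a second vector value, or a third vector value respectively;
    the correspondence rule applicable when the match is with the prefix dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in the first or a non-first position of the phrase by changing the initial vector value to the first vector value or the second vector value respectively;
    the correspondence rule applicable when the match is with the suffix dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in a non-last or the last position of the phrase by changing the initial vector value to the second vector value or the third vector value respectively.
  7. The entity recognition method for Chinese medical records according to claim 1, wherein the fifth correspondence rule comprises: the length of the feature vector is 1; and the feature vector characterizes different pinyin through different vector values.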
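  8. An entity recognition apparatus for Chinese medical records, comprising:
    a first feature vector generation module, configured to identify the personal information contained in the Chinese medical record and to output a first feature vector corresponding to the personal information according to a first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
    a second feature vector generation module, configured to segment the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, to output for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase;
    a third feature vector generation module, configured to identify the radical of each character in the Chinese medical record and to output for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character;
    a fourth feature vector generation module, configured to perform n-gram traversal on the Chinese medical record, to match each phrase obtained by the traversal against a preset original medical dictionary, prefix dictionary, and suffix dictionary, and to output for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule;
    a fifth feature vector generation module, configured to convert each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool and to output for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character;
    a vector set generation module, configured to splice, according to a splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
    an entity recognition model, configured to input the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.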
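  9. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the following steps:
    identifying the personal information contained in the Chinese medical record, and outputting a first feature vector corresponding to the personal information according to a first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
    segmenting the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, outputting for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase;
    identifying the radical of each character in the Chinese medical record, and outputting for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character;
    performing n-gram traversal on the Chinese medical record, matching each phrase obtained by the traversal against a preset original medical dictionary, prefix dictionary, and suffix dictionary, and outputting for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule;
    converting each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and outputting for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character;
    splicing, according to a splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
    inputting the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.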
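  10. The computer device according to claim 9, wherein identifying the personal information contained in the Chinese medical record comprises the following steps: creating a regular expression for matching personal information; and matching the regular expression against the patient's basic information to identify the personal information contained therein;
    the first correspondence rule comprises a correspondence rule between patient type and feature vector and a correspondence rule between patient age and feature vector;
    the correspondence rule between patient type and feature vector comprises: the length of the feature vector equals the number of patient types; each dimension of the feature vector corresponds to one of the patient types; and the feature vector characterizes the corresponding patient type through a change in the vector value of the dimension corresponding to that patient type;
    or, the length of the feature vector is 1, and the feature vector characterizes different patient types through different vector values;
    the correspondence rule between patient age and feature vector comprises: the length of the feature vector is 1; the feature vector characterizes different patient ages through different vector values, the vector value being equal to the patient age.
  11. The computer device according to claim 9, wherein the second correspondence rule comprises: the length of the feature vector is 4; the first three dimensions of the feature vector characterize phrases of two or more characters, wherein a change in the vector value of the first dimension characterizes the first character of the phrase, a change in the vector value of the second dimension characterizes a character in the middle of the phrase, and a change in the vector value of the third dimension characterizes the last character of the phrase; the fourth dimension of the feature vector characterizes single-character phrases, a change in the vector value of the fourth dimension characterizing the character of a single-character phrase;
    the fifth correspondence rule comprises: the length of the feature vector is 1; and the feature vector characterizes different pinyin through different vector values.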
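  12. The computer device according to claim 9, wherein identifying the radical of each character in the Chinese medical record comprises the following steps:
    matching each character in the Chinese medical record against a preset radical dictionary and outputting the matched radical, the radical dictionary containing the association between every Chinese character and its radical;
    the third correspondence rule comprises: the length of the feature vector equals the number of preset entity radicals; each dimension of the feature vector corresponds to one of the entity radicals; and the feature vector characterizes a character containing an entity radical through a change in the vector value of the dimension corresponding to that entity radical;
    or, the length of the feature vector is 1, and the feature vector characterizes characters containing different entity radicals through different vector values.
  13. The computer device according to claim 9, wherein constructing the prefix dictionary comprises the following steps: identifying the phrases of more than two characters in the original medical dictionary; and storing the first i characters of each identified phrase in the prefix dictionary, i being a natural number less than the length of the phrase and greater than half the length of the phrase, half the length of the phrase being taken as an integer;
    constructing the suffix dictionary comprises the following steps: identifying the phrases of more than two characters in the original medical dictionary; and storing the last i characters of each identified phrase in the suffix dictionary, i being a natural number less than the length of the phrase and greater than or equal to half the length of the phrase, half the length of the phrase being taken as an integer.
  14. The computer device according to claim 9, wherein the fourth correspondence rule comprises a correspondence rule applicable when the match is with the original medical dictionary, a correspondence rule applicable when the match is with the prefix dictionary, and a correspondence rule applicable when the match is with the suffix dictionary;
    the correspondence rule applicable when the match is with the original medical dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in the first, middle, or last position of the phrase by changing the initial vector value to a first vector value, a second vector value, or a third vector value respectively;
    the correspondence rule applicable when the match is with the prefix dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in the first or a non-first position of the phrase by changing the initial vector value to the first vector value or the second vector value respectively;
    the correspondence rule applicable when the match is with the suffix dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in a non-last or the last position of the phrase by changing the initial vector value to the second vector value or the third vector value respectively.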
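  15. A non-volatile computer-readable storage medium storing a computer program that can be executed by at least one processor to implement the following steps:
    identifying the personal information contained in the Chinese medical record, and outputting a first feature vector corresponding to the personal information according to a first correspondence rule, each character in the Chinese medical record corresponding to the same first feature vector;
    segmenting the Chinese medical record with a word segmentation tool and, taking each phrase obtained by segmentation as a unit, outputting for each character, according to a second correspondence rule, a second feature vector characterizing the position of that character within its phrase;
    identifying the radical of each character in the Chinese medical record, and outputting for each character, according to a third correspondence rule, a third feature vector corresponding to the radical of that character;
    performing n-gram traversal on the Chinese medical record, matching each phrase obtained by the traversal against a preset original medical dictionary, prefix dictionary, and suffix dictionary, and outputting for each character a corresponding fourth feature vector according to the matching result and a fourth correspondence rule;
    converting each character in the Chinese medical record into pinyin with a Chinese pinyin conversion tool, and outputting for each character, according to a fifth correspondence rule, a fifth feature vector corresponding to the pinyin of that character;
    splicing, according to a splicing rule, the first feature vector, the second feature vector, the third feature vector, the fourth feature vector, and the fifth feature vector after the initial vector of each character, to obtain a vector set characterizing the Chinese medical record;
    inputting the vector set characterizing the Chinese medical record into a trained model to extract the entities therein.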
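  16. The computer-readable storage medium according to claim 15, wherein identifying the personal information contained in the Chinese medical record comprises the following steps: creating a regular expression for matching personal information; and matching the regular expression against the patient's basic information to identify the personal information contained therein; the first correspondence rule comprises a correspondence rule between patient type and feature vector and a correspondence rule between patient age and feature vector;
    the correspondence rule between patient type and feature vector comprises: the length of the feature vector equals the number of patient types; each dimension of the feature vector corresponds to one of the patient types; and the feature vector characterizes the corresponding patient type through a change in the vector value of the dimension corresponding to that patient type;
    or, the length of the feature vector is 1, and the feature vector characterizes different patient types through different vector values;
    the correspondence rule between patient age and feature vector comprises: the length of the feature vector is 1; the feature vector characterizes different patient ages through different vector values, the vector value being equal to the patient age.
  17. The computer-readable storage medium according to claim 15, wherein the second correspondence rule comprises: the length of the feature vector is 4; the first three dimensions of the feature vector characterize phrases of two or more characters, wherein a change in the vector value of the first dimension characterizes the first character of the phrase, a change in the vector value of the second dimension characterizes a character in the middle of the phrase, and a change in the vector value of the third dimension characterizes the last character of the phrase; the fourth dimension of the feature vector characterizes single-character phrases, a change in the vector value of the fourth dimension characterizing the character of a single-character phrase;
    the fifth correspondence rule comprises: the length of the feature vector is 1; and the feature vector characterizes different pinyin through different vector values.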
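  18. The computer-readable storage medium according to claim 15, wherein identifying the radical of each character in the Chinese medical record comprises the following steps: matching each character in the Chinese medical record against a preset radical dictionary and outputting the matched radical, the radical dictionary containing the association between every Chinese character and its radical;
    the third correspondence rule comprises: the length of the feature vector equals the number of preset entity radicals; each dimension of the feature vector corresponds to one of the entity radicals; and the feature vector characterizes a character containing an entity radical through a change in the vector value of the dimension corresponding to that entity radical;
    or, the length of the feature vector is 1, and the feature vector characterizes characters containing different entity radicals through different vector values.
  19. The computer-readable storage medium according to claim 15, wherein constructing the prefix dictionary comprises the following steps: identifying the phrases of more than two characters in the original medical dictionary; and storing the first i characters of each identified phrase in the prefix dictionary, i being a natural number less than the length of the phrase and greater than half the length of the phrase, half the length of the phrase being taken as an integer;
    constructing the suffix dictionary comprises the following steps: identifying the phrases of more than two characters in the original medical dictionary; and storing the last i characters of each identified phrase in the suffix dictionary, i being a natural number less than the length of the phrase and greater than or equal to half the length of the phrase, half the length of the phrase being taken as an integer.
  20. The computer-readable storage medium according to claim 15, wherein the fourth correspondence rule comprises a correspondence rule applicable when the match is with the original medical dictionary, a correspondence rule applicable when the match is with the prefix dictionary, and a correspondence rule applicable when the match is with the suffix dictionary;
    the correspondence rule applicable when the match is with the original medical dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in the first, middle, or last position of the phrase by changing the initial vector value to a first vector value, a second vector value, or a third vector value respectively;
    the correspondence rule applicable when the match is with the prefix dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in the first or a non-first position of the phrase by changing the initial vector value to the first vector value or the second vector value respectively;
    the correspondence rule applicable when the match is with the suffix dictionary comprises: the length of the feature vector equals the number of entity categories; each dimension of the feature vector corresponds to one of the entity categories; and the feature vector characterizes a character in a non-last or the last position of the phrase by changing the initial vector value to the second vector value or the third vector value respectively.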
PCT/CN2019/103379 2019-04-19 2019-08-29 Entity recognition method, apparatus, device, and storage medium for Chinese medical records WO2020211250A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SG11202008377SA SG11202008377SA (en) 2019-04-19 2019-08-29 Entity recognizing method and apparatus of chinese medical record, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910316061.1A CN110162784B (zh) 2019-04-19 2019-04-19 Entity recognition method, apparatus, device, and storage medium for Chinese medical records
CN201910316061.1 2019-04-19

Publications (1)

Publication Number Publication Date
WO2020211250A1 true WO2020211250A1 (zh) 2020-10-22

Family

ID=67638662

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103379 WO2020211250A1 (zh) 2019-04-19 2019-08-29 Entity recognition method, apparatus, device, and storage medium for Chinese medical records

Country Status (3)

Country Link
CN (1) CN110162784B (zh)
SG (1) SG11202008377SA (zh)
WO (1) WO2020211250A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
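CN112542222A (zh) * 2020-12-21 2021-03-23 中南大学 Deep-learning-based joint entity and relation extraction method for Chinese electronic medical records
CN117954038A (zh) * 2024-03-27 2024-04-30 江西曼荼罗软件有限公司 Clinical medical record text recognition method, system, readable storage medium, and device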

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
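CN110162784B (zh) * 2019-04-19 2023-10-27 平安科技(深圳)有限公司 Entity recognition method, apparatus, device, and storage medium for Chinese medical records
CN110659639B (zh) * 2019-09-24 2021-11-05 北京字节跳动网络技术有限公司 Chinese character recognition method and apparatus, computer-readable medium, and electronic device
CN112131862B (zh) * 2020-07-20 2021-12-03 中国中医科学院中医药信息研究所 Traditional Chinese medicine case record data processing method and apparatus, and electronic device
CN113609861B (zh) * 2021-08-10 2024-02-23 北京工商大学 Multi-dimensional feature named entity recognition method and system based on food literature data
CN114548087A (zh) * 2021-12-22 2022-05-27 毕胜普生物科技有限公司 Traditional Chinese medicine text processing method and apparatus, computer device, and storage medium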

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
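CN106446526A (zh) * 2016-08-31 2017-02-22 北京千安哲信息技术有限公司 Method and apparatus for extracting entity relations from electronic medical records
CN106980608A (zh) * 2017-03-16 2017-07-25 四川大学 Word segmentation and named entity recognition method and system for Chinese electronic medical records
WO2017172629A1 (en) * 2016-03-28 2017-10-05 Icahn School Of Medicine At Mount Sinai Systems and methods for applying deep learning to data
CN109388807A (zh) * 2018-10-30 2019-02-26 中山大学 Method, apparatus, and storage medium for named entity recognition in electronic medical records
CN110162784A (zh) * 2019-04-19 2019-08-23 平安科技(深圳)有限公司 Entity recognition method, apparatus, device, and storage medium for Chinese medical records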

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
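CN107808124B (zh) * 2017-10-09 2019-03-26 平安科技(深圳)有限公司 Electronic device, method for recognizing named entities in medical text, and storage medium
CN108628824A (zh) * 2018-04-08 2018-10-09 上海熙业信息科技有限公司 Entity recognition method based on Chinese electronic medical records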

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
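CN112542222A (zh) * 2020-12-21 2021-03-23 中南大学 Deep-learning-based joint entity and relation extraction method for Chinese electronic medical records
CN112542222B (zh) * 2020-12-21 2024-02-02 中南大学 Deep-learning-based joint entity and relation extraction method for Chinese electronic medical records
CN117954038A (zh) * 2024-03-27 2024-04-30 江西曼荼罗软件有限公司 Clinical medical record text recognition method, system, readable storage medium, and device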

Also Published As

Publication number Publication date
SG11202008377SA (en) 2020-11-27
CN110162784A (zh) 2019-08-23
CN110162784B (zh) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2020211250A1 (zh) Entity recognition method, apparatus, device and storage medium for Chinese medical records
US11531804B2 (en) Enhancing reading accuracy, efficiency and retention
Daud et al. Urdu language processing: a survey
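CN110162782B (zh) Medical-dictionary-based entity extraction method, apparatus, device and storage medium
CN112597774B (zh) Chinese medical named entity recognition method, system, storage medium and device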
US10339143B2 (en) Systems and methods for relation extraction for Chinese clinical documents
CN105184053B (zh) Automatic coding method and system for Chinese medical service item information
US20190228074A1 (en) System for machine translation
Hasan et al. Neural clinical paraphrase generation with attention
Liang et al. A novel approach towards medical entity recognition in Chinese clinical text
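CN105138829A (zh) Natural language processing method and system for Chinese diagnosis and treatment information
CN111292814A (zh) Method and apparatus for standardizing medical data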
Adduru et al. Towards Dataset Creation And Establishing Baselines for Sentence-level Neural Clinical Paraphrase Generation and Simplification.
Cohen et al. Text Classification
Kumar et al. Morphological analysis of the Dravidian language family
Spasić et al. Head to head: Semantic similarity of multi–word terms
Duan et al. Chinese EMR Named Entity Recognition Using Fused Label Relations Based on Machine Reading Comprehension Framework
Chen et al. A simplification–translation–restoration framework for domain adaptation in statistical machine translation: A case study in medical record translation
Grasso et al. Beyond NER: towards semantics in clinical text
McTait Translation pattern extraction and recombination for example-based machine translation
Luo et al. Dissecting the ambiguity of FMA concept names using taxonomy and partonomy structural information
Barrett Natural language processing techniques for the purpose of sentinel event information extraction
Savkov et al. Chunking clinical text containing non-canonical language
Wang et al. A Novel Method of Chinese Electronic Medical Records Entity Labeling Based on BIC model.
Ramasamy Parsing under-resourced languages: Cross-lingual transfer strategies for Indian languages

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 04/02/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19924906

Country of ref document: EP

Kind code of ref document: A1