WO2022073341A1 - Disease entity matching method and apparatus based on speech semantics, and computer device - Google Patents

Disease entity matching method and apparatus based on speech semantics, and computer device

Info

Publication number
WO2022073341A1
WO2022073341A1 PCT/CN2021/090810 CN2021090810W WO2022073341A1 WO 2022073341 A1 WO2022073341 A1 WO 2022073341A1 CN 2021090810 W CN2021090810 W CN 2021090810W WO 2022073341 A1 WO2022073341 A1 WO 2022073341A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
disease
matching
disease entity
matched
Prior art date
Application number
PCT/CN2021/090810
Other languages
English (en)
French (fr)
Inventor
方春华
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022073341A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method, apparatus and computer equipment for disease entity matching based on speech semantics.
  • a medical record is an individual's health information recorded in medical activities, in which a disease entity is recorded, that is, the name of a patient's disease.
  • the purpose of the embodiments of the present application is to propose a method, device, computer equipment and storage medium for disease entity matching based on speech semantics, so as to solve the problem of low disease entity matching efficiency.
  • the embodiments of the present application provide a method for matching disease entities based on speech semantics, which adopts the following technical solutions:
  • the disease entity matching dictionary includes matching disease entity pairs;
  • the initial disease entity matching model is a pre-trained BERT model
  • the entity to be matched is input into the disease entity matching model to perform entity matching, and an entity matching result is obtained.
  • the embodiments of the present application also provide a device for matching disease entities based on speech semantics, which adopts the following technical solutions:
  • a first acquisition module configured to acquire a disease entity matching dictionary and candidate disease entities; wherein, the disease entity matching dictionary includes matching disease entity pairs;
  • an entity combination module configured to combine the candidate disease entities in pairs to obtain a set of candidate disease entity pairs
  • an entity pair extraction module for randomly extracting candidate disease entity pairs from the candidate disease entity pair set
  • the sample input module is configured to use the extracted candidate disease entity pairs as negative samples and the matching disease entity pairs as positive samples, and input the positive samples and the negative samples into an initial disease entity matching model, wherein the initial disease entity matching model is a pre-trained BERT model;
  • a model training module for training the initial disease entity matching model according to the positive sample and the negative sample, to obtain a disease entity matching model
  • the second acquisition module is used to acquire the entity to be matched
  • the entity matching module is configured to input the entity to be matched into the disease entity matching model to perform entity matching to obtain an entity matching result.
  • an embodiment of the present application further provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • the disease entity matching dictionary includes matching disease entity pairs;
  • the initial disease entity matching model is a pre-trained BERT model
  • the entity to be matched is input into the disease entity matching model to perform entity matching, and an entity matching result is obtained.
  • the embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • the disease entity matching dictionary includes matching disease entity pairs;
  • the initial disease entity matching model is a pre-trained BERT model
  • the entity to be matched is input into the disease entity matching model to perform entity matching, and an entity matching result is obtained.
  • the embodiments of the present application mainly have the following beneficial effects: after the disease entity matching dictionary and the candidate disease entities are obtained, the candidate disease entities are combined in pairs to construct negative samples, and the disease entity matching dictionary is used as positive samples; the positive samples and the negative samples are input into the initial disease entity matching model for sufficient training.
  • the initial disease entity matching model can be a pre-trained BERT model with rich semantic information, so accurate matching results can be obtained even when the training sample size is small, which shortens the training time and improves the training efficiency of the disease entity matching model; after the training is completed, the disease entity matching model can perform entity matching on the input entities to be matched, which improves the efficiency of disease entity matching.
  • FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a method for matching disease entities based on speech semantics according to the present application
  • FIG. 3 is a schematic structural diagram of an embodiment of a device for matching disease entities based on speech semantics according to the present application
  • FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social platform software, and the like.
  • the terminal devices 101, 102, and 103 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • the method for matching disease entities based on speech semantics provided in the embodiments of the present application is generally performed by a server, and accordingly, the apparatus for matching disease entities based on speech semantics is generally set in the server.
  • This application can be applied in the field of medical science and technology.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • the method for matching disease entities based on speech semantics includes the following steps:
  • Step S201 obtaining a disease entity matching dictionary and candidate disease entities; wherein, the disease entity matching dictionary includes matching disease entity pairs.
  • in this embodiment, the electronic device (e.g., the server shown in FIG. 1) on which the method for matching disease entities based on speech semantics runs may communicate with the terminal device through a wired connection or a wireless connection.
  • the above wireless connection methods may include but are not limited to 3G/4G connection, WiFi connection, Bluetooth connection, WiMAX connection, Zigbee connection, UWB (ultra wideband) connection, and other wireless connection methods currently known or developed in the future.
  • the disease entity matching dictionary is used to record matched disease entity pairs; the matched disease entity pair may be a combination of matched disease entities.
  • Candidate disease entities can be individual disease entities used to construct training samples.
  • after receiving the model training instruction, the server obtains the disease entity matching dictionary and the candidate disease entities from a database, or receives the disease entity matching dictionary and the candidate disease entities from the terminal.
  • the application does not have high requirements on the scale of the disease entity matching dictionary, and a small-scale disease entity matching dictionary can meet the training requirements, which saves the labor cost and time cost of constructing the disease entity matching dictionary.
  • the above-mentioned disease entity matching dictionary may also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Step S202, the candidate disease entities are combined in pairs to obtain a set of candidate disease entity pairs.
  • the server combines the candidate disease entities in pairs to obtain multiple candidate disease entity pairs, and all the candidate disease entity pairs constitute a candidate disease entity pair set. For example, when there are 100 candidate disease entities, pairwise combination yields C(100, 2) = (100 × 99) / 2 = 4950 candidate disease entity pairs, and these 4950 candidate disease entity pairs constitute the candidate disease entity pair set.
  • Step S203 randomly extract candidate disease entity pairs from the candidate disease entity pair set.
  • the server may not necessarily use the entire set of candidate disease entity pairs for training.
  • the set size of candidate disease entity pairs will also be larger.
  • the server may randomly extract a preset number of candidate disease entity pairs from the candidate disease entity pair set.
  • Step S204, taking the extracted candidate disease entity pairs as negative samples and the matching disease entity pairs as positive samples, and inputting the positive samples and the negative samples into the initial disease entity matching model, wherein the initial disease entity matching model is a pre-trained BERT model.
  • the samples input by the server into the initial disease entity matching model include both positive samples and negative samples, so as to fully train the initial disease entity matching model; the extracted candidate disease entity pairs are used as the negative samples, and the matching disease entity pairs in the disease entity matching dictionary are used as the positive samples.
  • the server inputs the positive samples and negative samples into the initial disease entity matching model, and the initial disease entity matching model can be a pre-trained BERT (Bidirectional Encoder Representation from Transformers) model.
  • in one embodiment, before the above step S205, the method may further include: acquiring a medical corpus data set; and inputting the medical corpus data set into the BERT model for pre-training to obtain the initial disease entity matching model.
  • the medical corpus data set may be a data set composed of medical corpus information.
  • the server obtains a medical corpus data set, and the medical corpus information in the medical corpus data set may come from various medical disease fields.
  • the server pre-trains the BERT model according to the medical corpus data set to obtain the initial disease entity matching model.
  • the BERT model learns rich semantic information, so that the initial disease entity matching model can be effectively trained even when the sample size is limited, and after training, it can achieve high matching accuracy when facing disease entities in different fields.
  • the Masked language model is used in the BERT model to overcome the one-way limitation of pre-training from left to right and the inability to utilize the contextual information.
  • the masked language model can represent the fusion contextual information.
  • the masking language model randomly replaces a certain proportion of tokens (units in natural language processing, such as words) with masks (masks), and then sends the output of the last hidden layer corresponding to the mask to softmax (logistic regression) layer, used to predict the original string corresponding to the masked token.
  • the BERT model transfers a large number of operations done in downstream natural language processing tasks to the pre-trained word vector.
  • a classifier is added on the basis of the word vector. For example, for sentence pair or entity pair classification tasks, on the basis of pre-training and fine-tuning according to downstream tasks, the BERT model obtains the representation of the last layer, plus the softmax layer to predict the probability.
  • the representation of the last layer can learn semantic-level information and utilize the information of the previous layers.
  • the BERT model is trained through the medical corpus data set, so that the BERT model learns rich semantic information and ensures the accuracy of disease entity matching.
  • step S205 an initial disease entity matching model is trained according to the positive samples and the negative samples to obtain a disease entity matching model.
  • the server inputs positive samples and negative samples into the initial disease entity matching model, and the initial disease entity matching model outputs matching prediction results respectively according to the input samples, and the matching prediction results may be a binary classification result.
  • the initial disease entity matching model calculates the model loss according to the matching prediction result and the sample label, where the sample label of the positive sample takes one value, and the sample label of the negative sample takes another value.
  • the server adjusts the parameters of the initial disease entity matching model with the goal of reducing the model loss, and then continues to train the initial disease entity matching model according to the positive samples and negative samples, until the model converges, and the disease entity matching model is obtained.
  • the model loss can be calculated according to the Focal Loss loss function.
  • Step S206 acquiring the entity to be matched.
  • the entity to be matched is the input disease entity, which is used for disease entity matching.
  • after the disease entity matching model is obtained, disease entity matching can be performed.
  • the user can input the entity to be matched through the terminal, and the terminal sends the entity to be matched to the server.
  • Step S207 input the entity to be matched into the disease entity matching model to perform entity matching, and obtain an entity matching result.
  • the server inputs the entity to be matched into the disease entity matching model; the disease entity matching model can perform entity matching for a single entity to be matched and output the matched disease entity as the matching result, and it can also process multiple entities to be matched and output the matched disease entity pairs among the multiple entities to be matched as the entity matching result.
  • the candidate disease entities are combined in pairs to construct negative samples, and the disease entity matching dictionary is used as a positive sample; the positive samples and negative samples are input into the initial disease entity matching model
  • the initial disease entity matching model can be a pre-trained BERT model with rich semantic information.
  • when the training sample size is small, accurate matching results can still be obtained, which shortens the training time and improves the training efficiency of the disease entity matching model; after the training is completed, the disease entity matching model can perform entity matching on the input entities to be matched, which improves the efficiency of disease entity matching.
  • further, before the above step S201, the method may further include: acquiring disease corpus information; identifying matching disease entity pairs in the disease corpus information through semantic information; and constructing the disease entity matching dictionary based on the identified matching disease entity pairs.
  • the disease corpus information may be disease-related corpus information.
  • the server obtains disease corpus information, and the disease corpus information can be obtained through a crawler. Crawlers can crawl disease-related entry pages to obtain disease corpus information.
  • the server performs semantic annotation on the disease corpus information according to the semantic knowledge base, and obtains matching disease entity pairs in the disease corpus information according to the semantic annotation result. For example, "Y1, also known as Y2" is recorded in a disease-related entry page, and the server obtains Y1 and Y2 through semantic information, which can be used as a matching disease entity pair. According to the identified matching disease entity pairs, the server can construct a disease entity matching dictionary.
  • Disease corpus information can also be manually selected and input into the server, and matching disease entity pairs can be manually annotated with disease corpus information.
  • the disease entity matching dictionary constructed based on the disease corpus information is used to train the initial disease entity matching model, which ensures the smooth realization of the model training.
  • step S203 may include: acquiring the complement of the candidate disease entity pair set with respect to the disease entity matching dictionary; randomly extracting a preset number of candidate disease entity pairs from the complement; calculating the entity similarity of the extracted candidate disease entity pairs; and screening the candidate disease entity pairs whose entity similarity is less than the similarity threshold.
  • the server first finds the complement of the candidate disease entity pair set with respect to the disease entity matching dictionary, thereby deleting the candidate disease entity pairs that already exist in the disease entity matching dictionary, and then extracts a preset number of candidate disease entity pairs from the complement.
  • the server calculates the entity similarity, which is the similarity between the two candidate disease entities in the candidate disease entity pair.
  • There are many methods for calculating the entity similarity, such as the Jaccard coefficient, N-Gram (also known as the N-gram model), Levenshtein distance (also known as text edit distance), and cosine similarity.
  • the server may use one of the above-mentioned methods alone, or may use a plurality of the above-mentioned methods in combination.
  • When the Jaccard coefficient is used, the candidate disease entities are split into individual characters, and the calculation formula is as follows: Jaccard(A, B) = len(A ∩ B) / len(A ∪ B), where A and B represent the candidate disease entities, Jaccard(A, B) represents the entity similarity, len(A ∩ B) represents the number of identical characters in A and B, and len(A ∪ B) represents the number of distinct characters required to compose A and B.
  • When the entity similarity is calculated through N-Gram, the candidate disease entity is split by length N into phrases, where the tail of the previous phrase is the head of the next phrase; for example, "糖尿病" (diabetes) is parsed as {"$糖", "糖尿", "尿病", "病$"}, where $ is a padding character and N generally takes the value 2 or 3. The entity similarity is then calculated with the following formula: Jaccard(M, N) = len(M ∩ N) / len(M ∪ N), where M and N represent the candidate disease entities, Jaccard(M, N) is the entity similarity between M and N, len(M ∩ N) represents the number of identical phrases in M and N, and len(M ∪ N) represents the number of distinct phrases required to compose M and N.
  • After obtaining the entity similarity, the server obtains the preset similarity threshold, compares the entity similarity with the similarity threshold, deletes the candidate disease entity pairs whose entity similarity is greater than or equal to the similarity threshold, and retains the candidate disease entity pairs whose entity similarity is less than the similarity threshold, so as to remove candidate disease entity pairs with higher similarity.
  • Candidate disease entity pairs will be used as negative samples.
  • Candidate disease entity pairs that already exist in the disease entity matching dictionary and candidate disease entity pairs with high entity similarity will have a negative impact on model training and need to be removed.
  • the candidate disease entity pairs with high similarity are removed, and the accuracy of the negative samples constructed according to the candidate disease entity pairs is ensured.
  • step S205 may include: splicing the positive samples and the negative samples respectively, and adding sample labels to obtain samples to be processed; inputting the samples to be processed into the network layers of the initial disease entity matching model to obtain representation vectors of the samples to be processed; performing calculation on the representation vectors and outputting matching prediction probabilities; calculating the model loss according to the matching prediction probabilities and the sample labels; and adjusting the model parameters of the initial disease entity matching model according to the model loss until the model converges, to obtain the disease entity matching model.
  • positive samples and negative samples are simultaneously input to the initial disease entity matching model.
  • the initial disease entity matching model handles positive samples and negative samples in the same way: a [SEP] character is added between the two candidate disease entities and they are spliced together, and then a [CLS] character and a [SEP] character are added to the beginning and the end of the spliced string, respectively; the server can also add sample labels, where the sample labels of the positive samples are the same and the sample labels of the negative samples are the same, to obtain the samples to be processed.
  • the samples to be processed are input into the network layer of the initial disease entity matching model, and a representation vector sequence_output of the samples to be processed is output.
  • the dimension of the representation vector may be 1*768.
  • the server performs matrix operations on the characterization vector, multiplies the bias matrix [1, 2], and adds the softmax (logistic regression) layer to obtain the matching prediction probability.
  • the matching prediction probability is a 1*2 vector, representing the probabilities that the two entities match and do not match, respectively.
  • the server calculates the cross entropy according to the matching prediction probability and the sample label to obtain the model loss, adjusts the model parameters of the initial disease entity matching model with the goal of reducing the model loss, and then retrains until the model converges to obtain the disease entity matching model. When the model converges, the model loss is less than the preset loss threshold.
  • in this embodiment, the samples are processed to output the matching prediction probabilities, the model loss is calculated according to the sample labels, and the model is fine-tuned according to the model loss until it converges; the obtained disease entity matching model can accurately judge whether disease entities match.
  • the above step S207 may include: obtaining a disease entity dictionary; combining the entity to be matched with each disease entity in the disease entity dictionary to obtain first entity pairs to be matched; inputting the first entity pairs to be matched into the disease entity matching model to obtain matched disease entity pairs; and determining, according to the matched disease entity pairs, the disease entity that matches the entity to be matched in the disease entity dictionary, and using the determined disease entity as the entity matching result.
  • the disease entity dictionary may be a dictionary that records disease entities.
  • a disease entity matching model can be used to match a single disease entity to be matched.
  • the user can input the entity to be matched through the terminal.
  • the server obtains the entity to be matched and reads the stored disease entity dictionary. A large number of disease entities are recorded in the disease entity dictionary, and the server combines the entities to be matched with each disease entity in the disease entity dictionary one by one to obtain multiple sets of first entity pairs to be matched.
  • the server inputs the first entity pair to be matched into the disease entity matching model to determine whether the entity to be matched in the first entity pair to be matched matches the disease entity, and if it can match, it will be marked as a matching disease entity pair.
  • the server takes the disease entity from the disease entity dictionary in the matched disease entity pair as the entity matching result, and outputs the entity matching result to the terminal to display the disease entity that matches the entity to be matched, so that the user does not need to search the Internet for disease entities related to the entity to be matched, which is convenient and efficient.
  • the server can also inquire whether the entity to be matched exists in the disease entity dictionary, and if not, the to-be-matched entity is added to the disease entity dictionary to expand the disease entity dictionary and improve the matching ability of the to-be-matched entity.
  • the disease entity matching model performs matching judgment between the entity to be matched and the disease entity in the disease entity dictionary, which can quickly realize entity matching of the entity to be matched.
  • the above step S207 may further include: combining the entities to be matched in pairs to obtain a second entity pair to be matched; inputting the second entity pair to be matched into a disease entity matching model to obtain a second entity pair to be matched The matched disease entity pair in the entity pair is matched, and the obtained matched disease entity pair is used as the entity matching result.
  • the disease entity matching model can also process multiple entities to be matched at the same time, and output matching disease entity pairs among the multiple entities to be matched.
  • the user can input multiple entities to be matched at the same time, the server firstly combines the multiple entities to be matched to obtain the second entity pair to be matched, and then inputs the second entity pair to be matched into the disease entity matching model. Quickly identify matching disease entity pairs existing in multiple entities to be matched, and output the obtained matching disease entity pairs as entity matching results to the terminal for display.
  • the entities to be matched are input into the disease entity matching model in pairs, so that all entity combinations can be quickly judged, which improves the matching efficiency.
  • the method for matching disease entities based on speech semantics in this application involves neural networks, machine learning and natural language processing in the field of artificial intelligence; in addition, it may also involve smart medical care in the field of smart cities.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of a device for matching disease entities based on speech semantics, and the embodiment of the device corresponds to the method embodiment shown in FIG. 2 .
  • the device can be specifically applied to various electronic devices.
  • the apparatus 300 for disease entity matching based on speech semantics in this embodiment includes: a first acquisition module 301, an entity combination module 302, an entity pair extraction module 303, a sample input module 304, a model training module 305, The second acquiring module 306 and the entity matching module 307, wherein:
  • the first obtaining module 301 is configured to obtain a disease entity matching dictionary and candidate disease entities; wherein, the disease entity matching dictionary includes matching disease entity pairs.
  • the entity combination module 302 is configured to perform pairwise combination of candidate disease entities to obtain a set of candidate disease entity pairs.
  • the entity pair extraction module 303 is configured to randomly extract candidate disease entity pairs from the set of candidate disease entity pairs.
  • the sample input module 304 is configured to use the extracted candidate disease entity pairs as negative samples and the matched disease entity pairs as positive samples, and input the positive samples and negative samples into the initial disease entity matching model; wherein, the initial disease entity matching model is to complete the prediction. Trained BERT model.
  • the model training module 305 is used for training an initial disease entity matching model according to the positive samples and negative samples to obtain a disease entity matching model.
  • the second obtaining module 306 is configured to obtain the entity to be matched.
  • the entity matching module 307 is configured to input the entity to be matched into the disease entity matching model to perform entity matching to obtain an entity matching result.
  • the candidate disease entities are combined in pairs to construct negative samples, and the disease entity matching dictionary is used as a positive sample; the positive samples and negative samples are input into the initial disease entity matching model
  • the initial disease entity matching model can be a pre-trained BERT model with rich semantic information.
  • when the training sample size is small, accurate matching results can still be obtained, which shortens the training time and improves the training efficiency of the disease entity matching model; after the training is completed, the disease entity matching model can perform entity matching on the input entities to be matched, which improves the efficiency of disease entity matching.
  • the above apparatus 300 for disease entity matching based on speech semantics further includes: an information acquisition module, an entity pair identification module, and a dictionary construction module, wherein:
  • the information acquisition module is used to acquire disease corpus information.
  • the entity pair recognition module is used to identify matching disease entity pairs in disease corpus information through semantic information.
  • a dictionary building module is used to construct a disease entity matching dictionary based on the identified matching disease entity pairs.
  • the disease entity matching dictionary constructed based on the disease corpus information is used to train the initial disease entity matching model, which ensures the smooth realization of the model training.
  • the entity pair extraction module 303 includes: a complement acquisition submodule, an entity pair extraction submodule, a similarity calculation submodule, and an entity pair screening submodule, wherein:
  • the complement obtaining submodule is used to obtain the complement of the candidate disease entity pair set in the disease entity matching dictionary.
  • the entity pair extraction submodule is used to randomly extract a preset number of candidate disease entity pairs from the complement set.
  • the similarity calculation submodule is used to calculate the entity similarity of the extracted candidate disease entity pairs.
  • the entity pair screening submodule is used to screen candidate disease entity pairs whose entity similarity is less than the similarity threshold.
  • the candidate disease entity pairs with high similarity are removed, and the accuracy of the negative samples constructed according to the candidate disease entity pairs is ensured.
  • the above-mentioned model training module 305 includes: a sample splicing sub-module, a sample input sub-module, a vector calculation sub-module, a loss calculation sub-module, and a parameter adjustment sub-module, wherein:
  • the sample splicing sub-module is used to splicing positive samples and negative samples respectively, and adding sample labels to obtain samples to be processed.
  • the sample input sub-module is used to input the sample to be processed into the network layer of the initial disease entity matching model to obtain the representation vector of the sample to be processed.
  • the vector calculation sub-module is used to calculate the characterization vector and output the matching prediction probability.
  • the loss calculation submodule is used to calculate the model loss according to the matching prediction probability and sample label.
  • the parameter adjustment submodule is used to adjust the model parameters of the initial disease entity matching model according to the model loss, until the model converges, and the disease entity matching model is obtained.
  • in this embodiment, the samples are processed to output the matching prediction probabilities, the model loss is calculated according to the sample labels, and the model is fine-tuned according to the model loss until it converges; the obtained disease entity matching model can accurately judge whether disease entities match.
  • the above apparatus 300 for disease entity matching based on speech semantics further includes: a data set acquisition module and a data set input module, wherein:
  • the data set acquisition module is used to acquire the medical corpus data set.
  • the dataset input module is used to input the medical corpus dataset into the BERT model for pre-training to obtain the initial disease entity matching model.
  • the BERT model is trained through the medical corpus data set, so that the BERT model learns rich semantic information and ensures the accuracy of disease entity matching.
  • the entity matching module 307 includes: a dictionary acquisition submodule, a first combination submodule, a first input submodule, and an entity determination submodule, wherein:
  • the dictionary obtaining submodule is used to obtain the disease entity matching dictionary.
  • the first combining submodule is used to combine the entity to be matched with each disease entity in the disease entity matching dictionary to obtain a first pair of entities to be matched.
  • the first input sub-module is configured to input the first entity pair to be matched into the disease entity matching model to obtain the matched disease entity pair.
  • the entity determination submodule is used to determine the disease entity matching the entity to be matched in the disease entity matching dictionary according to the matched disease entity pair, and use the determined disease entity as the entity matching result.
  • the disease entity matching model performs matching judgment between the entity to be matched and the disease entity in the disease entity dictionary, which can quickly realize entity matching of the entity to be matched.
  • the entity matching module 307 includes: a second combination sub-module and a second input sub-module, wherein:
  • the second combination sub-module is used for pairwise combination of entities to be matched to obtain a second pair of entities to be matched.
  • the second input submodule is configured to input the second entity pair to be matched into the disease entity matching model, obtain the matched disease entity pair in the second entity pair to be matched, and use the obtained matched disease entity pair as the entity matching result.
  • the entities to be matched are input into the disease entity matching model in pairs, so that all entity combinations can be quickly judged, which improves the matching efficiency.
  • FIG. 4 is a block diagram of a basic structure of a computer device according to this embodiment.
  • the computer device 4 includes a memory 41, a processor 42, and a network interface 43 that communicate with each other through a system bus. It should be noted that only the computer device 4 with components 41-43 is shown in the figure, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 41 includes at least one type of computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 41 may be an internal storage unit of the computer device 4 , such as a hard disk or a memory of the computer device 4 .
  • the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store the operating system and various application software installed on the computer device 4 , such as computer-readable instructions for a method for matching disease entities based on speech semantics.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. This processor 42 is typically used to control the overall operation of the computer device 4 . In this embodiment, the processor 42 is configured to execute computer-readable instructions stored in the memory 41 or process data, for example, computer-readable instructions for executing the method for matching disease entities based on speech semantics.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • the computer device provided in this embodiment can execute the steps of the above-mentioned method for matching disease entities based on speech semantics.
  • the steps of the method for matching disease entities based on speech semantics herein may be the steps in the methods for matching disease entities based on speech semantics in the above embodiments.
  • the candidate disease entities are combined in pairs to construct negative samples, and the disease entity matching dictionary is used as a positive sample; the positive samples and negative samples are input into the initial disease entity matching model
  • the initial disease entity matching model can be a pre-trained BERT model with rich semantic information.
  • when the training sample size is small, accurate matching results can still be obtained, which shortens the training time and improves the training efficiency of the disease entity matching model; after the training is completed, the disease entity matching model can perform entity matching on the input entities to be matched, which improves the efficiency of disease entity matching.
  • the present application also provides another embodiment, that is, to provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to The at least one processor is caused to perform the steps of the method for matching disease entities based on speech semantics as described above.
  • the candidate disease entities are combined in pairs to construct negative samples, and the disease entity matching dictionary is used as a positive sample; the positive samples and negative samples are input into the initial disease entity matching model
  • the initial disease entity matching model can be a pre-trained BERT model with rich semantic information.
  • when the training sample size is small, accurate matching results can still be obtained, which shortens the training time and improves the training efficiency of the disease entity matching model; after the training is completed, the disease entity matching model can perform entity matching on the input entities to be matched, which improves the efficiency of disease entity matching.
  • the method of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Signal Processing (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

A disease entity matching method, apparatus, computer device and storage medium based on speech semantics. The method comprises: acquiring a disease entity matching dictionary containing matching disease entity pairs, and candidate disease entities; combining the candidate disease entities in pairs to obtain a set of candidate disease entity pairs, and randomly extracting candidate disease entity pairs therefrom; taking the extracted candidate disease entity pairs as negative samples and the matching disease entity pairs as positive samples, and inputting the positive samples and the negative samples into an initial disease entity matching model for model training to obtain a disease entity matching model; and acquiring an entity to be matched and inputting it into the disease entity matching model to obtain an entity matching result. In addition, the method also relates to blockchain technology, and the disease entity matching dictionary may be stored in a blockchain. The method improves the efficiency of disease entity matching.

Description

Disease entity matching method and apparatus based on speech semantics, and computer device
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 10, 2020, with application number 202011080585.4 and entitled "Disease entity matching method and apparatus based on speech semantics, and computer device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a disease entity matching method and apparatus based on speech semantics, and a computer device.
Background
A medical record is an individual's health information recorded in medical activities, and disease entities, i.e. the names of the diseases a patient suffers from, are recorded in the medical record. A disease entity recorded in a medical record may have multiple expressions; for example, "强迫性障碍" and "强迫症" (two Chinese names for obsessive-compulsive disorder) refer to the same disease, so it is often necessary to judge whether two disease entities match.
In traditional disease entity matching, some matching is judged manually; when there are many disease entities, manual judgment takes a great deal of time and is inefficient. Some matching is performed with the help of a computer, for example attribute matching or context matching of disease entities. However, the inventor realized that these matching techniques all require a large-scale disease corpus to be acquired in advance and place high requirements on corpus quality, so the collection and preprocessing of the corpus take a long time, and the efficiency of disease entity matching is still low.
Summary
The purpose of the embodiments of the present application is to provide a disease entity matching method, apparatus, computer device and storage medium based on speech semantics, so as to solve the problem of low disease entity matching efficiency.
In order to solve the above technical problem, an embodiment of the present application provides a disease entity matching method based on speech semantics, which adopts the following technical solution:
acquiring a disease entity matching dictionary and candidate disease entities, wherein the disease entity matching dictionary includes matching disease entity pairs;
combining the candidate disease entities in pairs to obtain a set of candidate disease entity pairs;
randomly extracting candidate disease entity pairs from the set of candidate disease entity pairs;
taking the extracted candidate disease entity pairs as negative samples and the matching disease entity pairs as positive samples, and inputting the positive samples and the negative samples into an initial disease entity matching model, wherein the initial disease entity matching model is a pre-trained BERT model;
training the initial disease entity matching model according to the positive samples and the negative samples to obtain a disease entity matching model;
acquiring an entity to be matched; and
inputting the entity to be matched into the disease entity matching model for entity matching to obtain an entity matching result.
In order to solve the above technical problem, an embodiment of the present application further provides a disease entity matching apparatus based on speech semantics, which adopts the following technical solution:
a first acquisition module, configured to acquire a disease entity matching dictionary and candidate disease entities, wherein the disease entity matching dictionary includes matching disease entity pairs;
an entity combination module, configured to combine the candidate disease entities in pairs to obtain a set of candidate disease entity pairs;
an entity pair extraction module, configured to randomly extract candidate disease entity pairs from the set of candidate disease entity pairs;
a sample input module, configured to take the extracted candidate disease entity pairs as negative samples and the matching disease entity pairs as positive samples, and input the positive samples and the negative samples into an initial disease entity matching model, wherein the initial disease entity matching model is a pre-trained BERT model;
a model training module, configured to train the initial disease entity matching model according to the positive samples and the negative samples to obtain a disease entity matching model;
a second acquisition module, configured to acquire an entity to be matched; and
an entity matching module, configured to input the entity to be matched into the disease entity matching model for entity matching to obtain an entity matching result.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
acquiring a disease entity matching dictionary and candidate disease entities, wherein the disease entity matching dictionary includes matching disease entity pairs;
combining the candidate disease entities in pairs to obtain a set of candidate disease entity pairs;
randomly extracting candidate disease entity pairs from the set of candidate disease entity pairs;
taking the extracted candidate disease entity pairs as negative samples and the matching disease entity pairs as positive samples, and inputting the positive samples and the negative samples into an initial disease entity matching model, wherein the initial disease entity matching model is a pre-trained BERT model;
training the initial disease entity matching model according to the positive samples and the negative samples to obtain a disease entity matching model;
acquiring an entity to be matched; and
inputting the entity to be matched into the disease entity matching model for entity matching to obtain an entity matching result.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the following steps are implemented when the computer-readable instructions are executed by a processor:
acquiring a disease entity matching dictionary and candidate disease entities, wherein the disease entity matching dictionary includes matching disease entity pairs;
combining the candidate disease entities in pairs to obtain a set of candidate disease entity pairs;
randomly extracting candidate disease entity pairs from the set of candidate disease entity pairs;
taking the extracted candidate disease entity pairs as negative samples and the matching disease entity pairs as positive samples, and inputting the positive samples and the negative samples into an initial disease entity matching model, wherein the initial disease entity matching model is a pre-trained BERT model;
training the initial disease entity matching model according to the positive samples and the negative samples to obtain a disease entity matching model;
acquiring an entity to be matched; and
inputting the entity to be matched into the disease entity matching model for entity matching to obtain an entity matching result.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects: after the disease entity matching dictionary and the candidate disease entities are acquired, the candidate disease entities are combined in pairs to construct negative samples, and the disease entity matching dictionary is used as the positive samples; the positive samples and the negative samples are input into the initial disease entity matching model for sufficient training; the initial disease entity matching model may be a pre-trained BERT model with rich semantic information, so accurate matching results can be obtained even when the training sample size is small, which shortens the training time and improves the training efficiency of the disease entity matching model; after the training is completed, the disease entity matching model can perform entity matching on the input entities to be matched, which improves the efficiency of disease entity matching.
Brief Description of the Drawings
In order to illustrate the solutions in the present application more clearly, the drawings required in the description of the embodiments of the present application are briefly introduced below. Obviously, the drawings in the following description are some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;
FIG. 2 is a flowchart of an embodiment of the disease entity matching method based on speech semantics according to the present application;
FIG. 3 is a schematic structural diagram of an embodiment of the disease entity matching apparatus based on speech semantics according to the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present application; the terms used herein in the specification of the application are only for the purpose of describing specific embodiments and are not intended to limit the application; the terms "including" and "having" and any variations thereof in the specification, claims and above description of the drawings of the present application are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the specification and claims of the present application or the above drawings are used to distinguish different objects, rather than to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor does it refer to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings.
As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is a medium used to provide a communication link between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services, for example a background server that provides support for the pages displayed on the terminal devices 101, 102, 103.
It should be noted that the disease entity matching method based on speech semantics provided in the embodiments of the present application is generally performed by the server, and accordingly, the disease entity matching apparatus based on speech semantics is generally provided in the server. The present application can be applied in the field of medical science and technology.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
With continued reference to FIG. 2, a flowchart of an embodiment of the disease entity matching method based on speech semantics according to the present application is shown. The disease entity matching method based on speech semantics includes the following steps:
Step S201: acquiring a disease entity matching dictionary and candidate disease entities, wherein the disease entity matching dictionary includes matching disease entity pairs.
In this embodiment, the electronic device (for example, the server shown in FIG. 1) on which the disease entity matching method based on speech semantics runs may communicate with the terminal device through a wired connection or a wireless connection. It should be noted that the above wireless connection may include but is not limited to a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connection methods currently known or developed in the future.
The disease entity matching dictionary is used to record matching disease entity pairs; a matching disease entity pair may be a combination of matched disease entities. A candidate disease entity may be an individual disease entity used to construct training samples.
Specifically, after receiving a model training instruction, the server acquires the disease entity matching dictionary and the candidate disease entities from a database, or receives the disease entity matching dictionary and the candidate disease entities from the terminal. The present application does not place high requirements on the scale of the disease entity matching dictionary, and a small-scale disease entity matching dictionary can meet the training requirements, which saves the labor cost and time cost of constructing the disease entity matching dictionary.
It should be emphasized that, in order to further ensure the privacy and security of the above disease entity matching dictionary, the disease entity matching dictionary may also be stored in a node of a blockchain.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and so on.
Step S202: combining the candidate disease entities in pairs to obtain a set of candidate disease entity pairs.
Specifically, the server combines the candidate disease entities in pairs to obtain multiple candidate disease entity pairs, and all the candidate disease entity pairs constitute the candidate disease entity pair set. For example, when there are 100 candidate disease entities, pairwise combination yields C(100, 2) = (100 × 99) / 2 = 4950 candidate disease entity pairs, and these 4950 candidate disease entity pairs constitute the candidate disease entity pair set.
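The pairwise combination in step S202 can be illustrated with a short sketch. The following is a minimal example in plain Python; the candidate entity names are illustrative assumptions and not taken from the patent.

    from itertools import combinations

    # Illustrative candidate disease entities (not from the patent).
    candidate_entities = ["强迫症", "强迫性障碍", "糖尿病", "2型糖尿病"]

    # Pairwise combination: n candidates yield n*(n-1)/2 candidate disease entity pairs,
    # e.g. 100 candidates yield 100*99/2 = 4950 pairs.
    candidate_pairs = list(combinations(candidate_entities, 2))
    print(len(candidate_pairs), candidate_pairs[0])   # 6 ('强迫症', '强迫性障碍')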
Step S203: randomly extracting candidate disease entity pairs from the set of candidate disease entity pairs.
Specifically, the server does not necessarily need to use the entire set of candidate disease entity pairs for training. When there are many candidate disease entities, the set of candidate disease entity pairs will also be large. In order to increase the processing speed, the server may randomly extract a preset number of candidate disease entity pairs from the set of candidate disease entity pairs.
Step S204: taking the extracted candidate disease entity pairs as negative samples and the matching disease entity pairs as positive samples, and inputting the positive samples and the negative samples into an initial disease entity matching model, wherein the initial disease entity matching model is a pre-trained BERT model.
Specifically, the samples that the server inputs into the initial disease entity matching model include both positive samples and negative samples, so as to fully train the initial disease entity matching model; the extracted candidate disease entity pairs serve as the negative samples, and the matching disease entity pairs in the disease entity matching dictionary serve as the positive samples.
The server inputs the positive samples and the negative samples into the initial disease entity matching model, and the initial disease entity matching model may be a BERT (Bidirectional Encoder Representation from Transformers) model that has completed pre-training.
In one embodiment, before the above step S205, the method may further include: acquiring a medical corpus data set; and inputting the medical corpus data set into the BERT model for pre-training to obtain the initial disease entity matching model.
The medical corpus data set may be a data set composed of medical corpus information.
Specifically, the server acquires a medical corpus data set, and the medical corpus information in the medical corpus data set may come from various medical disease fields. The server pre-trains the BERT model according to the medical corpus data set to obtain the initial disease entity matching model. The BERT model learns rich semantic information, so that the initial disease entity matching model can be trained effectively even when the sample size is limited, and after training it can achieve a high matching accuracy when facing disease entities in different fields.
The BERT model uses a masked language model to overcome the one-way limitation of left-to-right pre-training and its inability to use the following context; the masked language model can produce representations that fuse contextual information.
The masked language model randomly replaces a certain proportion of tokens (units in natural language processing, for example words) with masks, and then feeds the output of the last hidden layer at the masked positions into a softmax (logistic regression) layer to predict the original strings corresponding to the masked tokens.
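As a rough illustration of the masking step described above, the sketch below uses the HuggingFace transformers library and the public bert-base-chinese checkpoint; the library, the checkpoint and the example sentence are assumptions for illustration, since the patent does not name a specific implementation.

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    # Any Chinese BERT checkpoint could stand in for the model being pre-trained.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")

    text = "病历中记录了病人所患疾病的名称"   # illustrative sentence from a medical corpus
    inputs = tokenizer(text, return_tensors="pt")

    # Randomly replace a proportion of tokens with [MASK] (15% is the proportion used by the original BERT).
    labels = inputs["input_ids"].clone()
    special = (labels == tokenizer.cls_token_id) | (labels == tokenizer.sep_token_id)
    mask = (torch.rand(labels.shape) < 0.15) & ~special
    inputs["input_ids"][mask] = tokenizer.mask_token_id

    # The hidden state at each masked position is fed to a softmax over the vocabulary to predict
    # the original token; the cross-entropy on those positions is the masked-language-model loss.
    labels[~mask] = -100                     # unmasked positions are ignored in the loss
    loss = model(**inputs, labels=labels).loss
    print(loss)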
The BERT model transfers a large number of operations that used to be done in downstream natural language processing tasks into the pre-trained word vectors; after word vectors are obtained through BERT, a classifier is added on top of the word vectors. For example, for sentence-pair or entity-pair classification tasks, on the basis of pre-training, fine-tuning is performed according to the downstream task: the BERT model takes the representation of the last layer and adds a softmax layer to predict probabilities. The representation of the last layer can learn semantic-level information and makes use of the information of the previous layers.
In this embodiment, the BERT model is trained with the medical corpus data set, so that the BERT model learns rich semantic information, which ensures the accuracy of disease entity matching.
Step S205: training the initial disease entity matching model according to the positive samples and the negative samples to obtain the disease entity matching model.
Specifically, the server inputs the positive samples and the negative samples into the initial disease entity matching model, and the initial disease entity matching model outputs a matching prediction result for each input sample; the matching prediction result may be a binary classification result.
The initial disease entity matching model calculates the model loss according to the matching prediction results and the sample labels, where the sample labels of the positive samples take one value and the sample labels of the negative samples take another value. The server adjusts the parameters of the initial disease entity matching model with the goal of reducing the model loss, and then continues to train the initial disease entity matching model according to the positive samples and the negative samples until the model converges, to obtain the disease entity matching model.
In one embodiment, the model loss may be calculated according to the Focal Loss loss function.
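The embodiment only names Focal Loss as one possible loss function; a minimal two-class sketch in PyTorch is given below. The alpha and gamma values are common defaults and are not specified in the patent.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """Focal loss for a two-class matching head.

        logits:  [batch, 2] raw scores output by the matching model
        targets: [batch]    0 = non-matching pair, 1 = matching pair
        """
        ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample cross-entropy
        p_t = torch.exp(-ce)                                     # probability of the true class
        return (alpha * (1.0 - p_t) ** gamma * ce).mean()        # down-weight easy samples

    # Example: one positive pair and one negative pair.
    logits = torch.tensor([[0.2, 1.5], [2.0, -1.0]])
    targets = torch.tensor([1, 0])
    print(focal_loss(logits, targets))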
Step S206: acquiring an entity to be matched.
The entity to be matched is an input disease entity, which is used for disease entity matching.
Specifically, after the disease entity matching model is obtained, disease entity matching can be performed. The user can input the entity to be matched through the terminal, and the terminal sends the entity to be matched to the server.
Step S207: inputting the entity to be matched into the disease entity matching model for entity matching to obtain an entity matching result.
Specifically, the server inputs the entity to be matched into the disease entity matching model. The disease entity matching model can perform entity matching for a single entity to be matched and output the matched disease entity as the matching result; it can also process multiple entities to be matched and output the matching disease entity pairs among the multiple entities to be matched as the entity matching result.
In this embodiment, after the disease entity matching dictionary and the candidate disease entities are acquired, the candidate disease entities are combined in pairs to construct negative samples, and the disease entity matching dictionary is used as the positive samples; the positive samples and the negative samples are input into the initial disease entity matching model for sufficient training. The initial disease entity matching model may be a pre-trained BERT model with rich semantic information, so accurate matching results can be obtained even when the training sample size is small, which shortens the training time and improves the training efficiency of the disease entity matching model; after the training is completed, the disease entity matching model can perform entity matching on the input entities to be matched, which improves the efficiency of disease entity matching.
Further, before the above step S201, the method may further include: acquiring disease corpus information; identifying matching disease entity pairs in the disease corpus information through semantic information; and constructing the disease entity matching dictionary based on the identified matching disease entity pairs.
The disease corpus information may be disease-related corpus information.
Specifically, the server acquires disease corpus information, and the disease corpus information can be obtained through a crawler. The crawler can crawl disease-related entry pages to obtain the disease corpus information. The server performs semantic annotation on the disease corpus information according to a semantic knowledge base, and obtains matching disease entity pairs in the disease corpus information according to the semantic annotation results. For example, a disease-related entry page records "Y1, also known as Y2"; through the semantic information the server learns that Y1 and Y2 can serve as a matching disease entity pair. Based on the identified matching disease entity pairs, the server can construct the disease entity matching dictionary.
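One simple way to pull "Y1, also known as Y2" statements out of crawled entry pages is a pattern match over the text. The regular expression and the sample sentence below are illustrative assumptions; the patent only says that matching pairs are identified through semantic information.

    import re

    # Illustrative snippet of crawled disease entry text.
    corpus_text = "强迫性障碍，又名强迫症，是一种常见的精神障碍。"

    # Capture "<entity A>，又名<entity B>，" statements as candidate matching disease entity pairs.
    pattern = re.compile(r"([\u4e00-\u9fa5]+)，又名([\u4e00-\u9fa5]+)，")
    matching_pairs = pattern.findall(corpus_text)
    print(matching_pairs)   # [('强迫性障碍', '强迫症')] -> entries for the disease entity matching dictionary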
疾病语料信息也可以人工选取并输入服务器,匹配疾病实体对可以由人工对疾病语料信息进行标注。
本实施例中,基于疾病语料信息构建的疾病实体匹配词典用于训练初始疾病实体匹配模型,保证了模型训练的顺利实现。
进一步的,上述步骤S203可以包括:获取候选疾病实体对集合在疾病实体匹配词典中的补集;从补集中随机抽取预设数量的候选疾病实体对;计算抽取到的候选疾病实体对的实体相似度;筛选实体相似度小于相似度阈值的候选疾病实体对。
具体地,服务器先求候选疾病实体对集合在疾病实体匹配词典中的补集,从而删除已经存在于疾病实体匹配词典中的候选疾病实体对,再从补集中抽取预设数量的候选疾病实体对。
服务器计算实体相似度,实体相似度是候选疾病实体对中两个候选疾病实体间的相似度。实体相似度的计算有多种方法,例如通过Jaccard系数、N-Gram(又称N元模型)、Levenshtein距离(也称文本编辑距离)、余弦相似度等方法计算实体相似度。服务器可以单独采用上述的一种方法,也可以综合采用上述方法中的多种。
其中,采用Jaccard系数时,将候选疾病实体以字符为单位进行划分,计算公式如下:
Jaccard(A,B)=len(A∩B)/len(A∪B)
其中,A和B表示候选疾病实体,Jaccard(A,B)表示实体相似度,len(A∩B)表示A与B中相同字符的个数,len(A∪B)表示组成A与B所需的非重复字符的个数。
在通过N-Gram计算实体相似度时,将候选疾病实体按长度n切分得到词组,其中,上一个词组的尾部为下一个词组的头部,例如,将“糖尿病”解析为{“$糖”,“糖尿”,“尿病”,“病$”},其中$为填充字符,n值一般取2或者3。再以如下公式计算实体相似度:
Jaccard(M,N)=len(M∩N)/len(M∪N)
其中,M和N表示候选疾病实体,Jaccard(M,N)是M与N之间的实体相似度;len(M∩N)表示M与N中相同词组的个数,len(M∪N)表示组成M与N所需的非重复词组的个数。
当采用Levenshtein距离时,Levenshtein距离越小,实体相似度越高。
得到实体相似度之后,服务器获取预设的相似度阈值,将实体相似度与相似度阈值相比较,删去实体相似度大于或等于相似度阈值的候选疾病实体对,保留实体相似度小于相似度阈值的候选疾病实体对,以去除具有较高相似度的候选疾病实体对。
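上述“求补集、随机抽取、按实体相似度过滤”的流程可用如下Python代码示意(并非本申请原文内容;示例中仅以字符级Jaccard系数进行过滤,实际也可以按上文所述改用或叠加N-Gram、Levenshtein距离等度量,抽取数量与相似度阈值均为假设值):

```python
import random

def char_jaccard(a, b):
    """按字符计算两个候选疾病实体的Jaccard相似度。"""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def ngram_jaccard(a, b, n=2):
    """按长度n切分词组(首尾以$填充)后计算Jaccard相似度。"""
    pad = lambda s: f"${s}$"
    grams = lambda s: {pad(s)[i:i + n] for i in range(len(pad(s)) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def sample_negative_pairs(candidate_pairs, match_dict, num, threshold=0.5, seed=0):
    """先求候选疾病实体对集合在疾病实体匹配词典中的补集,再随机抽取并按实体相似度过滤。"""
    normalized_dict = {tuple(sorted(p)) for p in match_dict}
    complement = [p for p in candidate_pairs if tuple(sorted(p)) not in normalized_dict]
    random.seed(seed)
    sampled = random.sample(complement, min(num, len(complement)))
    # 仅保留实体相似度小于相似度阈值的候选疾病实体对,作为负样本
    return [p for p in sampled if char_jaccard(*p) < threshold]
```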
候选疾病实体对将作为负样本,已经存在于疾病实体匹配词典中的候选疾病实体对以及实体相似度较高的候选疾病实体对将对模型训练产生负面影响,需要进行去除。
本实施例中,通过对候选疾病实体对集合求补集,以及计算实体对相似度,从而去除相似度较高的候选疾病实体对,保证了根据候选疾病实体对构建的负样本的准确性。
进一步的,上述步骤S205可以包括:将正样本和负样本各自进行拼接,并添加样本标签,得到待处理样本;将待处理样本输入初始疾病实体匹配模型的网络层,得到待处理样本的表征向量;对表征向量进行计算,输出匹配预测概率;根据匹配预测概率和样本标签计算模型损失;根据模型损失调整初始疾病实体匹配模型的模型参数,直至模型收敛,得到疾病实体匹配模型。
具体地,正样本和负样本同时输入初始疾病实体匹配模型。初始疾病实体匹配模型对正样本和负样本的处理方式相同,在样本中的两个疾病实体间添加【SEP】字符,然后拼接在一起;再在拼接后的字符串首尾分别加【CLS】、【SEP】字符;服务器还可以添加样本标签,其中,正样本的样本标签一致,负样本的样本标签一致,得到待处理样本。
待处理样本被输入初始疾病实体匹配模型的网络层,输出待处理样本的表征向量sequence_output;在一个实施例中,表征向量的维度可以是1*768。服务器将表征向量与维度为768*2的权重矩阵相乘,再加上维度为1*2的偏置,经softmax层后得到匹配预测概率;匹配预测概率是1*2的向量,分别表示两个实体匹配和不匹配的概率。服务器根据匹配预测概率和样本标签计算交叉熵得到模型损失,以减小模型损失为目标调整初始疾病实体匹配模型的模型参数,然后重新进行训练,直至模型收敛,得到疾病实体匹配模型。当模型收敛时,模型损失小于预设的损失阈值。
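以下训练代码对上述流程作一个简化示意(并非本申请原文内容,假设使用PyTorch与HuggingFace transformers库;模型名称、学习率、标签取值等均为示例性假设,实际也可直接使用BertForSequenceClassification等封装好的句对分类模型):

```python
import torch
from torch import nn
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")   # 实际可替换为经医学语料预训练后的模型
classifier = nn.Linear(768, 2)                          # 权重形状768*2,偏置形状1*2
optimizer = torch.optim.AdamW(
    list(bert.parameters()) + list(classifier.parameters()), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()                         # 以交叉熵作为模型损失

def train_step(entity_pairs, labels):
    """entity_pairs为[(疾病实体1, 疾病实体2), ...];labels中正样本取1、负样本取0(取值仅为示例)。"""
    bert.train()
    left, right = zip(*entity_pairs)
    # tokenizer会将句子对自动拼接为 [CLS] 实体1 [SEP] 实体2 [SEP] 的形式
    encoded = tokenizer(list(left), list(right), padding=True,
                        truncation=True, return_tensors="pt")
    hidden = bert(**encoded).last_hidden_state[:, 0, :]  # 取[CLS]位置的1*768表征向量
    logits = classifier(hidden)                          # 线性层输出1*2,经softmax即匹配/不匹配概率
    loss = loss_fn(logits, torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

训练时可反复执行train_step,当模型损失小于预设的损失阈值时即认为模型收敛。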
本实施例中,对样本进行处理输出匹配预测概率,并根据样本标签计算模型损失,根据模型损失对模型进行微调直至模型收敛,得到的疾病实体匹配模型可以准确地进行疾病实体的匹配判断。
进一步的,在一个实施例中,上述步骤S207可以包括:获取疾病实体词典;将待匹配实体与疾病实体词典中的各疾病实体进行组合,得到第一待匹配实体对;将第一待匹配实体对输入疾病实体匹配模型,得到匹配疾病实体对;根据匹配疾病实体对,在疾病实体词典中确定与待匹配实体相匹配的疾病实体,并将确定的疾病实体作为实体匹配结果。
其中,疾病实体词典可以是记录疾病实体的词典。
具体地,可以使用疾病实体匹配模型进行单个待匹配疾病实体的匹配。用户可以通过终端输入待匹配实体。服务器获取待匹配实体,并读取存储的疾病实体词典。疾病实体词典中记录了大量的疾病实体,服务器将待匹配实体与疾病实体词典中的各疾病实体逐一组合,得到多组第一待匹配实体对。服务器将第一待匹配实体对输入疾病实体匹配模型,以判断第一待匹配实体对中的待匹配实体与疾病实体是否匹配,若可以匹配,将被标记为匹配疾病实体对。服务器将匹配疾病实体对中来自疾病实体词典的疾病实体作为实体匹配结果,并将实体匹配结果输出至终端,以展示与待匹配实体相匹配的疾病实体,使得用户无需再从互联网中搜索、查找与待匹配实体相关的疾病实体,方便高效。
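该匹配流程可示意如下(并非本申请原文内容;predict_fn为假设的接口,表示用上文训练得到的疾病实体匹配模型对一批实体对输出匹配概率,匹配阈值0.5与批大小亦为假设):

```python
def match_entity(query, disease_dictionary, predict_fn, batch_size=32):
    """将待匹配实体与疾病实体词典中的各疾病实体逐一组合,返回被模型判定匹配的疾病实体。"""
    results = []
    entities = list(disease_dictionary)
    for i in range(0, len(entities), batch_size):
        batch = entities[i:i + batch_size]
        pairs = [(query, entity) for entity in batch]     # 第一待匹配实体对
        probs = predict_fn(pairs)                          # 每个实体对的匹配概率
        results.extend(e for e, p in zip(batch, probs) if p >= 0.5)
    return results                                         # 作为实体匹配结果输出至终端
```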
服务器还可以查询待匹配实体是否存在于疾病实体词典,若不存在,则将待匹配实体补充到疾病实体词典中,以扩充疾病实体词典,提高对待匹配实体的匹配能力。
本实施例中,只需输入待匹配实体,疾病实体匹配模型将待匹配实体与疾病实体词典中的疾病实体一一进行匹配判断,可以快速地对待匹配实体实现实体匹配。
进一步的,在另一个实施例中,上述步骤S207还可以包括:对待匹配实体进行两两组合,得到第二待匹配实体对;将第二待匹配实体对输入疾病实体匹配模型,得到第二待匹配实体对中的匹配疾病实体对,并将得到的匹配疾病实体对作为实体匹配结果。
具体地,疾病实体匹配模型还可以同时对多个待匹配实体进行处理,输出多个待匹配实体中的匹配疾病实体对。
在应用时,用户可以同时输入多个待匹配实体,服务器先对多个待匹配实体进行两两组合得到第二待匹配实体对,然后将第二待匹配实体对输入疾病实体匹配模型,即可快速识别出多个待匹配实体中存在的匹配疾病实体对,并将得到的匹配疾病实体对作为实体匹配结果输出至终端进行展示。
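多个待匹配实体的处理可示意如下(并非本申请原文内容,predict_fn同样为假设的批量预测接口,匹配阈值为假设值):

```python
from itertools import combinations

def match_entity_pairs(query_entities, predict_fn):
    """对多个待匹配实体两两组合,由疾病实体匹配模型筛选出其中的匹配疾病实体对。"""
    pairs = list(combinations(query_entities, 2))          # 第二待匹配实体对
    probs = predict_fn(pairs)
    return [pair for pair, p in zip(pairs, probs) if p >= 0.5]

# 用法示例(返回值取决于模型预测):
# match_entity_pairs(["心肌梗死", "心肌梗塞", "糖尿病"], predict_fn)
```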
本实施例中,从多个待匹配实体中筛选匹配疾病实体对时,将待匹配实体两两组合输入疾病实体匹配模型,即可快速对所有实体组合进行判断,提高了匹配效率。
本申请中的基于语音语义的疾病实体匹配方法涉及人工智能领域中的神经网络、机器学习和自然语言处理;此外,还可以涉及智慧城市领域中的智慧医疗。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,该计算机可读指令可存储于一计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
进一步参考图3,作为对上述图2所示方法的实现,本申请提供了一种基于语音语义的疾病实体匹配装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。
如图3所示,本实施例所述的基于语音语义的疾病实体匹配装置300包括:第一获取模块301、实体组合模块302、实体对抽取模块303、样本输入模块304、模型训练模块305、第二获取模块306以及实体匹配模块307,其中:
第一获取模块301,用于获取疾病实体匹配词典以及候选疾病实体;其中,疾病实体匹配词典中包括匹配疾病实体对。
实体组合模块302,用于对候选疾病实体进行两两组合,得到候选疾病实体对集合。
实体对抽取模块303,用于从候选疾病实体对集合中随机抽取候选疾病实体对。
样本输入模块304,用于以抽取到的候选疾病实体对作为负样本、匹配疾病实体对作为正样本,将正样本和负样本输入初始疾病实体匹配模型;其中,初始疾病实体匹配模型为完成预训练的BERT模型。
模型训练模块305,用于根据正样本和负样本训练初始疾病实体匹配模型,得到疾病实体匹配模型。
第二获取模块306,用于获取待匹配实体。
实体匹配模块307,用于将待匹配实体输入疾病实体匹配模型进行实体匹配,得到实体匹配结果。
本实施例中,获取疾病实体匹配词典以及候选疾病实体后,对候选疾病实体进行两两组合以构建负样本,以疾病实体匹配词典作为正样本;将正样本和负样本输入初始疾病实体匹配模型以进行充分训练,初始疾病实体匹配模型可以是完成预训练的BERT模型,具有丰富的语义信息,当训练样本规模较小时也可以获得精准的匹配效果,缩短了训练所需时间,提高了疾病实体匹配模型的训练效率;训练完成后,疾病实体匹配模型即可对输入的待匹配实体进行实体匹配,提高了疾病实体匹配的效率。
在本实施例的一些可选的实现方式中,上述基于语音语义的疾病实体匹配装置300还包括:信息获取模块、实体对识别模块以及词典构建模块,其中:
信息获取模块,用于获取疾病语料信息。
实体对识别模块,用于通过语义信息识别疾病语料信息中的匹配疾病实体对。
词典构建模块,用于基于识别到的匹配疾病实体对构建疾病实体匹配词典。
本实施例中,基于疾病语料信息构建的疾病实体匹配词典用于训练初始疾病实体匹配模型,保证了模型训练的顺利实现。
在本实施例的一些可选的实现方式中,上述实体对抽取模块303包括:补集获取子模块、实体对抽取子模块、相似计算子模块以及实体对筛选子模块,其中:
补集获取子模块,用于获取候选疾病实体对集合在疾病实体匹配词典中的补集。
实体对抽取子模块,用于从补集中随机抽取预设数量的候选疾病实体对。
相似计算子模块,用于计算抽取到的候选疾病实体对的实体相似度。
实体对筛选子模块,用于筛选实体相似度小于相似度阈值的候选疾病实体对。
本实施例中,通过对候选疾病实体对集合求补集,以及计算实体对相似度,从而去除相似度较高的候选疾病实体对,保证了根据候选疾病实体对构建的负样本的准确性。
在本实施例的一些可选的实现方式中,上述模型训练模块305包括:样本拼接子模块、样本输入子模块、向量计算子模块、损失计算子模块以及参数调整子模块,其中:
样本拼接子模块,用于将正样本和负样本各自进行拼接,并添加样本标签,得到待处理样本。
样本输入子模块,用于将待处理样本输入初始疾病实体匹配模型的网络层,得到待处理样本的表征向量。
向量计算子模块,用于对表征向量进行计算,输出匹配预测概率。
损失计算子模块,用于根据匹配预测概率和样本标签计算模型损失。
参数调整子模块,用于根据模型损失调整初始疾病实体匹配模型的模型参数,直至模型收敛,得到疾病实体匹配模型。
本实施例中,对样本进行处理输出匹配预测概率,并根据样本标签计算模型损失,根据模型损失对模型进行微调直至模型收敛,得到的疾病实体匹配模型可以准确地进行疾病实体的匹配判断。
在本实施例的一些可选的实现方式中,上述基于语音语义的疾病实体匹配装置300还包括:数据集获取模块以及数据集输入模块,其中:
数据集获取模块,用于获取医学语料数据集。
数据集输入模块,用于将医学语料数据集输入BERT模型以进行预训练,得到初始疾病实体匹配模型。
本实施例中,通过医学语料数据集对BERT模型进行训练,使得BERT模型学习到丰富的语义信息,保证了疾病实体匹配的准确性。
在本实施例的一些可选的实现方式中,上述实体匹配模块307包括:词典获取子模块、第一组合子模块、第一输入子模块以及实体确定子模块,其中:
词典获取子模块,用于获取疾病实体词典。
第一组合子模块,用于将待匹配实体与疾病实体词典中的各疾病实体进行组合,得到第一待匹配实体对。
第一输入子模块,用于将第一待匹配实体对输入疾病实体匹配模型,得到匹配疾病实体对。
实体确定子模块,用于根据匹配疾病实体对,在疾病实体词典中确定与待匹配实体相匹配的疾病实体,并将确定的疾病实体作为实体匹配结果。
本实施例中,只需输入待匹配实体,疾病实体匹配模型将待匹配实体与疾病实体词典中的疾病实体一一进行匹配判断,可以快速地对待匹配实体实现实体匹配。
在本实施例的另一些可选的实现方式中,上述实体匹配模块307包括:第二组合子模块以及第二输入子模块,其中:
第二组合子模块,用于对待匹配实体进行两两组合,得到第二待匹配实体对。
第二输入子模块,用于将第二待匹配实体对输入疾病实体匹配模型,得到第二待匹配实体对中的匹配疾病实体对,并将得到的匹配疾病实体对作为实体匹配结果。
本实施例中,从多个待匹配实体中筛选匹配疾病实体对时,将待匹配实体两两组合输入疾病实体匹配模型,即可快速对所有实体组合进行判断,提高了匹配效率。
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图4,图4为本实施例计算机设备基本结构框图。
所述计算机设备4包括通过系统总线相互通信连接的存储器41、处理器42、网络接口43。需要指出的是,图中仅示出了具有组件41-43的计算机设备4,但是应理解的是,并不要求实施所有示出的组件,可以替代地实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable Gate Array,FPGA)、数字信号处理器(Digital Signal Processor,DSP)、嵌入式设备等。
所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。
所述存储器41至少包括一种类型的计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性,所述计算机可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,所述存储器41可以是所述计算机设备4的内部存储单元,例如该计算机设备4的硬盘或内存。在另一些实施例中,所述存储器41也可以是所述计算机设备4的外部存储设备,例如该计算机设备4上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器41还可以既包括所述计算机设备4的内部存储单元也包括其外部存储设备。本实施例中,所述存储器41通常用于存储安装于所述计算机设备4的操作系统和各类应用软件,例如基于语音语义的疾病实体匹配方法的计算机可读指令等。此外,所述存储器41还可以用于暂时地存储已经输出或者将要输出的各类数据。
所述处理器42在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器42通常用于控制所述计算机设备4的总体操作。本实施例中,所述处理器42用于运行所述存储器41中存储的计算机可读指令或者处理数据,例如运行所述基于语音语义的疾病实体匹配方法的计算机可读指令。
所述网络接口43可包括无线网络接口或有线网络接口,该网络接口43通常用于在所述计算机设备4与其他电子设备之间建立通信连接。
本实施例中提供的计算机设备可以执行上述基于语音语义的疾病实体匹配方法的步骤。此处基于语音语义的疾病实体匹配方法的步骤可以是上述各个实施例的基于语音语义的疾病实体匹配方法中的步骤。
本实施例中,获取疾病实体匹配词典以及候选疾病实体后,对候选疾病实体进行两两组合以构建负样本,以疾病实体匹配词典作为正样本;将正样本和负样本输入初始疾病实体匹配模型以进行充分训练,初始疾病实体匹配模型可以是完成预训练的BERT模型,具有丰富的语义信息,当训练样本规模较小时也可以获得精准的匹配效果,缩短了训练所需时间,提高了疾病实体匹配模型的训练效率;训练完成后,疾病实体匹配模型即可对输入的待匹配实体进行实体匹配,提高了疾病实体匹配的效率。
本申请还提供了另一种实施方式,即提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行如上述的基于语音语义的疾病实体匹配方法的步骤。
本实施例中,获取疾病实体匹配词典以及候选疾病实体后,对候选疾病实体进行两两组合以构建负样本,以疾病实体匹配词典作为正样本;将正样本和负样本输入初始疾病实体匹配模型以进行充分训练,初始疾病实体匹配模型可以是完成预训练的BERT模型,具有丰富的语义信息,当训练样本规模较小时也可以获得精准的匹配效果,缩短了训练所需时间,提高了疾病实体匹配模型的训练效率;训练完成后,疾病实体匹配模型即可对输入的待匹配实体进行实体匹配,提高了疾病实体匹配的效率。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。
显然,以上所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例,附图中给出了本申请的较佳实施例,但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现,相反地,提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明,对于本领域的技术人员而言,其依然可以对前述各具体实施方式所记载的技术方案进行修改,或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构,直接或间接运用在其他相关的技术领域,均同理在本申请专利保护范围之内。

Claims (20)

  1. 一种基于语音语义的疾病实体匹配方法,包括下述步骤:
    获取疾病实体匹配词典以及候选疾病实体;其中,所述疾病实体匹配词典中包括匹配疾病实体对;
    对所述候选疾病实体进行两两组合,得到候选疾病实体对集合;
    从所述候选疾病实体对集合中随机抽取候选疾病实体对;
    以抽取到的候选疾病实体对作为负样本、所述匹配疾病实体对作为正样本,将所述正样本和所述负样本输入初始疾病实体匹配模型;其中,所述初始疾病实体匹配模型为完成预训练的BERT模型;
    根据所述正样本和所述负样本训练所述初始疾病实体匹配模型,得到疾病实体匹配模型;
    获取待匹配实体;
    将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果。
  2. 根据权利要求1所述的基于语音语义的疾病实体匹配方法,其中,在所述获取疾病实体匹配词典以及候选疾病实体的步骤之前还包括:
    获取疾病语料信息;
    通过语义信息识别所述疾病语料信息中的匹配疾病实体对;
    基于识别到的匹配疾病实体对构建疾病实体匹配词典。
  3. 根据权利要求1所述的基于语音语义的疾病实体匹配方法,其中,所述从所述候选疾病实体对集合中随机抽取候选疾病实体对的步骤包括:
    获取所述候选疾病实体对集合在所述疾病实体匹配词典中的补集;
    从所述补集中随机抽取预设数量的候选疾病实体对;
    计算抽取到的候选疾病实体对的实体相似度;
    筛选实体相似度小于相似度阈值的候选疾病实体对。
  4. 根据权利要求1所述的基于语音语义的疾病实体匹配方法,其中,所述根据所述正样本和所述负样本训练所述初始疾病实体匹配模型,得到疾病实体匹配模型的步骤包括:
    将所述正样本和所述负样本各自进行拼接,并添加样本标签,得到待处理样本;
    将所述待处理样本输入所述初始疾病实体匹配模型的网络层,得到所述待处理样本的表征向量;
    对所述表征向量进行计算,输出匹配预测概率;
    根据所述匹配预测概率和所述样本标签计算模型损失;
    根据所述模型损失调整所述初始疾病实体匹配模型的模型参数,直至模型收敛,得到疾病实体匹配模型。
  5. 根据权利要求1所述的基于语音语义的疾病实体匹配方法,其中,在所述以抽取到的候选疾病实体对作为负样本、所述匹配疾病实体对作为正样本,将所述正样本和所述负样本输入初始疾病实体匹配模型的步骤之前还包括:
    获取医学语料数据集;
    将所述医学语料数据集输入BERT模型以进行预训练,得到初始疾病实体匹配模型。
  6. 根据权利要求1所述的基于语音语义的疾病实体匹配方法,其中,所述将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果的步骤包括:
    获取疾病实体词典;
    将所述待匹配实体与所述疾病实体词典中的各疾病实体进行组合,得到第一待匹配实体对;
    将所述第一待匹配实体对输入所述疾病实体匹配模型,得到匹配疾病实体对;
    根据所述匹配疾病实体对,在所述疾病实体词典中确定与所述待匹配实体相匹配的疾病实体,并将确定的疾病实体作为实体匹配结果。
  7. 根据权利要求1所述的基于语音语义的疾病实体匹配方法,其中,所述将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果的步骤包括:
    对所述待匹配实体进行两两组合,得到第二待匹配实体对;
    将所述第二待匹配实体对输入所述疾病实体匹配模型,得到所述第二待匹配实体对中的匹配疾病实体对,并将得到的匹配疾病实体对作为实体匹配结果。
  8. 一种基于语音语义的疾病实体匹配装置,包括:
    第一获取模块,用于获取疾病实体匹配词典以及候选疾病实体;其中,所述疾病实体匹配词典中包括匹配疾病实体对;
    实体组合模块,用于对所述候选疾病实体进行两两组合,得到候选疾病实体对集合;
    实体对抽取模块,用于从所述候选疾病实体对集合中随机抽取候选疾病实体对;
    样本输入模块,用于以抽取到的候选疾病实体对作为负样本、所述匹配疾病实体对作为正样本,将所述正样本和所述负样本输入初始疾病实体匹配模型;其中,所述初始疾病实体匹配模型为完成预训练的BERT模型;
    模型训练模块,用于根据所述正样本和所述负样本训练所述初始疾病实体匹配模型,得到疾病实体匹配模型;
    第二获取模块,用于获取待匹配实体;
    实体匹配模块,用于将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果。
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:
    获取疾病实体匹配词典以及候选疾病实体;其中,所述疾病实体匹配词典中包括匹配疾病实体对;
    对所述候选疾病实体进行两两组合,得到候选疾病实体对集合;
    从所述候选疾病实体对集合中随机抽取候选疾病实体对;
    以抽取到的候选疾病实体对作为负样本、所述匹配疾病实体对作为正样本,将所述正样本和所述负样本输入初始疾病实体匹配模型;其中,所述初始疾病实体匹配模型为完成预训练的BERT模型;
    根据所述正样本和所述负样本训练所述初始疾病实体匹配模型,得到疾病实体匹配模型;
    获取待匹配实体;
    将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果。
  10. 根据权利要求9所述的计算机设备,其中,所述从所述候选疾病实体对集合中随机抽取候选疾病实体对的步骤包括:
    获取所述候选疾病实体对集合在所述疾病实体匹配词典中的补集;
    从所述补集中随机抽取预设数量的候选疾病实体对;
    计算抽取到的候选疾病实体对的实体相似度;
    筛选实体相似度小于相似度阈值的候选疾病实体对。
  11. 根据权利要求9所述的计算机设备,其中,所述根据所述正样本和所述负样本训练所述初始疾病实体匹配模型,得到疾病实体匹配模型的步骤包括:
    将所述正样本和所述负样本各自进行拼接,并添加样本标签,得到待处理样本;
    将所述待处理样本输入所述初始疾病实体匹配模型的网络层,得到所述待处理样本的表征向量;
    对所述表征向量进行计算,输出匹配预测概率;
    根据所述匹配预测概率和所述样本标签计算模型损失;
    根据所述模型损失调整所述初始疾病实体匹配模型的模型参数,直至模型收敛,得到疾病实体匹配模型。
  12. 根据权利要求9所述的计算机设备,其中,在所述以抽取到的候选疾病实体对作为负样本、所述匹配疾病实体对作为正样本,将所述正样本和所述负样本输入初始疾病实体匹配模型的步骤之前,所述处理器执行所述计算机可读指令时还实现如下步骤:
    获取医学语料数据集;
    将所述医学语料数据集输入BERT模型以进行预训练,得到初始疾病实体匹配模型。
  13. 根据权利要求9所述的计算机设备,其中,所述将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果的步骤包括:
    获取疾病实体词典;
    将所述待匹配实体与所述疾病实体词典中的各疾病实体进行组合,得到第一待匹配实体对;
    将所述第一待匹配实体对输入所述疾病实体匹配模型,得到匹配疾病实体对;
    根据所述匹配疾病实体对,在所述疾病实体词典中确定与所述待匹配实体相匹配的疾病实体,并将确定的疾病实体作为实体匹配结果。
  14. 根据权利要求9所述的计算机设备,其中,所述将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果的步骤包括:
    对所述待匹配实体进行两两组合,得到第二待匹配实体对;
    将所述第二待匹配实体对输入所述疾病实体匹配模型,得到所述第二待匹配实体对中的匹配疾病实体对,并将得到的匹配疾病实体对作为实体匹配结果。
  15. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机可读指令;其中,所述计算机可读指令被处理器执行时实现如下步骤:
    获取疾病实体匹配词典以及候选疾病实体;其中,所述疾病实体匹配词典中包括匹配疾病实体对;
    对所述候选疾病实体进行两两组合,得到候选疾病实体对集合;
    从所述候选疾病实体对集合中随机抽取候选疾病实体对;
    以抽取到的候选疾病实体对作为负样本、所述匹配疾病实体对作为正样本,将所述正样本和所述负样本输入初始疾病实体匹配模型;其中,所述初始疾病实体匹配模型为完成预训练的BERT模型;
    根据所述正样本和所述负样本训练所述初始疾病实体匹配模型,得到疾病实体匹配模型;
    获取待匹配实体;
    将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果。
  16. 根据权利要求15所述的计算机可读存储介质,其中,所述从所述候选疾病实体对集合中随机抽取候选疾病实体对的步骤包括:
    获取所述候选疾病实体对集合在所述疾病实体匹配词典中的补集;
    从所述补集中随机抽取预设数量的候选疾病实体对;
    计算抽取到的候选疾病实体对的实体相似度;
    筛选实体相似度小于相似度阈值的候选疾病实体对。
  17. 根据权利要求15所述的计算机可读存储介质,其中,所述根据所述正样本和所述负样本训练所述初始疾病实体匹配模型,得到疾病实体匹配模型的步骤包括:
    将所述正样本和所述负样本各自进行拼接,并添加样本标签,得到待处理样本;
    将所述待处理样本输入所述初始疾病实体匹配模型的网络层,得到所述待处理样本的表征向量;
    对所述表征向量进行计算,输出匹配预测概率;
    根据所述匹配预测概率和所述样本标签计算模型损失;
    根据所述模型损失调整所述初始疾病实体匹配模型的模型参数,直至模型收敛,得到疾病实体匹配模型。
  18. 根据权利要求15所述的计算机可读存储介质,其中,在所述以抽取到的候选疾病实体对作为负样本、所述匹配疾病实体对作为正样本,将所述正样本和所述负样本输入初始疾病实体匹配模型的步骤之前,所述计算机可读指令被处理器执行时还实现如下步骤:
    获取医学语料数据集;
    将所述医学语料数据集输入BERT模型以进行预训练,得到初始疾病实体匹配模型。
  19. 根据权利要求15所述的计算机可读存储介质,其中,所述将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果的步骤包括:
    获取疾病实体词典;
    将所述待匹配实体与所述疾病实体词典中的各疾病实体进行组合,得到第一待匹配实体对;
    将所述第一待匹配实体对输入所述疾病实体匹配模型,得到匹配疾病实体对;
    根据所述匹配疾病实体对,在所述疾病实体词典中确定与所述待匹配实体相匹配的疾病实体,并将确定的疾病实体作为实体匹配结果。
  20. 根据权利要求15所述的计算机可读存储介质,其中,所述将所述待匹配实体输入所述疾病实体匹配模型进行实体匹配,得到实体匹配结果的步骤包括:
    对所述待匹配实体进行两两组合,得到第二待匹配实体对;
    将所述第二待匹配实体对输入所述疾病实体匹配模型,得到所述第二待匹配实体对中的匹配疾病实体对,并将得到的匹配疾病实体对作为实体匹配结果。