WO2020220635A1 - Pharmaceutical drug classification method and apparatus, computer device and storage medium - Google Patents

Pharmaceutical drug classification method and apparatus, computer device and storage medium

Info

Publication number
WO2020220635A1
WO2020220635A1 PCT/CN2019/117240 CN2019117240W WO2020220635A1 WO 2020220635 A1 WO2020220635 A1 WO 2020220635A1 CN 2019117240 W CN2019117240 W CN 2019117240W WO 2020220635 A1 WO2020220635 A1 WO 2020220635A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
word vector
target feature
word
euclidean distance
Prior art date
Application number
PCT/CN2019/117240
Other languages
French (fr)
Chinese (zh)
Inventor
陈娴娴
阮晓雯
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Priority to SG11202008417RA
Publication of WO2020220635A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The embodiments of the application relate to the field of drug classification, and in particular to a drug classification method and apparatus, a computer device, and a storage medium.
  • Drug classification management is an internationally accepted management method. It divides drugs into prescription drugs and non-prescription drugs and makes corresponding management regulations based on the safety and effectiveness principles of drugs, according to their varieties, specifications, indications, dosages and routes of administration. Its significance is to ensure the safety of people's medication.
  • Existing drug classification models are mainly supervised models, which require a large amount of labor to label samples at an early stage.
  • The inventor realized that manual labeling is often inaccurate and the resulting classification incomplete, so a lot of manpower is needed for maintenance operations such as adding and modifying categories. As a result, drug classification is time-consuming and labor-intensive, and classification accuracy is low.
  • The embodiments of the present application provide a drug classification method and apparatus, a computer device, and a storage medium that can complete drug classification without manual labeling.
  • A technical solution adopted in the embodiments of this application is to provide a drug classification method, including: obtaining, according to the user's case information, a target feature word vector that characterizes the user's condition and the drugs used, wherein the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics after stop words are filtered from the text information; inputting the target feature word vector into a preset drug classification model, wherein the drug classification model is an unsupervised training model that performs clustering by calculating the distance between different feature word vectors; and classifying and labeling the used drugs according to the cluster set of the used drugs output by the drug classification model, wherein the content of the classification label is at least one high-frequency word in the cluster set of the used drugs.
  • An embodiment of the present application also provides a drug classification device, including:
  • an acquisition module for acquiring, according to the user's case information, a target feature word vector that characterizes the user's condition and the drugs used, wherein the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics after stop words are filtered from the text information;
  • a processing module for inputting the target feature word vector into a preset drug classification model, wherein the drug classification model is an unsupervised training model that performs clustering by calculating the distance between different feature word vectors; and
  • an execution module for classifying and labeling the used drugs according to the cluster set of the used drugs output by the drug classification model, wherein the content of the classification label is at least one high-frequency word in the cluster set of the used drugs.
  • An embodiment of the present application further provides a computer device including a memory and a processor.
  • The memory stores computer-readable instructions.
  • When the instructions are executed, the processor performs the steps of a drug classification method.
  • The drug classification method includes the following steps: obtaining, according to the user's case information, a target feature word vector that characterizes the user's condition and the drugs used, wherein the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics after stop words are filtered from the text information; inputting the target feature word vector into a preset drug classification model, wherein the drug classification model is an unsupervised training model that performs clustering by calculating the distance between different feature word vectors; and classifying and labeling the used drugs according to the cluster set of the used drugs output by the drug classification model, wherein the content of the classification label is at least one high-frequency word in the cluster set of the used drugs.
  • Embodiments of the present application also provide a storage medium storing computer-readable instructions.
  • The drug classification method carried out by these instructions includes the following steps: obtaining, according to the user's case information, a target feature word vector that characterizes the user's condition and the drugs used, wherein the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics after stop words are filtered from the text information; inputting the target feature word vector into a preset drug classification model, wherein the drug classification model is an unsupervised training model that performs clustering by calculating the distance between different feature word vectors; and classifying and labeling the used drugs according to the cluster set of the used drugs output by the drug classification model, wherein the content of the classification label is at least one high-frequency word in the cluster set of the used drugs.
  • the embodiments of the application can improve the efficiency of drug classification, and the use of case information can further strengthen the correspondence between drugs and disease conditions, and improve the accuracy of the classification results.
  • FIG. 1 is a schematic diagram of the basic flow of the drug classification method according to an embodiment of the application.
  • FIG. 2 is a schematic diagram of the process of collecting a first word vector through a neural network model according to an embodiment of the application.
  • FIG. 3 is a schematic diagram of the process of extracting word vectors through keyword sets according to an embodiment of the application.
  • FIG. 4 is a schematic diagram of the process of generating a first-level cluster set according to an embodiment of the application.
  • FIG. 5 is a schematic diagram of the process of generating a second-level cluster set according to an embodiment of the application.
  • FIG. 6 is a schematic diagram of the process of generating a third-level cluster set according to an embodiment of the application.
  • FIG. 7 is a schematic diagram of three-level classification according to an embodiment of the application.
  • FIG. 8 is a schematic diagram of the basic structure of a drug classification device according to an embodiment of the application.
  • FIG. 9 is a block diagram of the basic structure of a computer device according to an embodiment of the application.
  • Fig. 1 is a schematic diagram of the basic flow of the drug classification method in this embodiment.
  • A drug classification method includes:
  • S1100: Obtain, according to the user's case information, a target feature word vector that characterizes the user's condition and the drugs used, where the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics after stop words are filtered from the text information;
  • The user's behavior information is recorded throughout the entire process.
  • The recorded behavior information includes the user's medical condition, the drugs used, and the user's laboratory results.
  • The above behavior information is defined as the user's medical condition information.
  • the foregoing medical condition information is all text information, but is not limited to this. According to different specific application scenarios, in some embodiments, the medical condition information further includes: picture information and sound information.
  • the target feature word vector in the case information is vector information that characterizes the user's condition and the use of drugs.
  • The target feature word vector can be extracted through a neural network model that has been trained to a convergent state.
  • Alternatively, the target feature word vector can be extracted by calculating the word frequency of the keywords in the case information.
  • In this embodiment, word vectors are first extracted through the neural network model, then calculated by word frequency statistics, and finally the results of the two methods are combined to obtain the target feature word vector.
  • S1200 Input the target feature word vector into a preset drug classification model, where the drug classification model is an unsupervised training model that performs clustering by calculating the distance between different feature word vectors;
  • the target feature word vector is input into a preset drug classification model, where the drug classification model is an unsupervised training model for clustering by calculating the distance between different feature word vectors.
  • the drug classification model adopts an unsupervised model, and an unsupervised training model is used to cluster feature word vectors.
  • The unsupervised training model calculates the inter-class distance between different feature word vectors, sets a distance threshold as the measure, and clusters the feature word vectors whose inter-class distance is less than the distance threshold to generate a cluster set.
  • each different cluster set is a classification category of the medicine.
  • the calculation of the distance between classes is actually calculating the similarity of the condition information of different drugs.
  • the smaller the distance between classes the closer the efficacy of different drugs.
  • the greater the distance between classes the greater the difference in efficacy of different drugs. Therefore, different classification categories can achieve different cures or curative effects.
  • the classification categories are divided into different levels, and after the first level division is completed, further classification is performed in different clusters.
  • the method adopted is to reduce the value of the distance threshold, so that the feature word vectors in the cluster set are further distinguished.
  • Reducing the parameter value of the effective point spacing makes the intra-class distances of different feature word vectors converge further, and this convergence in turn increases the inter-class distance between feature word vectors. The differentiation between feature word vectors in the cluster therefore increases, which provides good conditions for further subdividing the categories within the cluster.
  • The cluster set is divided into 3 levels, but it is not limited to this. According to different specific application scenarios, the cluster set can be divided into 1, 2, 4, 5 or more levels.
  • S1300 Classify and label the used drugs according to the cluster set of the used drugs output by the drug classification model, wherein the classification label is at least one high-frequency word in the cluster set of the used drugs.
  • Each cluster set, and the drugs in the final-level set, are labeled.
  • The cluster set is labeled as follows: the word with the highest frequency in the case information of the drugs in the cluster set is extracted as the label name of the cluster set. In some embodiments, when multiple label names are needed, they are selected in turn in descending order of frequency of occurrence. To label a drug, the drug name is extracted directly from the case information.
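The labeling rule above can be illustrated with a short sketch. It is only an illustration under assumptions that are not in the source: case texts are whitespace-tokenized (Chinese text would need a word segmenter), and the function name `label_cluster`, its parameters, and the sample data are hypothetical.

```python
from collections import Counter

def label_cluster(case_texts, stop_words=frozenset(), n_labels=1):
    """Return the n_labels most frequent non-stop words in a cluster's case texts."""
    counts = Counter(
        word
        for text in case_texts
        for word in text.lower().split()  # assumption: whitespace tokenization
        if word not in stop_words
    )
    # most_common sorts by descending frequency, so several labels
    # are picked "in turn" from most to least frequent.
    return [word for word, _ in counts.most_common(n_labels)]

# Hypothetical case texts for the drugs of one cluster set.
cluster_cases = [
    "fever cough ibuprofen",
    "fever headache ibuprofen",
    "fever sore throat paracetamol",
]
print(label_cluster(cluster_cases, stop_words={"ibuprofen", "paracetamol"}))
```

Here the drug names are passed as stop words so that the label describes the condition rather than a drug; whether the application excludes drug names in this way is not stated.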
  • The name of each drug and the condition information corresponding to the drug can be obtained by collecting the user's case information. The condition information corresponding to the drug name is converted into the target feature word vector, which is then fed as input data to the unsupervised drug classification model.
  • The drug classification model clusters drugs that can cure the same or similar conditions together to form a cluster category, and this cluster category can serve as a drug classification category.
  • the drug classification is completed by labeling the names of the drugs in the classification category.
  • This classification method can improve the efficiency of drug classification, and the use of case information can further strengthen the correspondence between drugs and disease conditions, and improve the accuracy of the classification results.
  • FIG. 2 is a schematic diagram of the process of collecting the first word vector through the neural network model in this embodiment.
  • S1100 includes:
  • the case information is transformed into a vector set that can be recognized or processed by the neural network model.
  • the method used is to convert the case information into a vector set through the word2vec model.
  • In some embodiments, the case information can also be vectorized through TF-IDF (term frequency-inverse document frequency).
  • S1112: Input the behavior vector set into a preset feature extraction model, where the feature extraction model is a neural network model that is pre-trained to a convergent state and is used to extract, from the behavior vector set, the vectors that represent user behavior;
  • The converted vector set is input into the preset feature extraction model, a neural network model pre-trained to a convergent state that extracts the user behavior vectors from the behavior vector set.
  • the feature extraction model is used to extract word vectors associated with user medical information and drug information in the vector set.
  • The training method is as follows: collect a training sample set composed of vector sets converted from a number of pieces of case information, manually label the word vectors in each vector set, and then input the labeled vector sets into the neural network model in turn. After the neural network model extracts the excitation word vector, the distance between the excitation word vector and the labeled word vector is calculated. If the distance is greater than the set distance threshold, the weights of the neural network model are corrected through the back propagation algorithm.
  • Otherwise, training on that vector set is passed. The vector sets in the training sample set are trained by the above method until the accuracy of the model reaches a set value (for example, 98%).
  • the feature extraction model trained to the convergent state can accurately extract the word vector associated with the user's condition information and drug information in the vector set, and the word vector is the user behavior vector.
  • the extracted user behavior vector can be used as the input data of the drug classification model.
  • the user behavior vector is defined as the first word vector.
  • the neural network model trained to convergence can quickly extract word vectors that record key information, simplifying the data processing procedures of the drug classification model, and improving the processing efficiency of the drug classification model.
  • FIG. 3 is a schematic diagram of the process of extracting word vectors through keyword sets in this embodiment.
  • Stop words are words that are filtered out.
  • A stop word list is set up to record the stop words obtained through statistics. For example, if verbs, adverbs, and adjectives are set as stop words, then filtering through the stop word list removes the words with those parts of speech from the case information. The case information with the stop words removed forms a keyword set, which records the user's condition information and drug information.
  • the word frequency of each keyword in the keyword set is calculated.
  • the calculation method of the word frequency is:
  • the inverse document frequency is used to determine the importance of each keyword.
  • The inverse document frequency is inversely related to how common a word is. The calculation method of the inverse document frequency is:
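The two formulas referred to above are missing from this copy of the text. For orientation only, the textbook definitions are given below; they are an assumption and may differ in detail (for example, in smoothing terms) from the ones used in the application. Here $n_{t,d}$ is the number of occurrences of keyword $t$ in keyword set $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}},
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this reading, the product $\mathrm{tfidf}(t, d)$ would play the role of the priority value of a keyword. That identification is also an assumption.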
  • The keywords are sorted in descending order of priority, and the top-ranked keywords are selected as the keywords to be converted according to actual needs. For example, the top 20 keywords are extracted as the keywords to be converted.
  • The number of keywords to be converted is not limited to this. Depending on the specific application scenario, in some embodiments it can be any value.
  • S1124 Generate the second word vector according to the priority value of each keyword.
  • the keywords to be converted that have been filtered by the priority value are converted into the second word vector.
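The keyword selection steps above (stop-word filtering, word frequency, inverse document frequency, ranking by priority) can be sketched as follows. This is a minimal sketch, not the application's implementation: it assumes whitespace tokenization, the textbook tf and idf definitions, and a priority value equal to their product. The names `top_keywords`, `cases`, and `stop` are hypothetical.

```python
import math
from collections import Counter

def top_keywords(documents, stop_words, k=20):
    """Rank the keywords of documents[0] by tf * idf and return the top k pairs."""
    # Stop-word filtering turns each case text into a keyword list.
    keyword_sets = [
        [w for w in doc.lower().split() if w not in stop_words]
        for doc in documents
    ]
    target = keyword_sets[0]
    n_docs = len(keyword_sets)
    scores = {}
    for word, count in Counter(target).items():
        tf = count / len(target)                       # word frequency
        doc_freq = sum(1 for kws in keyword_sets if word in kws)
        idf = math.log(n_docs / doc_freq)              # rarer words score higher
        scores[word] = tf * idf                        # assumed priority value
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical case texts; the first one is the document being processed.
cases = [
    "patient has fever and cough takes ibuprofen",
    "patient has rash takes cetirizine",
    "patient has fever takes paracetamol",
]
stop = {"patient", "has", "and", "takes"}
for word, score in top_keywords(cases, stop, k=3):
    print(word, round(score, 3))
```

In this toy corpus "fever" appears in two of the three cases, so it ranks below "cough" and "ibuprofen", which each appear in only one.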
  • The first word vector is extracted by the neural network model. The association between the extracted word vectors and the text information essentially reflects people's subjective intent, because it is learned through repeated, targeted training. However, a neural network model has difficulty converging when it is cross-trained on multiple association relationships, so the extracted first word vector may be insufficiently comprehensive or may omit keyword vectors.
  • The second word vector is obtained by statistics after stop-word filtering. It involves no subjective intent, most directly reflects the distribution of each keyword, and extracts word vectors more comprehensively but without emphasis.
  • The target feature word vector generated by merging the two is more comprehensive: it highlights the feature word vectors that people care about while also incorporating the feature word vectors that exist objectively. The extracted data is therefore both comprehensive and focused, which helps improve the accuracy of the drug classification model. See step S1131.
  • The first word vector and the second word vector are merged as follows: the word vector matrix composed of the first word vectors is added to the word vector matrix composed of the second word vectors. The result is the target feature word vector matrix, which is the input data of the drug classification model.
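As an illustration of the merge by matrix addition, the sketch below adds two word vector matrices element by element. The source does not say how the two matrices are brought to the same shape, so equal shapes are an assumption here; the function name `merge_word_vectors` and the sample values are hypothetical.

```python
def merge_word_vectors(first, second):
    """Element-wise sum of two equally shaped word vector matrices (lists of rows)."""
    same_shape = len(first) == len(second) and all(
        len(a) == len(b) for a, b in zip(first, second)
    )
    if not same_shape:
        # Assumption: both matrices were already aligned to the same shape.
        raise ValueError("matrices must have the same shape")
    return [[x + y for x, y in zip(a, b)] for a, b in zip(first, second)]

# Hypothetical 2 x 2 matrices: one row per keyword, one column per dimension.
first_word_vectors = [[0.25, 0.5], [1.0, 0.0]]
second_word_vectors = [[0.25, 0.0], [0.5, 0.5]]
target_feature_matrix = merge_word_vectors(first_word_vectors, second_word_vectors)
print(target_feature_matrix)
```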
  • the drug classification model generates a first-level cluster set, and the cluster set of the target feature word vector needs to be judged by calculating the Euclidean distance between the target feature word vector and different feature word vectors.
  • FIG. 4 is a schematic diagram of the process of generating a first-level cluster set in this embodiment.
  • S1200 includes:
  • S1211 calculate the first Euclidean distance between the target feature word vector and different feature word vectors
  • The distance between the target feature word vector and the other feature word vectors needs to be calculated. Specifically, the Euclidean distance between the target feature word vector and each different feature word vector is calculated, and these Euclidean distances are collectively referred to as the first Euclidean distance. However, it is not limited to this. In some embodiments, the Mahalanobis distance or the cosine distance between the target feature word vector and the different feature word vectors is calculated instead.
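For reference, the three distance measures named above have the following standard definitions for two vectors $x$ and $y$ of dimension $n$. $S$ denotes the covariance matrix of the feature word vectors; the source does not say how it would be estimated.

```latex
d_{\mathrm{Euclid}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
\qquad
d_{\cos}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert},
\qquad
d_{\mathrm{Mahal}}(x, y) = \sqrt{(x - y)^{\top} S^{-1} (x - y)}
```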
  • the first Euclidean distance between the target feature word vector and the different feature word vectors is compared with the set first distance threshold.
  • the first distance threshold is a threshold for measuring whether the feature word vectors meet the first screening condition, for example, the value of the first distance threshold is 0.5.
  • When the first Euclidean distance between the target feature word vector and a feature word vector is less than the first distance threshold, the target feature word vector is clustered into the cluster set where that feature word vector is located. After the target feature word vectors of all the case information have been clustered, a first-level cluster set is generated, and the first-level cluster set is composed of at least one cluster set.
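The first-level clustering rule can be sketched as a single greedy pass: each vector joins the first existing cluster that contains a member closer than the threshold, and otherwise starts a new cluster. This is one simple reading of the text, not the application's algorithm; it is order dependent and never merges clusters. The function name `assign_to_clusters` and the sample points are hypothetical, and the threshold 0.5 is the example value from the text.

```python
import math

def assign_to_clusters(vectors, threshold):
    """Greedy threshold clustering on Euclidean distance.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            # Join the cluster of any feature vector closer than the threshold.
            if any(math.dist(vec, vectors[j]) < threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no close neighbour: start a new cluster
    return clusters

# Hypothetical 2-dimensional target feature word vectors.
points = [(0.0, 0.0), (0.3, 0.0), (2.0, 2.0), (2.2, 2.1), (0.1, 0.2)]
first_level = assign_to_clusters(points, threshold=0.5)
print(first_level)
```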
  • the drug classification model generates a secondary cluster set, and the secondary cluster set needs to be further refined clustering on the basis of the primary cluster set.
  • FIG. 5 is a schematic diagram of the process of generating a secondary cluster set in this embodiment.
  • The effective point spacing refers to the inter-class distances in each feature word vector that are not ignored. For the sake of computational efficiency, the intra-class distances need to be filtered before the inter-class distance is calculated.
  • The filtering method is to set the parameter value of the effective point spacing.
  • Any inter-class distance smaller than the effective point spacing is judged invalid. Therefore, decreasing the parameter value of the effective point spacing increases the diversity of the intra-class distances, reveals finer detail in each feature word vector, and increases the difference between feature word vectors in the same cluster, which is conducive to second-level clustering.
  • The parameter value of the effective point spacing after correction is the first parameter value.
  • The first parameter value is smaller than the parameter value of the effective point spacing that the drug classification model used for first-level clustering.
  • After the first parameter value is set, the drug classification model performs second-level clustering within each cluster of the first-level cluster set.
  • the second-level clustering method is: in the cluster set where the target feature word vector is located, the second Euclidean distance between the target feature word vector and other feature word vectors is calculated.
  • However, it is not limited to this. In some embodiments, the second Euclidean distance can be replaced by the Mahalanobis distance or the cosine distance between the target feature word vector and different feature word vectors.
  • the second Euclidean distance between the target feature word vector and different feature word vectors is compared with the set second distance threshold.
  • the second distance threshold is a threshold for measuring whether the feature word vectors meet the second screening condition, for example, the value of the second distance threshold is 0.1.
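  • Comparing the second Euclidean distance with the preset second distance threshold determines which feature word vector, or which type of feature word vector, the target feature word vector should be clustered with inside the cluster set where it is located.
  • When the second Euclidean distance between the target feature word vector and a feature word vector is less than the second distance threshold, the target feature word vector is clustered into the cluster set where that feature word vector is located.
  • In this way a second-level cluster set is generated, and the second-level cluster set is composed of at least one cluster set.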
  • the drug classification model generates a three-level cluster set, and the three-level cluster set needs to be further refined clustering on the basis of the two-level cluster set.
  • FIG. 6 is a schematic diagram of the process of generating a three-level cluster set in this embodiment.
  • The filtering method is to set the parameter value of the effective point spacing.
  • Any inter-class distance smaller than the effective point spacing is judged invalid. Therefore, decreasing the parameter value of the effective point spacing increases the diversity of the intra-class distances, reveals finer detail in each feature word vector, and increases the difference between feature word vectors in the same cluster, which is conducive to three-level clustering.
  • The parameter value of the effective point spacing after correction is the second parameter value. The second parameter value is smaller than the first parameter value.
  • After the second parameter value is set, the drug classification model performs three-level clustering within each cluster of the second-level cluster set.
  • the three-level clustering method is: in the cluster set where the target feature word vector is located, the third Euclidean distance between the target feature word vector and other feature word vectors is calculated. But it is not limited to this. In some embodiments, the calculation of the third Euclidean distance can be modified to calculate the Mahalanobis distance or the cosine distance between the target feature word vector and different feature word vectors.
  • the third Euclidean distance between the target feature word vector and different feature word vectors is compared with the set third distance threshold.
  • the third distance threshold is a threshold for measuring whether the feature word vectors meet the third filtering condition, for example, the value of the third distance threshold is 0.05.
  • Comparing the third Euclidean distance with the preset third distance threshold can determine the clustering set where the target feature word vector is located, and which feature word vector or type of feature word vector should be clustered with the target feature word vector.
  • When the third Euclidean distance between the target feature word vector and a feature word vector is less than the third distance threshold, the target feature word vector is clustered into the cluster set where that feature word vector is located.
  • In this way a three-level cluster set is generated, and the three-level cluster set is composed of at least one cluster set. This completes the three-level classification of drugs, but the number of classification levels is not limited to this.
  • In some embodiments, the parameter value of the effective point spacing and the distance threshold can be adjusted further to refine the classification further.
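The level-by-level refinement can be sketched by re-clustering inside each cluster of the previous level with a smaller distance threshold. The thresholds 0.5, 0.1 and 0.05 are the example values given in the text. Everything else is an assumption for illustration: the greedy clustering pass, the function names, and the sample points are hypothetical, and the sketch models only the shrinking distance threshold, not the effective point spacing parameter.

```python
import math

def threshold_clusters(vectors, indices, threshold):
    """Greedy threshold clustering restricted to the given indices."""
    clusters = []
    for i in indices:
        for cluster in clusters:
            if any(math.dist(vectors[i], vectors[j]) < threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def hierarchical_clusters(vectors, thresholds):
    """Return one list of clusters per level, refining each cluster in turn."""
    levels = []
    current = [list(range(len(vectors)))]  # start with everything in one set
    for threshold in thresholds:
        refined = []
        for cluster in current:
            # Second and later levels only compare vectors inside one parent cluster.
            refined.extend(threshold_clusters(vectors, cluster, threshold))
        levels.append(refined)
        current = refined
    return levels

# Hypothetical 2-dimensional target feature word vectors.
points = [(0.0, 0.0), (0.08, 0.0), (0.11, 0.0), (0.4, 0.0), (3.0, 3.0)]
levels = hierarchical_clusters(points, thresholds=(0.5, 0.1, 0.05))
for depth, clusters in enumerate(levels, start=1):
    print(f"level {depth}: {clusters}")
```

Because each level only splits clusters of the level above, the result forms a tree, which matches the dendrogram arrangement described for the three-level classification.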
  • FIG. 7 is a schematic diagram of the three-level classification in this embodiment.
  • the classification of drugs is divided into three levels, namely: a first-level cluster set 11, a second-level cluster set 12, and a third-level cluster set 13.
  • the cluster sets of three different levels are arranged in a dendrogram.
  • the embodiment of the present application also provides a medicine classification device.
  • FIG. 8 is a schematic diagram of the basic structure of the medicine classification device of this embodiment.
  • a medicine classification device includes: an acquisition module 2100, a processing module 2200, and an execution module 2300.
  • The acquisition module 2100 is configured to acquire, according to the user's case information, the target feature word vector that characterizes the user's condition and the drugs used, where the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics after stop words are filtered from the text information;
  • The processing module 2200 is configured to input the target feature word vector into the preset drug classification model, where the drug classification model is an unsupervised training model that clusters by calculating the distance between different feature word vectors;
  • The execution module 2300 is configured to classify and label the used drugs according to the cluster set of the used drugs output by the drug classification model, wherein the content of the classification label is at least one high-frequency word in the cluster set of the used drugs.
  • When the drug classification device classifies drugs, it obtains the name of each drug and the condition information that the drug treats by collecting the user's case information, converts the drug name and the corresponding condition information into a target feature word vector, and inputs the target feature word vector into the unsupervised drug classification model.
  • The drug classification model clusters drugs that can cure the same or similar conditions together to form a cluster category, and this cluster category becomes one category of the drug classification.
  • Finally, the drug classification is completed by labeling the names of the drugs in the classification category.
  • This classification method can improve the efficiency of drug classification, and the use of case information can further strengthen the correspondence between drugs and disease conditions, and improve the accuracy of the classification results.
  • the target feature word vector includes: a first word vector
  • the medicine classification device includes: a first conversion submodule, a first processing submodule, and a first execution submodule.
  • The first conversion sub-module is used to convert the case information into a behavior vector set;
  • The first processing sub-module is used to input the behavior vector set into a preset feature extraction model, where the feature extraction model is a neural network model, pre-trained to a convergent state, that extracts the user behavior vector from the behavior vector set;
  • the first execution sub-module is used to read the user behavior vector output by the feature extraction model, and define the user behavior vector as the first word vector.
  • the target feature word vector includes: a second word vector
  • the drug classification device includes: a first filtering submodule, a second processing submodule, a first calculation submodule, and a second execution submodule.
  • The first filtering submodule is used to filter the case information through a preset stop word list to generate a keyword set;
  • The second processing submodule is used to count the word frequency and the inverse document frequency of each keyword in the keyword set;
  • The first calculation sub-module is used to calculate the priority value of each keyword from its word frequency and inverse document frequency;
  • the second execution sub-module is used to generate the second word vector according to the priority value of each keyword.
  • the drug classification device includes: a first merging sub-module for merging the first word vector and the second word vector to generate the target feature word vector.
  • the drug classification device includes: a first calculation submodule, a first comparison submodule, and a third execution submodule.
  • The first calculation sub-module is used to calculate the first Euclidean distance between the target feature word vector and different feature word vectors;
  • The first comparison sub-module is used to compare the first Euclidean distance with a preset first distance threshold;
  • The third execution sub-module is used to cluster the target feature vector into the cluster set represented by the first Euclidean distance to generate a first-level cluster set when the first Euclidean distance is less than the first distance threshold.
  • the drug classification device includes: a second calculation submodule, a second comparison submodule, and a fourth execution submodule.
  • The second calculation sub-module is used to correct the parameter value of the effective point spacing in the drug classification model to generate the first parameter value, and to calculate the second Euclidean distance between the target feature word vector and the different feature word vectors in the first-level cluster set;
  • The second comparison sub-module is used to compare the second Euclidean distance with a preset second distance threshold, where the second distance threshold is less than the first distance threshold; the fourth execution sub-module is used to cluster the target feature vector into the cluster set represented by the second Euclidean distance to generate a second-level cluster set when the second Euclidean distance is less than the second distance threshold.
  • the drug classification device includes: a third calculation submodule, a third comparison submodule, and a fifth execution submodule.
  • The third calculation sub-module is used to correct the parameter value of the effective point spacing in the drug classification model to generate the second parameter value, and to calculate the third Euclidean distance between the target feature word vector and the different feature word vectors in the second-level cluster set;
  • The third comparison sub-module is used to compare the third Euclidean distance with a preset third distance threshold, where the third distance threshold is less than the second distance threshold; the fifth execution sub-module is used to cluster the target feature vector into the cluster set represented by the third Euclidean distance to generate a third-level cluster set when the third Euclidean distance is less than the third distance threshold.
  • FIG. 9 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer equipment includes a processor, a storage medium, a memory, and a network interface connected through a system bus.
  • the storage medium may be volatile or non-volatile.
  • the storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • the database may store control information sequences, which are readable by the computer.
  • the processor can realize a medicine classification method.
  • the processor of the computer equipment is used to provide calculation and control capabilities, and supports the operation of the entire computer equipment.
  • Computer-readable instructions may be stored in the memory of the computer device, and when the computer-readable instructions are executed by the processor, they cause the processor to execute a medicine classification method.
  • the network interface of the computer device is used to connect and communicate with the terminal.
  • FIG. 9 is only a block diagram of the part of the structure related to the solution of the present application, and does not limit the computer equipment to which the solution is applied. A specific computer device may include more or fewer parts than shown in the figure, combine some parts, or arrange the parts differently.
  • the processor is used to execute the specific functions of the acquisition module 2100, the processing module 2200, and the execution module 2300 in FIG. 8, and the memory stores the program codes and various data required to execute the above modules.
  • the network interface is used for data transmission between user terminals or servers.
  • the memory in this embodiment stores the program codes and data required to execute all the sub-modules in the medicine classification device, and the server can call the program codes and data of the server to execute the functions of all the sub-modules.
  • When the computer equipment classifies drugs, it obtains the name of each drug and the condition information that the drug treats by collecting the user's case information, converts the drug name and the corresponding condition information into the target feature word vector, and inputs the target feature word vector into an unsupervised drug classification model.
  • The drug classification model clusters drugs that can cure the same or similar conditions together to form a cluster category, and this cluster category becomes a drug classification category. Finally, the drug classification is completed by labeling the names of the drugs in the classification category.
  • This classification method can improve the efficiency of drug classification, and the use of case information can further strengthen the correspondence between drugs and disease conditions, and improve the accuracy of the classification results.
  • the present application also provides a storage medium storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the drug classification method in any of the foregoing embodiments.
  • the computer program can be stored in a computer readable storage medium. When executed, it may include the processes of the above-mentioned method embodiments.
  • the aforementioned storage medium may be a storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Abstract

A pharmaceutical drug classification method and apparatus, a computer device and a storage medium, comprising: on the basis of case information of a patient, obtaining target feature word vectors representing the state of illness of and the pharmaceutical drugs used by said patient; inputting the target feature word vectors into a preset pharmaceutical drug classification model, wherein said classification model is an unsupervised training model that implements clustering by means of calculating distances between different feature word vectors; on the basis of a cluster set of the pharmaceutical drugs used, said cluster set being outputted by said classification model, carrying out classification annotation on the pharmaceutical drugs used, wherein the classification annotation is at least one high-frequency word in said cluster set. The present classification method improves classification efficiency, and the use of case information therein further enhances the correspondence between the pharmaceutical drug and the state of illness, thus increasing the accuracy of classification results.

Description

Drug classification method, device, computer equipment and storage medium
This application claims priority to Chinese patent application No. 201910881521.5, filed with the China Patent Office on September 18, 2019 and entitled "Drug classification method, device, computer equipment and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The embodiments of this application relate to the field of drug classification, and in particular to a drug classification method, device, computer equipment and storage medium.
Background
Drug classification management is an internationally accepted management approach. Following the principles of drug safety and effectiveness, it divides drugs into prescription drugs and non-prescription drugs according to their varieties, specifications, indications, dosages and routes of administration, and sets corresponding management regulations. Its purpose is to ensure the safety of the public's medication.
In the prior art, drug classification models mainly rely on supervised models, which require substantial labor up front to label samples. The inventors realized that manual labeling is often inaccurate and the resulting classification incomplete, so further labor is needed for maintenance operations such as adding and modifying categories. As a result, drug classification is time-consuming and labor-intensive, and its accuracy is low.
Summary of the invention
The embodiments of this application provide a drug classification method, device, computer equipment and storage medium that can complete drug classification without labeling.
To solve the above technical problem, one technical solution adopted by the embodiments of this application is to provide a drug classification method, including: acquiring, according to a user's case information, a target feature word vector that characterizes the user's condition and the drugs used, where the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by performing stop word filtering on the text information and then performing statistics; inputting the target feature word vector into a preset drug classification model, where the drug classification model is an unsupervised training model that clusters by calculating the distance between different feature word vectors; and classifying and labeling the drugs used according to the cluster set of the drugs used that is output by the drug classification model, where the classification label is at least one high-frequency word in the cluster set of the drugs used.
To solve the above technical problem, an embodiment of this application also provides a drug classification device, including: an acquisition module, configured to acquire, according to a user's case information, a target feature word vector that characterizes the user's condition and the drugs used, where the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by performing stop word filtering on the text information and then performing statistics; a processing module, configured to input the target feature word vector into a preset drug classification model, where the drug classification model is an unsupervised training model that clusters by calculating the distance between different feature word vectors; and an execution module, configured to classify and label the drugs used according to the cluster set of the drugs used that is output by the drug classification model, where the classification label is at least one high-frequency word in the cluster set of the drugs used.
To solve the above technical problem, an embodiment of this application also provides a computer device including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of a drug classification method, the drug classification method including the following steps: acquiring, according to a user's case information, a target feature word vector that characterizes the user's condition and the drugs used, where the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by performing stop word filtering on the text information and then performing statistics; inputting the target feature word vector into a preset drug classification model, where the drug classification model is an unsupervised training model that clusters by calculating the distance between different feature word vectors; and classifying and labeling the drugs used according to the cluster set of the drugs used that is output by the drug classification model, where the classification label is at least one high-frequency word in the cluster set of the drugs used.
To solve the above technical problem, an embodiment of this application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of a drug classification method, the drug classification method including the following steps: acquiring, according to a user's case information, a target feature word vector that characterizes the user's condition and the drugs used, where the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by performing stop word filtering on the text information and then performing statistics; inputting the target feature word vector into a preset drug classification model, where the drug classification model is an unsupervised training model that clusters by calculating the distance between different feature word vectors; and classifying and labeling the drugs used according to the cluster set of the drugs used that is output by the drug classification model, where the classification label is at least one high-frequency word in the cluster set of the drugs used.
The embodiments of this application can improve the efficiency of drug classification, and the use of case information further strengthens the correspondence between drugs and medical conditions, improving the accuracy of the classification results.
Description of the drawings
FIG. 1 is a schematic diagram of the basic flow of the drug classification method according to an embodiment of this application;
FIG. 2 is a schematic flowchart of collecting the first word vector through a neural network model according to an embodiment of this application;
FIG. 3 is a schematic flowchart of extracting word vectors through a keyword set according to an embodiment of this application;
FIG. 4 is a schematic flowchart of generating a first-level cluster set according to an embodiment of this application;
FIG. 5 is a schematic flowchart of generating a second-level cluster set according to an embodiment of this application;
FIG. 6 is a schematic flowchart of generating a third-level cluster set according to an embodiment of this application;
FIG. 7 is a schematic diagram of the three-level classification according to an embodiment of this application;
FIG. 8 is a schematic diagram of the basic structure of the drug classification device according to an embodiment of this application;
FIG. 9 is a block diagram of the basic structure of the computer device according to an embodiment of this application.
Detailed description
Please refer to FIG. 1, which is a schematic diagram of the basic flow of the drug classification method in this embodiment.
As shown in FIG. 1, a drug classification method includes:
S1100. Acquire, according to the user's case information, a target feature word vector that characterizes the user's condition and the drugs used, where the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by performing stop word filtering on the text information and then performing statistics;
When a user seeks medical treatment at a hospital or clinic, the user's behavior information is recorded throughout the visit. The recorded behavior information includes the user's condition information, information on the drugs used, and the user's laboratory test results; this behavior information is defined as the user's condition information. The condition information is text information, but it is not limited to this: depending on the application scenario, in some embodiments the condition information also includes picture information and sound information.
After the case information is obtained, the target feature word vector is extracted from it; the target feature word vector is vector information that characterizes the user's condition and the drugs used. The target feature word vector can be extracted by a neural network model that has been trained to a convergent state. In some embodiments, the target feature word vector can instead be calculated from word frequency statistics of the keywords in the case information. In some embodiments, extraction is first performed by the neural network model, a second calculation is then made by word frequency statistics, and the results of the two methods are merged to obtain the target feature word vector.
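The merging of the two results can be sketched as follows. This is only an illustration: the text does not say how the first and second word vectors are merged, so concatenation is an assumption here, and the function name `merge_word_vectors` and the sample values are hypothetical.

```python
from typing import List


def merge_word_vectors(first: List[float], second: List[float]) -> List[float]:
    """Merge the neural-network vector and the word-frequency vector.

    Assumption: merging is modelled as simple concatenation, because the
    source does not specify the merge operation.
    """
    return list(first) + list(second)


# Hypothetical values standing in for the two extraction results.
first_word_vector = [0.12, 0.80, 0.05]   # from the neural network model
second_word_vector = [0.33, 0.67]        # from word-frequency statistics

target_feature_word_vector = merge_word_vectors(first_word_vector, second_word_vector)
print(target_feature_word_vector)
```

Concatenation keeps both sources of information intact and leaves it to the distance calculation to weigh them; other merge rules, such as a weighted sum, would require the two vectors to share a dimension.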
S1200. Input the target feature word vector into a preset drug classification model, where the drug classification model is an unsupervised training model that clusters by calculating the distance between different feature word vectors;
The target feature word vector is input into the preset drug classification model, where the drug classification model is an unsupervised training model that clusters by calculating the distance between different feature word vectors.
In this embodiment, the drug classification model is an unsupervised training model that clusters feature word vectors. The model mainly calculates the inter-class distance between different feature word vectors, sets a distance threshold as the measure, and clusters the feature word vectors whose inter-class distance is less than the distance threshold into a cluster set. Clustering a large number of feature word vectors, including the target feature word vector, forms multiple cluster sets, and each cluster set is one classification category of drugs.
Calculating the inter-class distance amounts to calculating the similarity of the condition information treated by different drugs: the smaller the inter-class distance, the closer the efficacy of the drugs, and the larger the inter-class distance, the greater the difference in efficacy. Different classification categories therefore differ in the diseases they cure or the effects they achieve.
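The threshold rule described above can be illustrated with a minimal sketch. It is not the drug classification model itself: the greedy assignment strategy, the function name `cluster_by_threshold` and the sample vectors are assumptions chosen for illustration, and only the Euclidean distance and the "less than the distance threshold" test come from the text.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cluster_by_threshold(vectors: List[Vector], threshold: float) -> List[List[int]]:
    """Greedy clustering by Euclidean distance.

    A vector joins the first cluster that already contains a member closer
    than `threshold`; otherwise it starts a new cluster. Clusters are
    returned as lists of indices into `vectors`.
    """
    clusters: List[List[int]] = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if any(math.dist(vec, vectors[j]) < threshold for j in members):
                members.append(i)
                break
        else:
            # No existing cluster is close enough, so open a new one.
            clusters.append([i])
    return clusters


# Hypothetical 2-D feature word vectors: two tight groups far apart.
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = cluster_by_threshold(vectors, threshold=1.0)
print(clusters)
```

With these sample values the first two vectors fall into one cluster set and the last two into another, which mirrors the idea that drugs treating similar conditions end up in the same category.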
In some embodiments, to further refine the classification categories of drugs, the categories are divided into different levels: after the first level is divided, further classification is performed within each cluster set. The method is to reduce the distance threshold so that the feature word vectors in a cluster set are further distinguished. At the same time, reducing the parameter value of the effective point spacing in the feature word vectors makes the intra-class distances converge more tightly, and this convergence further increases the inter-class distances between feature word vectors. The differentiation between feature word vectors in the cluster set thus increases, which provides good conditions for further subdividing the categories in the cluster set.
With this method, by continually adjusting the distance threshold and the parameter value of the effective point spacing, the cluster sets at each level can be refined further to form classification categories with an attribute distribution. In some embodiments, the cluster set is divided into 3 levels, but this is not a limitation: depending on the application scenario, the cluster set can be divided into 1, 2, 4, 5 or more levels.
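The level-by-level refinement can be sketched as repeated splitting with a decreasing distance threshold. This is a simplified illustration under stated assumptions: the helper names `split_group` and `hierarchical_clusters`, the greedy splitting rule and the sample data are hypothetical, and the adjustment of the effective point spacing parameter is not modelled.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def split_group(indices: List[int], vectors: List[Vector], threshold: float) -> List[List[int]]:
    """Split one cluster set into sub-clusters using a (smaller) threshold."""
    groups: List[List[int]] = []
    for i in indices:
        for group in groups:
            if any(math.dist(vectors[i], vectors[j]) < threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def hierarchical_clusters(vectors: List[Vector], thresholds: List[float]) -> List[List[List[int]]]:
    """Return the cluster sets of every level, one list per threshold.

    `thresholds` is expected to decrease, matching the requirement that each
    level's distance threshold is less than the previous one.
    """
    levels: List[List[List[int]]] = []
    current = [list(range(len(vectors)))]
    for threshold in thresholds:
        current = [sub for group in current for sub in split_group(group, vectors, threshold)]
        levels.append(current)
    return levels


# Hypothetical 1-D feature values and three decreasing thresholds.
vectors = [(0.0,), (0.4,), (1.5,), (10.0,), (10.3,)]
levels = hierarchical_clusters(vectors, thresholds=[3.0, 1.0, 0.35])
for depth, cluster_sets in enumerate(levels, start=1):
    print(depth, cluster_sets)
```

Because each level only splits the cluster sets of the previous level, the result forms a tree, which fits the dendrogram arrangement shown for the three-level classification.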
S1300. Classify and label the drugs used according to the cluster set of the drugs used that is output by the drug classification model, where the classification label is at least one high-frequency word in the cluster set of the drugs used.
According to the cluster sets output by the drug classification model, each cluster set and the drugs in the last-level set are labeled. A cluster set is labeled by extracting the most frequent word in the case information of the drugs in that cluster set as the label name of the cluster set. In some embodiments, when one cluster set has multiple label names, they are selected in order of frequency of occurrence. Drug names are labeled by extracting the drug name directly from the case information.
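Labeling a cluster set by its most frequent words can be sketched with a simple counter. The function name `label_cluster` and the sample keywords are hypothetical, and the case information is assumed to be already tokenized into keywords.

```python
from collections import Counter
from typing import Iterable, List


def label_cluster(case_keywords: Iterable[List[str]], top_n: int = 1) -> List[str]:
    """Return the `top_n` most frequent words across the cases of one cluster set."""
    counts = Counter(word for words in case_keywords for word in words)
    # most_common sorts by descending frequency, which gives the selection order
    # when several label names are wanted.
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical keyword lists for the cases that fell into one cluster set.
cluster_cases = [
    ["fever", "cough", "ibuprofen"],
    ["fever", "cough", "paracetamol"],
    ["fever", "headache", "aspirin"],
]
labels = label_cluster(cluster_cases, top_n=2)
print(labels)
```

In this sample, "fever" appears in all three cases and "cough" in two, so they are chosen in that order as the label names of the cluster set.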
In the above embodiment, when drugs are classified, the name of each drug and the condition information that the drug treats are obtained by collecting the user's case information. The drug name and the corresponding condition information are converted into a target feature word vector, and the target feature word vector is input into the unsupervised drug classification model. The drug classification model clusters drugs that can cure the same or similar conditions together to form a cluster category, and this cluster category becomes one category of the drug classification. Finally, the drug classification is completed by labeling the names of the drugs in the classification category. This classification method can improve the efficiency of drug classification, and the use of case information further strengthens the correspondence between drugs and medical conditions, improving the accuracy of the classification results.
在一些实施方式中,需要通过神经网络模型对病例信息进行特征提取。请参阅图2,图2为本实施例通过神经网络模型采集第一词向量的流程示意图。In some embodiments, it is necessary to perform feature extraction on case information through a neural network model. Please refer to FIG. 2, which is a schematic diagram of the process of collecting the first word vector through the neural network model in this embodiment.
如图2所示,S1100包括:As shown in Figure 2, S1100 includes:
S1111、将所述病例信息转化为行为向量集;S1111, converting the case information into a behavior vector set;
将病例信息转化为能够被神经网络模型识别或者处理的向量集,采用的方法为通过word2vec模型将病例信息转化为向量集。但是不局限于此,根据具体应用场景的不同,在一些实施方式中,还能够通过TF-IDF(term frequency–inverse document frequency)技术对病例信息进行向量转化。The case information is transformed into a vector set that can be recognized or processed by the neural network model. The method used is to convert the case information into a vector set through the word2vec model. However, it is not limited to this. According to different specific application scenarios, in some embodiments, the case information can also be vectorized through TF-IDF (term frequency—inverse document frequency) technology.
S1112、将所述行为向量集输入至预设的特征提取模型中,其中,所述特征提取模型为预先训练至收敛状态,用于提取行为向量集中表征用户行为向量的神经网络模型;S1112, input the behavior vector set into a preset feature extraction model, where the feature extraction model is a neural network model that is pre-trained to a convergent state and is used to extract behavior vectors to represent user behavior vectors;
将转化后的向量集输入至预设的特征提取模型中,其中,特征提取模型为预先训练至收敛状态,用于提取行为向量集中表征用户行为向量的神经网络模型。The converted vector set is input into a preset feature extraction model, where the feature extraction model is pre-trained to a convergent state, and is used to extract a neural network model that represents user behavior vectors in a set of behavior vectors.
本实施方式中,特征提取模型用于提取向量集中用户病情信息和药品信息关联的词向量。In this embodiment, the feature extraction model is used to extract word vectors associated with user medical information and drug information in the vector set.
为使特征提取模型能够准确的提取与用户病情信息和药品信息关联的词向量,需要对特征提取模型进行训练。训练的方式为:收集训练样本集,训练样本集由若干病例信息转换后的向量集组成,人工标定各个向量集中的词向量,然后,将进行过标注的向量集依次输入至神经网络模型中,神经网络模型提取了激励词向量后,计算激励词向量与标注词向量之间的距离,若距离大于设定的距离阈值,则通过反向传播算法对神经网络模型的权值进行校准。校准完成后,重复上述步骤直至激励词向量与标注词向量之间的距离小于设定的距离阈值,则该向量集训练通过,采用上述方法对训练样本集内的向量集均进行训练,直至该神经网络模型提取词向量的准确率大于设定的值(例如98%)时训练结束,训练完成后的神经网络模型为特征提取模型。In order for the feature extraction model to accurately extract the word vectors associated with the user's medical condition information and drug information, the feature extraction model needs to be trained. The training method is: collect training sample sets, which are composed of vector sets after conversion of several case information, manually calibrate the word vectors in each vector set, and then input the labeled vector sets into the neural network model in turn. After the neural network model extracts the excitation word vector, it calculates the distance between the excitation word vector and the label word vector. If the distance is greater than the set distance threshold, the weight of the neural network model is calibrated through the back propagation algorithm. After the calibration is completed, repeat the above steps until the distance between the excited word vector and the labeled word vector is less than the set distance threshold, then the vector set training is passed, and the vector set in the training sample set is trained by the above method until the When the accuracy of extracting word vectors by the neural network model is greater than a set value (for example, 98%), the training ends, and the neural network model after the training is completed is a feature extraction model.
The feature extraction model trained to the convergent state can accurately extract the word vectors in the vector set that are associated with the user's condition information and drug information; these word vectors are the user behavior vector.
S1113. Read the user behavior vector output by the feature extraction model, and define the user behavior vector as the first word vector.
The user behavior vector output by the feature extraction model is read. Because the user behavior vector represents the word vectors associated with the user's condition information and drug information, the extracted user behavior vector can serve as input data for the drug classification model. In this embodiment, the user behavior vector is defined as the first word vector.
A neural network model trained to convergence can quickly extract the word vectors that carry key information, which simplifies the data processing of the drug classification model and helps improve its processing efficiency.
In some embodiments, to capture more of the user condition information and drug information recorded in the case information and to reduce the rate at which key information is missed, further extraction by keyword extraction is required. Please refer to FIG. 3, which is a schematic flowchart of extracting word vectors through a keyword set in this embodiment.
As shown in FIG. 3, after S1113, the method includes:
S1121. Filter the case information through a preset stop word list to generate a keyword set;
In this embodiment, to further filter out information in the case information that is unrelated to the user's condition information and drug information, the case information is filtered using stop words; stop words are the words that are filtered out.
A stop word list is established that records the stop words obtained through statistics. For example, if words of certain parts of speech, such as verbs, adverbs, and adjectives, are set as stop words, then filtering with the stop word list removes the words of those parts of speech from the case information. The case information that remains after the stop words are removed forms the keyword set, which records the keywords of the user's condition information and drug information.
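The filtering step can be illustrated with a short sketch. It assumes the case text has already been segmented into tokens and that the stop words are given as an explicit list; the function name `filter_stop_words` and the sample tokens are hypothetical, and a part-of-speech tagger would be needed to build the list in the way the example above describes.

```python
def filter_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop word list, in order."""
    stop = set(stop_words)
    return [token for token in tokens if token not in stop]

# Hypothetical segmented case text and stop word list.
tokens = ["patient", "has", "severe", "cough", "prescribed", "ambroxol"]
stop_words = ["has", "severe", "prescribed"]

keyword_set = filter_stop_words(tokens, stop_words)
print(keyword_set)  # ['patient', 'cough', 'ambroxol']
```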
S1122. Count the word frequency of each keyword in the keyword set and the inverse document frequency of each keyword;
After the keyword set is obtained by filtering, the word frequency of each keyword in the keyword set is calculated. The word frequency is calculated as:
Figure PCTCN2019117240-appb-000001
After the word frequency of each keyword is calculated, the inverse document frequency of each keyword is calculated. The inverse document frequency is used to determine the importance of each keyword; in general, its magnitude is inversely related to how common a word is. The inverse document frequency is calculated as:
Figure PCTCN2019117240-appb-000002
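The two formulas are present only as figure references in this text. For orientation, the commonly used textbook definitions are given below. Whether the application uses exactly these forms, in particular the logarithm and the added 1 in the denominator, is an assumption; the symbols $n_{w,d}$, $N$, and $D$ are introduced here for illustration.

```latex
\mathrm{TF}(w, d) = \frac{n_{w,d}}{\sum_{k} n_{k,d}},
\qquad
\mathrm{IDF}(w) = \log \frac{N}{1 + \lvert \{\, d \in D : w \in d \,\} \rvert}
```

Here $n_{w,d}$ is the number of times keyword $w$ occurs in document $d$, the sum in the denominator is the total number of keyword occurrences in $d$, $N$ is the number of documents in the corpus $D$, and the set in the denominator contains the documents in which $w$ appears.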
S1123. Calculate the priority value of each keyword from the word frequency and the inverse document frequency;
After the word frequency and inverse document frequency of each keyword are calculated, the two are multiplied to obtain the priority value of each keyword. The keywords are sorted in descending order of priority value, and as many of the top-ranked keywords as are actually needed are selected as the keywords to be converted; for example, the top 20 keywords are extracted. The number of keywords to be converted is not limited to this; depending on the specific application scenario, in some embodiments it can be any value.
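A minimal sketch of the priority calculation and the top-k selection follows. It assumes the textbook TF and IDF definitions (with 1 added to the document count in the IDF denominator), which may differ from the formulas of the application; the function names and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def priority_values(doc_keywords, corpus):
    """Priority value (word frequency * inverse document frequency) per keyword."""
    counts = Counter(doc_keywords)
    total = len(doc_keywords)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        containing = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / (1 + containing))  # smoothed IDF (assumption)
        scores[word] = tf * idf
    return scores

def top_keywords(scores, k=20):
    """Keywords sorted by descending priority value, truncated to the top k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical corpus of keyword sets, one per piece of case information.
corpus = [
    ["cough", "ambroxol", "cough"],
    ["fever", "ibuprofen"],
    ["cough", "fever"],
    ["headache", "ibuprofen"],
]
scores = priority_values(corpus[0], corpus)
ranked = top_keywords(scores, k=20)
```

In this sample, "ambroxol" ranks above "cough" even though "cough" occurs more often in the document, because "cough" also appears in another document and therefore has a lower inverse document frequency.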
S1124. Generate the second word vector according to the priority value of each keyword.
The keywords are sorted in descending order of priority value, and as many of the top-ranked keywords as are actually needed are selected as the keywords to be converted. The keywords selected by priority value are then converted into the second word vector through a word2vec model or the TF-IDF technique.
In some embodiments, the first word vector is extracted by the neural network model. The association between the word vectors extracted by the neural network model and the text information in essence carries human subjective intent and is learned through repeated targeted training. However, a neural network model has the defect that it is difficult to converge when multiple association relationships are cross-trained, so the extracted first word vector may be insufficiently comprehensive or may omit key word vectors. The second word vector, by contrast, is obtained statistically after stop word filtering. The statistics carry no personal intent and reflect the distribution of the keywords most directly, so the extraction is more comprehensive but has no emphasis. The first word vector and the second word vector are merged to generate the target feature word vector, whose data is more comprehensive: it highlights the feature word vectors that people care about while also covering the feature word vectors that objectively exist. The extracted data is therefore both comprehensive and focused, which helps improve the accuracy of the drug classification model. See step S1131.
S1131. Merge the first word vector and the second word vector to generate the target feature word vector.
The first word vector and the second word vector are merged as follows: the word vector matrix composed of the first word vectors is added to the word vector matrix composed of the second word vectors. The result of the addition is the vector matrix of the target feature word vector, which is the input data of the drug classification model.
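The addition can be sketched as an element-wise sum of two matrices. The text does not state how the two matrices are brought to the same shape, so this sketch simply assumes they already have identical dimensions and rejects the input otherwise; the function name and sample values are hypothetical.

```python
def merge_word_vectors(first, second):
    """Element-wise sum of two word vector matrices of identical shape."""
    same_rows = len(first) == len(second)
    same_cols = all(len(a) == len(b) for a, b in zip(first, second))
    if not (same_rows and same_cols):
        raise ValueError("word vector matrices must have the same shape")
    return [[a + b for a, b in zip(row1, row2)]
            for row1, row2 in zip(first, second)]

first_matrix = [[0.5, 0.25], [1.0, 0.0]]    # hypothetical first word vectors
second_matrix = [[0.5, 0.75], [0.0, 2.0]]   # hypothetical second word vectors
target_matrix = merge_word_vectors(first_matrix, second_matrix)
print(target_matrix)  # [[1.0, 1.0], [1.0, 2.0]]
```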
In some embodiments, the drug classification model generates a first-level cluster set; the cluster set to which the target feature word vector belongs is determined by calculating the Euclidean distance between the target feature word vector and different feature word vectors. Please refer to FIG. 4, which is a schematic flowchart of generating the first-level cluster set in this embodiment.
As shown in FIG. 4, S1200 includes:
S1211. Calculate the first Euclidean distance between the target feature word vector and different feature word vectors;
When the drug classification model classifies the target feature word vector, it calculates the distance between the target feature word vector and other feature word vectors. Specifically, the Euclidean distances between the target feature word vector and the different feature word vectors are calculated; these are collectively referred to as the first Euclidean distance. The calculation is not limited to this: in some embodiments, the Mahalanobis distance or the cosine distance between the target feature word vector and the different feature word vectors is calculated instead.
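Two of the distance measures named above can be written in a few lines. The sketch below uses the standard definitions of Euclidean and cosine distance; the Mahalanobis distance is omitted because it additionally needs the inverse covariance matrix of the data. The function names are chosen for this example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))     # 1.0
```

Euclidean distance is sensitive to vector length, while cosine distance depends only on direction, so the distance thresholds used for clustering would generally need to be chosen separately for each measure.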
S1212. Compare the first Euclidean distance with a preset first distance threshold;
The first Euclidean distance between the target feature word vector and each different feature word vector is compared with the set first distance threshold. The first distance threshold measures whether feature word vectors meet the first screening condition; for example, the first distance threshold is 0.5.
Comparing the first Euclidean distance with the preset first distance threshold determines which feature word vector, or which class of feature word vectors, the target feature word vector should be clustered with.
S1213. When the first Euclidean distance is less than the first distance threshold, cluster the target feature vector into the cluster set represented by the first Euclidean distance to generate a first-level cluster set.
If the comparison shows that the first Euclidean distance between the target feature word vector and a certain feature word vector is less than the first distance threshold, the target feature word vector should be clustered into the cluster set that contains that feature word vector. Once the target feature word vectors of all the case information have been clustered, the first-level cluster set is generated; the first-level cluster set consists of at least one cluster set.
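Steps S1211 to S1213 can be sketched as a simple threshold clustering. The text does not say what happens when a vector is within the threshold of members of several cluster sets, or of none; this sketch assumes the vector joins the cluster set of its nearest qualifying member and otherwise starts a new cluster set. The function name and sample vectors are hypothetical.

```python
import math

def threshold_cluster(vectors, threshold):
    """Group vector indices so that each vector joins the cluster set of the
    nearest already-clustered vector closer than threshold, or starts a new one."""
    clusters = []  # each cluster set is a list of indices into vectors
    for idx, vec in enumerate(vectors):
        best, best_dist = None, threshold
        for cluster in clusters:
            for member in cluster:
                dist = math.dist(vec, vectors[member])  # Euclidean distance
                if dist < best_dist:
                    best, best_dist = cluster, dist
        if best is None:
            clusters.append([idx])
        else:
            best.append(idx)
    return clusters

# Hypothetical two-dimensional target feature word vectors.
vectors = [(0.0, 0.0), (0.3, 0.0), (2.0, 2.0), (2.2, 2.1)]
first_level = threshold_cluster(vectors, threshold=0.5)
print(first_level)  # [[0, 1], [2, 3]]
```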
In some embodiments, the drug classification model generates a second-level cluster set, which requires further, finer clustering on the basis of the first-level cluster set. Please refer to FIG. 5, which is a schematic flowchart of generating the second-level cluster set in this embodiment.
As shown in FIG. 5, after S1213, the method includes:
S1221. Correct the parameter value of the effective point spacing in the drug classification model to generate a first parameter value, and calculate, within the first-level cluster set, the second Euclidean distance between the target feature word vector and different feature word vectors;
Before the drug classification model performs second-level clustering, the effective point spacing parameter of the drug classification model needs to be adjusted. The effective point spacing refers to the inter-class distances of the feature word vectors that are not ignored. For the sake of computational efficiency, intra-class distances are screened before the inter-class distances are calculated. The screening is done by setting a parameter value for the effective point spacing: inter-class distances smaller than this parameter value are judged invalid. Lowering the parameter value therefore increases the diversity of intra-class distances, reveals more detail in each feature word vector, and enlarges the differences between feature word vectors in the same cluster set, which facilitates second-level clustering. The corrected parameter value of the effective point spacing is the first parameter value. The first parameter value is smaller than the effective point spacing parameter value that the drug classification model used for the first-level cluster set.
After the first parameter value is set, the drug classification model performs second-level clustering within each cluster set of the first-level cluster set. Second-level clustering proceeds as follows: within the cluster set that contains the target feature word vector, the second Euclidean distance between the target feature word vector and the other feature word vectors is calculated. The calculation is not limited to this: in some embodiments, the second Euclidean distance calculation can be replaced by calculating the Mahalanobis distance or the cosine distance between the target feature word vector and the different feature word vectors.
S1222. Compare the second Euclidean distance with a preset second distance threshold, where the second distance threshold is less than the first distance threshold;
The second Euclidean distance between the target feature word vector and each different feature word vector is compared with the set second distance threshold. The second distance threshold measures whether feature word vectors meet the second screening condition; for example, the second distance threshold is 0.1.
Comparing the second Euclidean distance with the preset second distance threshold determines which feature word vector, or which class of feature word vectors, the target feature word vector should be clustered with inside the cluster set that contains it.
S1223. When the second Euclidean distance is less than the second distance threshold, cluster the target feature vector into the cluster set represented by the second Euclidean distance to generate a second-level cluster set.
If the comparison shows that the second Euclidean distance between the target feature word vector and a certain feature word vector is less than the second distance threshold, the target feature word vector should be clustered into the cluster set that contains that feature word vector. Once the feature word vectors in all cluster sets have been clustered, the second-level cluster set is generated; the second-level cluster set consists of at least one cluster set.
In some embodiments, the drug classification model generates a third-level cluster set, which requires further, finer clustering on the basis of the second-level cluster set. Please refer to FIG. 6, which is a schematic flowchart of generating the third-level cluster set in this embodiment.
As shown in FIG. 6, after S1223, the method includes:
S1231. Correct the parameter value of the effective point spacing in the drug classification model to generate a second parameter value, and calculate, within the second-level cluster set, the third Euclidean distance between the target feature word vector and different feature word vectors, where the second parameter value is smaller than the first parameter value;
Before the drug classification model performs third-level clustering, the effective point spacing parameter of the drug classification model needs to be adjusted. The effective point spacing refers to the inter-class distances of the feature word vectors that are not ignored. For the sake of computational efficiency, intra-class distances are screened before the inter-class distances are calculated. The screening is done by setting a parameter value for the effective point spacing: inter-class distances smaller than this parameter value are judged invalid. Lowering the parameter value therefore increases the diversity of intra-class distances, reveals more detail in each feature word vector, and enlarges the differences between feature word vectors in the same cluster set, which facilitates third-level clustering. The corrected parameter value of the effective point spacing is the second parameter value. The second parameter value is smaller than the first parameter value.
After the second parameter value is set, the drug classification model performs third-level clustering within each cluster set of the second-level cluster set. Third-level clustering proceeds as follows: within the cluster set that contains the target feature word vector, the third Euclidean distance between the target feature word vector and the other feature word vectors is calculated. The calculation is not limited to this: in some embodiments, the third Euclidean distance calculation can be replaced by calculating the Mahalanobis distance or the cosine distance between the target feature word vector and the different feature word vectors.
S1232. Compare the third Euclidean distance with a preset third distance threshold, where the third distance threshold is less than the second distance threshold;
The third Euclidean distance between the target feature word vector and each different feature word vector is compared with the set third distance threshold. The third distance threshold measures whether feature word vectors meet the third screening condition; for example, the third distance threshold is 0.05.
Comparing the third Euclidean distance with the preset third distance threshold determines which feature word vector, or which class of feature word vectors, the target feature word vector should be clustered with inside the cluster set that contains it.
S1233. When the third Euclidean distance is less than the third distance threshold, cluster the target feature vector into the cluster set represented by the third Euclidean distance to generate a third-level cluster set.
If the comparison shows that the third Euclidean distance between the target feature word vector and a certain feature word vector is less than the third distance threshold, the target feature word vector should be clustered into the cluster set that contains that feature word vector. Once the feature word vectors in all cluster sets have been clustered, the third-level cluster set is generated; the third-level cluster set consists of at least one cluster set. This completes the three-level classification of drugs. The number of classification levels is not limited to this: in some embodiments, the parameter value of the effective point spacing and the distance threshold are corrected further, which allows still finer classification.
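The three clustering passes can be sketched together as a recursive refinement that reuses one threshold per level (0.5, 0.1, and 0.05, the example values given in the text). The sketch is an illustration under assumptions: the effective point spacing parameter is not modeled, a vector joins the cluster set of its nearest qualifying member, and the function names and sample vectors are hypothetical.

```python
import math

def threshold_cluster(indices, vectors, threshold):
    """Cluster the given indices; a vector joins the cluster set of the nearest
    already-clustered vector closer than threshold, or starts a new one."""
    clusters = []
    for idx in indices:
        best, best_dist = None, threshold
        for cluster in clusters:
            for member in cluster:
                dist = math.dist(vectors[idx], vectors[member])
                if dist < best_dist:
                    best, best_dist = cluster, dist
        if best is None:
            clusters.append([idx])
        else:
            best.append(idx)
    return clusters

def hierarchical_cluster(indices, vectors, thresholds):
    """Refine each cluster set with the next, smaller threshold; returns a
    nested list with one nesting level per threshold."""
    if not thresholds:
        return list(indices)
    return [hierarchical_cluster(cluster, vectors, thresholds[1:])
            for cluster in threshold_cluster(indices, vectors, thresholds[0])]

# Hypothetical two-dimensional target feature word vectors.
vectors = [(0.0, 0.0), (0.02, 0.0), (0.08, 0.0), (0.4, 0.0), (3.0, 3.0)]
tree = hierarchical_cluster(range(len(vectors)), vectors, [0.5, 0.1, 0.05])
print(tree)  # [[[[0, 1], [2]], [[3]]], [[[4]]]]
```

The nested result has the tree shape of a dendrogram: vectors 0 to 3 share a first-level cluster set, vector 3 separates at the second level, and vector 2 separates from vectors 0 and 1 at the third level.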
Please refer to FIG. 7, which is a schematic diagram of the three-level classification in this embodiment.
As shown in FIG. 7, the drug classification is divided into three levels: a first-level cluster set 11, a second-level cluster set 12, and a third-level cluster set 13. The cluster sets of the three levels are arranged as a dendrogram.
To solve the above technical problem, an embodiment of the present application further provides a drug classification device.
Please refer to FIG. 8, which is a schematic diagram of the basic structure of the drug classification device of this embodiment.
As shown in FIG. 8, a drug classification device includes an acquisition module 2100, a processing module 2200, and an execution module 2300. The acquisition module 2100 is configured to obtain, from the user's case information, a target feature word vector characterizing the user's condition and the drugs used, where the case information is text information, the target feature word vector includes a first word vector and a second word vector, the first word vector is obtained by extraction from the text information through a neural network model, and the second word vector is obtained statistically after stop word filtering of the text information. The processing module 2200 is configured to input the target feature word vector into a preset drug classification model, where the drug classification model is an unsupervised training model that clusters by calculating the distances between different feature word vectors. The execution module 2300 is configured to label the classification information of the drugs used according to the cluster set of the drugs used output by the drug classification model, where the content of the classification label is at least one high-frequency word in the cluster set of the drugs used.
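The labeling performed by the execution module 2300 can be sketched as picking the most frequent words of a cluster set. The sketch assumes each member of a cluster set is available as a list of keywords; the function name `label_cluster`, the parameter `n_labels`, and the sample data are hypothetical.

```python
from collections import Counter

def label_cluster(cluster_keywords, n_labels=1):
    """Return the n_labels most frequent words across all members of a cluster set."""
    counts = Counter(word for keywords in cluster_keywords for word in keywords)
    return [word for word, _ in counts.most_common(n_labels)]

# Hypothetical keyword lists of three cases that fell into the same cluster set.
cluster = [
    ["cough", "ambroxol"],
    ["cough", "bronchitis"],
    ["cough", "ambroxol"],
]
print(label_cluster(cluster))              # ['cough']
print(label_cluster(cluster, n_labels=2))  # ['cough', 'ambroxol']
```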
When classifying drugs, the drug classification device obtains the drug name and the information on the condition that the drug treats by collecting the user's case information, converts the drug name and the corresponding condition information into a target feature word vector, and inputs the target feature word vector as input data into the unsupervised drug classification model. The drug classification model clusters together the drugs that can cure the same or similar conditions to form a cluster category, and this cluster category can serve as one category of the drug classification. Finally, the drugs in the category are labeled with names, which completes the drug classification. This classification approach improves the efficiency of drug classification, and using case information further strengthens the correspondence between drugs and conditions, which improves the accuracy of the classification result.
In some embodiments, the target feature word vector includes a first word vector, and the drug classification device includes a first conversion submodule, a first processing submodule, and a first execution submodule. The first conversion submodule is configured to convert the case information into a behavior vector set. The first processing submodule is configured to input the behavior vector set into a preset feature extraction model, where the feature extraction model is a neural network model, pre-trained to a convergent state, that extracts from the behavior vector set the vectors characterizing user behavior. The first execution submodule is configured to read the user behavior vector output by the feature extraction model and define the user behavior vector as the first word vector.
In some embodiments, the target feature word vector includes a second word vector, and the drug classification device includes a first filtering submodule, a second processing submodule, a first calculation submodule, and a second execution submodule. The first filtering submodule is configured to filter the case information through a preset stop word list to generate a keyword set. The second processing submodule is configured to count the word frequency of each keyword in the keyword set and the inverse document frequency of each keyword. The first calculation submodule is configured to calculate the priority value of each keyword from the word frequency and the inverse document frequency. The second execution submodule is configured to generate the second word vector according to the priority value of each keyword.
In some embodiments, the drug classification device includes a first merging submodule configured to merge the first word vector and the second word vector to generate the target feature word vector.
In some embodiments, the drug classification device includes a first calculation submodule, a first comparison submodule, and a third execution submodule. The first calculation submodule is configured to calculate the first Euclidean distance between the target feature word vector and different feature word vectors. The first comparison submodule is configured to compare the first Euclidean distance with a preset first distance threshold. The third execution submodule is configured to, when the first Euclidean distance is less than the first distance threshold, cluster the target feature vector into the cluster set represented by the first Euclidean distance to generate a first-level cluster set.
In some embodiments, the drug classification device includes a second calculation submodule, a second comparison submodule, and a fourth execution submodule. The second calculation submodule is configured to correct the parameter value of the effective point spacing in the drug classification model to generate a first parameter value, and to calculate, within the first-level cluster set, the second Euclidean distance between the target feature word vector and different feature word vectors. The second comparison submodule is configured to compare the second Euclidean distance with a preset second distance threshold, where the second distance threshold is less than the first distance threshold. The fourth execution submodule is configured to, when the second Euclidean distance is less than the second distance threshold, cluster the target feature vector into the cluster set represented by the second Euclidean distance to generate a second-level cluster set.
In some embodiments, the drug classification device includes a third calculation submodule, a third comparison submodule, and a fifth execution submodule. The third calculation submodule is configured to correct the parameter value of the effective point spacing in the drug classification model to generate a second parameter value, and to calculate, within the second-level cluster set, the third Euclidean distance between the target feature word vector and different feature word vectors, where the second parameter value is smaller than the first parameter value. The third comparison submodule is configured to compare the third Euclidean distance with a preset third distance threshold, where the third distance threshold is less than the second distance threshold. The fifth execution submodule is configured to, when the third Euclidean distance is less than the third distance threshold, cluster the target feature vector into the cluster set represented by the third Euclidean distance to generate a third-level cluster set.
To solve the above technical problem, an embodiment of the present application further provides a computer device. Please refer to FIG. 9, which is a block diagram of the basic structure of the computer device of this embodiment.
FIG. 9 is a schematic diagram of the internal structure of the computer device. The computer device includes a processor, a storage medium, a memory, and a network interface connected through a system bus. The storage medium may be volatile or non-volatile. The storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database may store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement a drug classification method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to execute a drug classification method. The network interface of the computer device is used to connect to and communicate with a terminal. Those skilled in the art will understand that the structure shown in FIG. 9 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In this embodiment, the processor executes the specific functions of the acquisition module 2100, the processing module 2200, and the execution module 2300 in FIG. 8, and the memory stores the program code and data required to execute these modules. The network interface is used for data transmission to and from user terminals or servers. The memory in this embodiment stores the program code and data required to execute all the sub-modules of the drug classification apparatus, and the server can call this program code and data to execute the functions of all the sub-modules.
When classifying drugs, the computer device collects a user's case information to obtain the drug name and the condition that the drug is used to treat. It converts the drug name and the corresponding condition information into a target feature word vector and feeds that vector as input data into an unsupervised drug classification model. The model clusters drugs that can cure the same or similar conditions together to form a cluster category, which can then serve as one category of the drug classification. Finally, the drugs in that category are labeled with a name, which completes the classification. This approach improves the efficiency of drug classification, and using case information further strengthens the correspondence between drugs and conditions, improving the accuracy of the classification results.
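The flow described above can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the stop-word list, the toy case texts, the distance threshold of 0.3, and the greedy single-pass clustering rule are all assumptions made for illustration, and the neural-network-derived first word vector is omitted, so only a TF-IDF style vector stands in for the target feature word vector.

```python
import math
from collections import Counter

# Hypothetical stop-word list and toy case texts (not taken from the application).
STOP_WORDS = {"given", "for", "and", "with"}
cases = [
    "ibuprofen given for headache and fever headache",
    "ibuprofen for fever and headache",
    "ambroxol given for cough with sputum cough",
    "ambroxol for sputum and cough",
]

def tokenize(text):
    """Split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(texts):
    """Build one TF-IDF vector per text over a shared, sorted vocabulary."""
    docs = [tokenize(t) for t in texts]
    vocab = sorted({w for d in docs for w in d})
    # Document frequency: in how many texts each word occurs.
    df = {w: sum(w in d for d in docs) for w in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        size = max(len(d), 1)  # avoid division by zero for empty texts
        vectors.append(
            [counts[w] / size * math.log(len(docs) / df[w]) for w in vocab]
        )
    return vectors

def cluster_by_threshold(vectors, threshold):
    """Greedy pass: join the first cluster whose first member is closer than
    `threshold` (Euclidean distance), otherwise open a new cluster."""
    clusters = []
    for i, v in enumerate(vectors):
        for members in clusters:
            if math.dist(v, vectors[members[0]]) < threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def label_cluster(members, texts):
    """Use the most frequent non-stop word in the cluster as its label."""
    words = Counter(w for i in members for w in tokenize(texts[i]))
    return words.most_common(1)[0][0]

vectors = tfidf_vectors(cases)
clusters = cluster_by_threshold(vectors, threshold=0.3)
labels = [label_cluster(c, cases) for c in clusters]
print(clusters, labels)
```

In this toy run the two ibuprofen cases and the two ambroxol cases fall into separate clusters, and each cluster is labeled with its most frequent word. A real system would need a tuned threshold and a clustering rule that does not depend on input order.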
The present application also provides a storage medium storing computer-readable instructions. When executed by one or more processors, the computer-readable instructions cause the one or more processors to execute the steps of the drug classification method of any of the foregoing embodiments.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

Claims (20)

  1. A drug classification method, comprising:
    acquiring, according to a user's case information, a target feature word vector that characterizes the user's condition and the drugs used, wherein the case information is text information, the target feature word vector comprises a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics performed on the text information after stop-word filtering;
    inputting the target feature word vector into a preset drug classification model, wherein the drug classification model is an unsupervised training model that performs clustering by calculating the distances between different feature word vectors;
    classifying and labeling the drugs used according to the cluster set of the drugs used that is output by the drug classification model, wherein the content of the classification label is at least one high-frequency word in the cluster set of the drugs used.
  2. The drug classification method according to claim 1, wherein acquiring, according to the user's case information, the target feature word vector that characterizes the user's condition and the drugs used comprises:
    converting the case information into a behavior vector set;
    inputting the behavior vector set into a preset feature extraction model, wherein the feature extraction model is a neural network model that is pre-trained to a convergent state and used to extract the user behavior vector from the behavior vector set;
    reading the user behavior vector output by the feature extraction model, and defining the user behavior vector as the first word vector.
  3. The drug classification method according to claim 2, wherein, after reading the user behavior vector output by the feature extraction model and defining the user behavior vector as the first word vector, the method comprises:
    filtering the case information through a preset stop-word list to generate a keyword set;
    counting the term frequency of each keyword in the keyword set and the inverse document frequency of each keyword;
    calculating a priority value of each keyword from the term frequency and the inverse document frequency;
    generating the second word vector according to the priority value of each keyword.
  4. The drug classification method according to claim 3, wherein, after generating the second word vector according to the priority value of each keyword, the method comprises:
    combining the first word vector and the second word vector to generate the target feature word vector.
  5. The drug classification method according to claim 1, wherein inputting the target feature word vector into the preset drug classification model comprises:
    calculating a first Euclidean distance between the target feature word vector and different feature word vectors;
    comparing the first Euclidean distance with a preset first distance threshold;
    when the first Euclidean distance is less than the first distance threshold, clustering the target feature vector into the cluster set represented by the first Euclidean distance to generate a first-level cluster set.
  6. The drug classification method according to claim 5, wherein, after the step of, when the Euclidean distance is greater than the first distance threshold, clustering the target feature vector into the cluster set represented by the Euclidean distance to generate the first-level cluster set, the method comprises:
    correcting the parameter value of the effective point spacing in the drug classification model to generate a first parameter value, and calculating, within the first-level cluster set, a second Euclidean distance between the target feature word vector and different feature word vectors;
    comparing the second Euclidean distance with a preset second distance threshold, wherein the second distance threshold is smaller than the first distance threshold;
    when the second Euclidean distance is less than the second distance threshold, clustering the target feature vector into the cluster set represented by the second Euclidean distance to generate a second-level cluster set.
  7. The drug classification method according to claim 6, wherein, after the step of, when the second Euclidean distance is greater than the second distance threshold, clustering the target feature vector into the cluster set represented by the second Euclidean distance to generate the second-level cluster set, the method comprises:
    correcting the parameter value of the effective point spacing in the drug classification model to generate a second parameter value, and calculating, within the second-level cluster set, a third Euclidean distance between the target feature word vector and different feature word vectors, wherein the second parameter value is smaller than the first parameter value;
    comparing the third Euclidean distance with a preset third distance threshold, wherein the third distance threshold is smaller than the second distance threshold;
    when the third Euclidean distance is less than the third distance threshold, clustering the target feature vector into the cluster set represented by the third Euclidean distance to generate a third-level cluster set.
  8. A drug classification apparatus, comprising:
    an acquisition module, configured to acquire, according to a user's case information, a target feature word vector that characterizes the user's condition and the drugs used, wherein the case information is text information, the target feature word vector comprises a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics performed on the text information after stop-word filtering;
    a processing module, configured to input the target feature word vector into a preset drug classification model, wherein the drug classification model is an unsupervised training model that performs clustering by calculating the distances between different feature word vectors;
    an execution module, configured to classify and label the drugs used according to the cluster set of the drugs used that is output by the drug classification model, wherein the classification label is at least one high-frequency word in the cluster set of the drugs used.
  9. A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions which, when executed by the processor, cause the processor to execute a drug classification method, the drug classification method comprising the following steps:
    acquiring, according to a user's case information, a target feature word vector that characterizes the user's condition and the drugs used, wherein the case information is text information, the target feature word vector comprises a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics performed on the text information after stop-word filtering;
    inputting the target feature word vector into a preset drug classification model, wherein the drug classification model is an unsupervised training model that performs clustering by calculating the distances between different feature word vectors;
    classifying and labeling the drugs used according to the cluster set of the drugs used that is output by the drug classification model, wherein the content of the classification label is at least one high-frequency word in the cluster set of the drugs used.
  10. The computer device according to claim 9, wherein acquiring, according to the user's case information, the target feature word vector that characterizes the user's condition and the drugs used comprises:
    converting the case information into a behavior vector set;
    inputting the behavior vector set into a preset feature extraction model, wherein the feature extraction model is a neural network model that is pre-trained to a convergent state and used to extract the user behavior vector from the behavior vector set;
    reading the user behavior vector output by the feature extraction model, and defining the user behavior vector as the first word vector.
  11. The computer device according to claim 10, wherein, after reading the user behavior vector output by the feature extraction model and defining the user behavior vector as the first word vector, the method comprises:
    filtering the case information through a preset stop-word list to generate a keyword set;
    counting the term frequency of each keyword in the keyword set and the inverse document frequency of each keyword;
    calculating a priority value of each keyword from the term frequency and the inverse document frequency;
    generating the second word vector according to the priority value of each keyword.
  12. The computer device according to claim 11, wherein, after generating the second word vector according to the priority value of each keyword, the method comprises:
    combining the first word vector and the second word vector to generate the target feature word vector.
  13. The computer device according to claim 9, wherein inputting the target feature word vector into the preset drug classification model comprises:
    calculating a first Euclidean distance between the target feature word vector and different feature word vectors;
    comparing the first Euclidean distance with a preset first distance threshold;
    when the first Euclidean distance is less than the first distance threshold, clustering the target feature vector into the cluster set represented by the first Euclidean distance to generate a first-level cluster set.
  14. The computer device according to claim 13, wherein, after the step of, when the Euclidean distance is greater than the first distance threshold, clustering the target feature vector into the cluster set represented by the Euclidean distance to generate the first-level cluster set, the method comprises:
    correcting the parameter value of the effective point spacing in the drug classification model to generate a first parameter value, and calculating, within the first-level cluster set, a second Euclidean distance between the target feature word vector and different feature word vectors;
    comparing the second Euclidean distance with a preset second distance threshold, wherein the second distance threshold is smaller than the first distance threshold;
    when the second Euclidean distance is less than the second distance threshold, clustering the target feature vector into the cluster set represented by the second Euclidean distance to generate a second-level cluster set.
  15. The computer device according to claim 14, wherein, after the step of, when the second Euclidean distance is greater than the second distance threshold, clustering the target feature vector into the cluster set represented by the second Euclidean distance to generate the second-level cluster set, the method comprises:
    correcting the parameter value of the effective point spacing in the drug classification model to generate a second parameter value, and calculating, within the second-level cluster set, a third Euclidean distance between the target feature word vector and different feature word vectors, wherein the second parameter value is smaller than the first parameter value;
    comparing the third Euclidean distance with a preset third distance threshold, wherein the third distance threshold is smaller than the second distance threshold;
    when the third Euclidean distance is less than the third distance threshold, clustering the target feature vector into the cluster set represented by the third Euclidean distance to generate a third-level cluster set.
  16. A storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute a drug classification method, the drug classification method comprising the following steps:
    acquiring, according to a user's case information, a target feature word vector that characterizes the user's condition and the drugs used, wherein the case information is text information, the target feature word vector comprises a first word vector and a second word vector, the first word vector is obtained by extracting the text information through a neural network model, and the second word vector is obtained by statistics performed on the text information after stop-word filtering;
    inputting the target feature word vector into a preset drug classification model, wherein the drug classification model is an unsupervised training model that performs clustering by calculating the distances between different feature word vectors;
    classifying and labeling the drugs used according to the cluster set of the drugs used that is output by the drug classification model, wherein the content of the classification label is at least one high-frequency word in the cluster set of the drugs used.
  17. The storage medium according to claim 16, wherein acquiring, according to the user's case information, the target feature word vector that characterizes the user's condition and the drugs used comprises:
    converting the case information into a behavior vector set;
    inputting the behavior vector set into a preset feature extraction model, wherein the feature extraction model is a neural network model that is pre-trained to a convergent state and used to extract the user behavior vector from the behavior vector set;
    reading the user behavior vector output by the feature extraction model, and defining the user behavior vector as the first word vector.
  18. The storage medium according to claim 17, wherein, after reading the user behavior vector output by the feature extraction model and defining the user behavior vector as the first word vector, the method comprises:
    filtering the case information through a preset stop-word list to generate a keyword set;
    counting the term frequency of each keyword in the keyword set and the inverse document frequency of each keyword;
    calculating a priority value of each keyword from the term frequency and the inverse document frequency;
    generating the second word vector according to the priority value of each keyword.
  19. The storage medium according to claim 18, wherein, after generating the second word vector according to the priority value of each keyword, the method comprises:
    combining the first word vector and the second word vector to generate the target feature word vector.
  20. The storage medium according to claim 16, wherein inputting the target feature word vector into the preset drug classification model comprises:
    calculating a first Euclidean distance between the target feature word vector and different feature word vectors;
    comparing the first Euclidean distance with a preset first distance threshold;
    when the first Euclidean distance is less than the first distance threshold, clustering the target feature vector into the cluster set represented by the first Euclidean distance to generate a first-level cluster set.
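
The level-by-level clustering recited in the claims, where each level re-clusters the previous level's cluster sets using a smaller distance threshold, can be sketched as follows. This is only an illustration under stated assumptions: the two-dimensional toy vectors, the threshold values 3.0, 1.0, and 0.35, and the greedy rule that compares each vector with the first member of a cluster are hypothetical, and the claimed correction of the "effective point spacing" parameter is represented simply by passing a smaller threshold at each level.

```python
import math

def cluster_by_threshold(indices, vectors, threshold):
    """Greedy pass over `indices`: join the first cluster whose first member
    is closer than `threshold` (Euclidean distance), else open a new cluster."""
    clusters = []
    for i in indices:
        for members in clusters:
            if math.dist(vectors[i], vectors[members[0]]) < threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def multilevel_clusters(vectors, thresholds):
    """Return one list of clusters per level. Each level re-clusters the
    members of every cluster from the level above with the next threshold."""
    levels = []
    current = [list(range(len(vectors)))]  # start with everything in one set
    for threshold in thresholds:
        refined = []
        for members in current:
            refined.extend(cluster_by_threshold(members, vectors, threshold))
        levels.append(refined)
        current = refined
    return levels

# Hypothetical feature vectors and strictly decreasing thresholds.
vectors = [(0.0, 0.0), (0.4, 0.0), (1.5, 0.0), (10.0, 0.0), (10.3, 0.0)]
thresholds = [3.0, 1.0, 0.35]

levels = multilevel_clusters(vectors, thresholds)
for depth, clusters in enumerate(levels, start=1):
    print(f"level {depth}: {clusters}")
```

Because every level only splits clusters produced by the level above, the result forms a hierarchy: broad categories at the first level and progressively finer sub-categories at the second and third levels.
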
PCT/CN2019/117240 2019-09-18 2019-11-11 Pharmaceutical drug classification method and apparatus, computer device and storage medium WO2020220635A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SG11202008417RA SG11202008417RA (en) 2019-09-18 2019-11-11 Drug classificatiion method, device, computer, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910881521.5 2019-09-18
CN201910881521.5A CN110781298B (en) 2019-09-18 2019-09-18 Medicine classification method, apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2020220635A1 true WO2020220635A1 (en) 2020-11-05

Family

ID=69383808

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117240 WO2020220635A1 (en) 2019-09-18 2019-11-11 Pharmaceutical drug classification method and apparatus, computer device and storage medium

Country Status (3)

Country Link
CN (1) CN110781298B (en)
SG (1) SG11202008417RA (en)
WO (1) WO2020220635A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906395A (en) * 2021-03-26 2021-06-04 平安科技(深圳)有限公司 Drug relationship extraction method, device, equipment and storage medium
CN117316373A (en) * 2023-10-08 2023-12-29 医顺通信息科技(常州)有限公司 HIS-based medicine whole-flow supervision system and method thereof

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627566A (en) * 2020-05-22 2020-09-04 泰康保险集团股份有限公司 Indication information processing method and device, storage medium and electronic equipment
CN111738014B (en) * 2020-06-16 2023-09-08 北京百度网讯科技有限公司 Drug classification method, device, equipment and storage medium
CN111832661B (en) * 2020-07-28 2024-04-02 平安国际融资租赁有限公司 Classification model construction method, device, computer equipment and readable storage medium
CN112035664A (en) * 2020-08-28 2020-12-04 平安医疗健康管理股份有限公司 Medicine classification method and device and computer equipment
CN112466476A (en) * 2020-12-17 2021-03-09 贝医信息科技(上海)有限公司 Epidemiology trend analysis method and device based on medicine flow direction data
CN113488194B (en) * 2021-05-25 2023-04-07 四川大学华西医院 Medicine identification method and device based on distributed system
CN113569994A (en) * 2021-08-30 2021-10-29 平安医疗健康管理股份有限公司 Method, device, equipment and storage medium for identifying medical records of the same thunder
CN113470779B (en) * 2021-09-03 2021-11-26 壹药网科技(上海)股份有限公司 Medicine category identification method and system
TWI781856B (en) * 2021-12-16 2022-10-21 新加坡商鴻運科股份有限公司 Method for identifying medicine image, computer device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408631A (en) * 2018-09-03 2019-03-01 平安医疗健康管理股份有限公司 Drug data processing method, device, computer equipment and storage medium
US20190131007A1 (en) * 2017-11-02 2019-05-02 Ir2Dx, Inc. Systems and Methods for Providing Professional Treatment Guidance for Diabetes Patients
CN109830302A (en) * 2019-01-28 2019-05-31 北京交通大学 Medication mode excavation method, apparatus and electronic equipment
CN110223751A (en) * 2019-05-16 2019-09-10 平安科技(深圳)有限公司 Prescription evaluation method, system and computer equipment based on medical knowledge map
CN110245217A (en) * 2019-06-17 2019-09-17 京东方科技集团股份有限公司 A kind of drug recommended method, device and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5317783B2 (en) * 2009-03-25 2013-10-16 株式会社東芝 Drug information management device, drug information management method, and drug information management system
CN108831559B (en) * 2018-06-20 2021-01-15 清华大学 Chinese electronic medical record text analysis method and system
CN108875845B (en) * 2018-07-26 2024-02-20 广东数相智能科技有限公司 Medicine sorting device
CN110021439B (en) * 2019-03-07 2023-01-24 平安科技(深圳)有限公司 Medical data classification method and device based on machine learning and computer equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906395A (en) * 2021-03-26 2021-06-04 平安科技(深圳)有限公司 Drug relationship extraction method, device, equipment and storage medium
CN112906395B (en) * 2021-03-26 2023-08-15 平安科技(深圳)有限公司 Drug relation extraction method, device, equipment and storage medium
CN117316373A (en) * 2023-10-08 2023-12-29 医顺通信息科技(常州)有限公司 HIS-based medicine whole-flow supervision system and method thereof
CN117316373B (en) * 2023-10-08 2024-04-12 医顺通信息科技(常州)有限公司 HIS-based medicine whole-flow supervision system and method thereof

Also Published As

Publication number Publication date
CN110781298B (en) 2023-06-20
CN110781298A (en) 2020-02-11
SG11202008417RA (en) 2020-12-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19926795; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19926795; Country of ref document: EP; Kind code of ref document: A1)