CN114068028A - Medical inquiry data processing method and device, readable storage medium and electronic equipment


Info

Publication number
CN114068028A
Authority
CN
China
Prior art keywords
data
processed
nearest neighbor
value
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111371847.7A
Other languages
Chinese (zh)
Inventor
马伯毅
张誉丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taikang Insurance Group Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd filed Critical Taikang Insurance Group Co Ltd
Priority to CN202111371847.7A priority Critical patent/CN114068028A/en
Publication of CN114068028A publication Critical patent/CN114068028A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Abstract

The disclosure relates to the technical field of data processing, and provides a medical inquiry data processing method and device, a computer-readable storage medium and an electronic device. The method comprises the following steps: preprocessing inquiry data to be processed to determine a feature vector corresponding to the inquiry data to be processed; inputting the feature vector into a preset K nearest neighbor model to determine the top N most frequently occurring target categories among the categories to which the K samples nearest to the inquiry data to be processed belong; determining, in the feature vector corresponding to the inquiry data to be processed, the feature values corresponding to the target categories so as to generate a target feature vector corresponding to the inquiry data to be processed; and adjusting the K value in the preset K nearest neighbor model, and inputting the target feature vector into the preset K nearest neighbor model with the adjusted K value so as to classify the inquiry data to be processed. By performing secondary processing on the identification result of the K nearest neighbor model, the scheme can improve the accuracy of medical inquiry data classification.

Description

Medical inquiry data processing method and device, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a medical inquiry data processing method, a medical inquiry data processing apparatus, a computer-readable storage medium, and an electronic device.
Background
In an online internet hospital, doctors learn about patients' symptoms through image-and-text chat and put forward diagnosis opinions. Many diseases require multiple follow-up visits, and doctors often need to look back through the chat records to recall a patient's previous health status. Because the patient is not a professional doctor and cannot accurately describe his or her health problem in one or two sentences, a great amount of useless information is generated during the chat, which reduces the efficiency with which the doctor reviews the patient's condition.
Therefore, the medical inquiry data can be classified and stored, so that the review efficiency is improved.
The K nearest neighbor (KNN) model is simple and effective and is suitable for automatic classification in class domains with large sample sizes. However, the classification accuracy of the existing K nearest neighbor model is limited by the selection of the K value.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide a medical inquiry data processing method and apparatus, a computer-readable storage medium, and an electronic device, so as to alleviate, at least to some extent, the problems of low review efficiency and low inquiry data classification accuracy.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a medical inquiry data processing method, including: preprocessing inquiry data to be processed to determine a characteristic vector corresponding to the inquiry data to be processed; inputting the feature vectors into a preset K nearest neighbor model to determine the top N target categories with the most frequent occurrence in the categories to which the K samples adjacent to the inquiry data to be processed belong; determining a characteristic value corresponding to the target category in a characteristic vector corresponding to the to-be-processed inquiry data so as to generate a target characteristic vector corresponding to the to-be-processed inquiry data; adjusting a K value in the preset K nearest neighbor model, and inputting a target feature vector corresponding to the inquiry data to be processed into the preset K nearest neighbor model after the K value is adjusted so as to classify the inquiry data to be processed.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the feature vector corresponding to the to-be-processed inquiry data is used to indicate the occurrence frequency of the keyword corresponding to each preset category in the to-be-processed inquiry data; the preprocessing is performed on the inquiry data to be processed to determine the characteristic vector corresponding to the inquiry data to be processed, and the preprocessing comprises the following steps: performing word segmentation processing on the inquiry data to be processed; matching the word segmentation result of the inquiry data to be processed with the keywords in the word bank corresponding to each preset category to determine the occurrence frequency of the keywords corresponding to each preset category in the inquiry data to be processed; and obtaining a characteristic vector corresponding to the inquiry data to be processed according to the occurrence frequency.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the preset K-nearest neighbor model is predetermined by: obtaining a plurality of initial K nearest neighbor models based on different preset initial K values; acquiring a training data set, and dividing the training data set into a test data set and a sample data set; respectively determining the identification accuracy of each initial K nearest neighbor model according to the test data set and the sample data set; selecting a K nearest neighbor model with the highest identification accuracy from the plurality of initial K nearest neighbor models to obtain the preset K nearest neighbor model; for each initial K nearest neighbor model, determining the identification accuracy of the initial K nearest neighbor model by executing the following processes: inputting the test data into the initial K nearest neighbor model aiming at each test data in the test data set so as to determine the top N categories to be selected, which have the highest frequency, in the categories of K sample data adjacent to the test data; determining a characteristic value corresponding to the category to be selected in the characteristic vector corresponding to the test data to generate a target characteristic vector corresponding to the test data; adjusting a preset initial K value in the initial K nearest neighbor model to obtain a target K value; determining an identification category corresponding to the test data based on a target feature vector of the test data and the target K value by taking the sample data set as a sample in the initial K nearest neighbor model; and determining the identification accuracy of the initial K nearest neighbor model based on the identification category corresponding to each piece of test data and the label corresponding to each piece of test data.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the adjusting a preset initial K value in the initial K nearest neighbor model to obtain a target K value includes: and reducing a preset initial K value in the initial K nearest neighbor model to obtain a target K value corresponding to the preset initial K value.
In an exemplary embodiment of the disclosure, based on the foregoing scheme, the adjusting the K value in the preset K nearest neighbor model includes: acquiring a target K value corresponding to the initial K value in the preset K nearest neighbor model; adjusting the K value in the preset K nearest neighbor model from the initial K value to the target K value; and the target K value is obtained by adjusting a preset initial K value corresponding to the preset K nearest neighbor model in the model training process.
In an exemplary embodiment of the present disclosure, the preset categories include one or more of a description of a condition, a diagnosis result, a drug use, and a related examination, based on the foregoing scheme.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the method further includes: storing the classification result of the to-be-processed inquiry data in a database; when the to-be-processed inquiry data has a tag attribute, selecting, from the database, other inquiry data that belongs to the same inquiring patient as the to-be-processed inquiry data and has the same category; displaying the other inquiry data in a target client to prompt a user of the target client whether to tag the other inquiry data; and in response to a tagging operation for the other inquiry data, adding a tag to the other inquiry data in the database.
According to a second aspect of the present disclosure, there is provided a medical inquiry data processing apparatus comprising: a feature extraction module configured to preprocess inquiry data to be processed so as to determine a feature vector corresponding to the inquiry data to be processed; an initial classification module configured to input the feature vector into a preset K nearest neighbor model so as to determine the top N most frequently occurring target categories among the categories to which the K samples nearest to the to-be-processed inquiry data belong; a target feature determination module configured to determine, in the feature vector corresponding to the to-be-processed inquiry data, the feature values corresponding to the target categories so as to generate a target feature vector corresponding to the to-be-processed inquiry data; and a secondary classification module configured to adjust the K value in the preset K nearest neighbor model and input the target feature vector corresponding to the to-be-processed inquiry data into the preset K nearest neighbor model with the adjusted K value so as to classify the to-be-processed inquiry data.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the medical inquiry data processing method as described in the first aspect of the embodiments above.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the medical inquiry data processing method as described in the first aspect of the embodiments above.
As can be seen from the foregoing technical solutions, the medical inquiry data processing method, the medical inquiry data processing apparatus, and the computer-readable storage medium and the electronic device for implementing the medical inquiry data processing method in the exemplary embodiment of the present disclosure have at least the following advantages and positive effects:
in the technical solutions provided by some embodiments of the present disclosure, to-be-processed inquiry data is first preprocessed to determine a feature vector corresponding to the to-be-processed inquiry data; the feature vector is then input into a preset K nearest neighbor model to determine the top N most frequently occurring target categories among the categories to which the K samples nearest to the to-be-processed inquiry data belong, and, in the feature vector corresponding to the to-be-processed inquiry data, the feature values corresponding to the target categories are determined to generate a target feature vector corresponding to the to-be-processed inquiry data; finally, the K value in the preset K nearest neighbor model is adjusted, and the target feature vector is input into the preset K nearest neighbor model with the adjusted K value so as to classify the to-be-processed inquiry data. Compared with the related art, on one hand, performing secondary processing on the classification result of the K nearest neighbor model can improve the classification accuracy of the medical inquiry data; on the other hand, classifying the medical inquiry data can assist doctors in quickly checking the corresponding inquiry records under different classifications, thereby improving review efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 shows a flowchart of a medical inquiry data processing method in an exemplary embodiment of the present disclosure;
FIG. 2 shows a flowchart of a method of determining a feature vector of to-be-processed inquiry data in an exemplary embodiment of the present disclosure;
FIG. 3 shows a flowchart of a method of determining a preset K nearest neighbor model in an exemplary embodiment of the present disclosure;
FIG. 4 shows a flowchart of a method of determining the identification accuracy of an initial K nearest neighbor model in an exemplary embodiment of the present disclosure;
FIG. 5 shows a flowchart of a method of tagging classified inquiry data in an exemplary embodiment of the present disclosure;
FIG. 6 shows a schematic structural diagram of a medical inquiry data processing apparatus in an exemplary embodiment of the present disclosure;
FIG. 7 shows a schematic structural diagram of a computer storage medium in an exemplary embodiment of the present disclosure; and
FIG. 8 shows a schematic structural diagram of an electronic device in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of their objects.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The online internet hospitals are gradually popularized, and for many users who do not have time or are inconvenient to go to the hospitals to see a doctor, the users can chat with doctors through the online internet hospitals, and the doctors can know diseases of patients through the image-text chat to put forward diagnosis opinions.
In reality, many diseases require multiple repeated diagnoses. For online internet hospitals, physicians often need to look up previous chat records to recall the patient's previous health status. In the chat process, because the patient is not a professional doctor and cannot accurately describe the health problem of the patient through one or two sentences, a large amount of useless information is generated in the chat process, and the disease reviewing efficiency of the doctor is affected.
Therefore, the chat contents can be analyzed and similar chat contents can be intelligently grouped, so that when the doctor needs to check particular contents, the doctor can be prompted with the chat contents that may need to be checked, improving the doctor's review efficiency.
Machine learning classification algorithms are widely applied, and common ones include the SVM (Support Vector Machine) algorithm and the LR (Logistic Regression) algorithm. However, when the amount of sample data is very large, the SVM algorithm consumes a large amount of memory and computation time; the volume of doctor-patient chat data is huge, so the SVM algorithm is too heavy for this task. The LR algorithm outputs a probability between 0 and 1 instead of a definite class, which can easily cause confusion in data classification and also tends to lead to an under-fitting problem.
The K nearest neighbor model, by contrast, is computationally simple and effective and is suitable for automatic classification in class domains with large sample sizes. However, the accuracy of the conventional K nearest neighbor model depends to a great extent on the selection of the K value, so the accuracy of its classification results is still limited.
In an embodiment of the present disclosure, a medical inquiry data processing method is provided to overcome at least some of the above-mentioned shortcomings in the related art.
Fig. 1 shows a flow chart of a medical interrogation data processing method in an exemplary embodiment of the disclosure. Referring to fig. 1, the method includes:
step S110, preprocessing inquiry data to be processed to determine a characteristic vector corresponding to the inquiry data to be processed;
step S120, inputting the feature vectors into a preset K nearest neighbor model to determine the top N target categories with the highest frequency of occurrence in the categories to which the K samples adjacent to the inquiry data to be processed belong;
step S130, determining a characteristic value corresponding to the target category in a characteristic vector corresponding to the to-be-processed inquiry data so as to generate a target characteristic vector corresponding to the to-be-processed inquiry data;
step S140, adjusting a K value in the preset K nearest neighbor model, and inputting the target feature vector corresponding to the to-be-processed inquiry data into the preset K nearest neighbor model after the K value is adjusted, so as to classify the to-be-processed inquiry data.
In the technical scheme provided by the embodiment shown in fig. 1, firstly, preprocessing inquiry data to be processed to determine a feature vector corresponding to the inquiry data to be processed, then, inputting the feature vector into a preset K nearest neighbor model to determine the first N target categories with the highest frequency of occurrence in categories to which K samples adjacent to the inquiry data to be processed belong, and determining a feature value corresponding to the target category in the feature vector corresponding to the inquiry data to be processed to generate a target feature vector corresponding to the inquiry data to be processed; and finally, adjusting the K value in the preset K nearest neighbor model, and inputting the target characteristic vector into the preset K nearest neighbor model after the K value is adjusted so as to classify the inquiry data to be processed. Compared with the related art, on one hand, the classification accuracy of the medical inquiry data can be improved by carrying out secondary processing on the classification result of the K nearest neighbor model; on the other hand, medical inquiry data are classified, so that doctors can be assisted to quickly check corresponding inquiry records from different classifications, and further the review efficiency is improved.
The following detailed description of the various steps in the example shown in fig. 1:
step S110, preprocessing the inquiry data to be processed to determine the characteristic vector corresponding to the inquiry data to be processed.
In an exemplary embodiment, the to-be-processed inquiry data may include chat record information in an online medical inquiry process, where the chat record information may be in a picture format, a text format, a voice format, or the like, and this exemplary embodiment is not particularly limited to this.
The feature vectors corresponding to the to-be-processed inquiry data are used for indicating the occurrence frequency of the keywords corresponding to each preset category in the to-be-processed inquiry data. Wherein the preset categories may include one or more of a description of a condition, a diagnosis, a medication use, a related examination. For example, when the preset categories include disease description, diagnosis result, drug use, and related examination, it is finally determined to which category of the 4 preset categories, i.e., disease description, diagnosis result, drug use, and related examination, the to-be-processed inquiry data belongs. Of course, the preset category may also be configured according to actual requirements, and this is not particularly limited in this exemplary embodiment.
For example, fig. 2 is a schematic flow chart illustrating a method for determining feature vectors of to-be-processed inquiry data in an exemplary embodiment of the disclosure. Referring to fig. 2, the method may include steps S210 to S230. Wherein:
in step S210, word segmentation is performed on the to-be-processed inquiry data.
For example, performing word segmentation on the inquiry data to be processed may be understood as performing word segmentation on text information corresponding to the inquiry data to be processed.
When the inquiry data to be processed is a picture, text recognition can be performed on the picture to determine text information in the picture, and then word segmentation processing is performed on the recognized text information. When the inquiry data to be processed is voice, the voice data can be converted into a text, and then the converted text is subjected to word segmentation.
The method for performing word segmentation processing on the inquiry data to be processed may refer to the existing word segmentation technology, which is not described in detail in this exemplary embodiment.
In step S220, the word segmentation result of the to-be-processed inquiry data is matched with the keywords in the word bank corresponding to each preset category, so as to determine the occurrence frequency of the keywords corresponding to each preset category in the to-be-processed inquiry data.
After the word segmentation processing is performed on the inquiry data to be processed, the words in the acquired inquiry data to be processed can be matched with the keywords in the word bank corresponding to each preset category.
For example, word segmentation and extraction may be performed on existing medical inquiry data in advance to generate a word bank for each preset category, where the word bank of each preset category stores the keywords corresponding to the medical inquiry data of that category. For example, medical inquiry chat information is obtained from a database storing such information, word segmentation is performed on it, and each resulting word is assigned, according to its meaning or specific content, to one of the categories of disease description, diagnosis result, drug use and related examination, so as to generate the word bank of each category. For the word bank corresponding to the drug use category, the names of common drugs or all drug nouns may be stored as keywords; for the word bank corresponding to the related examination category, the names of common examinations may be stored as keywords in advance. The generation method of the word bank corresponding to each preset category is not limited to this.
For example, a thesaurus corresponding to each preset category may be acquired in step S220. And then words obtained after word segmentation processing is carried out on the inquiry data to be processed are respectively matched with the keywords in the word bank corresponding to each preset category, and the times of successful matching of the keywords in the word bank corresponding to each preset category in all the words in the inquiry data to be processed are counted, so that the times can be understood as the occurrence frequency of the keywords corresponding to each preset category in the inquiry data to be processed.
In step S230, a feature vector corresponding to the to-be-processed inquiry data is obtained according to the frequency of occurrence.
In an exemplary embodiment, the occurrence frequency of the keyword corresponding to a certain preset category in the to-be-processed inquiry data may be used to represent a characteristic value of the to-be-processed inquiry data in the preset category. The characteristic values of the to-be-processed inquiry data under each preset category can be combined together and expressed by using one characteristic vector, so that the characteristic vector corresponding to the to-be-processed inquiry data is generated.
For example, the specific implementation manner of step S230 may be: and determining the characteristic vector corresponding to the inquiry data to be processed according to the characteristic value corresponding to each preset category and the preset representation sequence of the characteristic value of each preset category in the characteristic vector.
For example, suppose it is specified in advance that the order of the feature values of the preset categories in the feature vector is disease description, diagnosis result, drug use, and related examination. If 5 words are obtained after word segmentation of the to-be-processed inquiry data, of which 2 words are successfully matched in the word bank corresponding to the drug use category, 2 words are successfully matched in the word bank corresponding to the disease description category, 0 words are matched in the word bank corresponding to the diagnosis result category, and 0 words are matched in the word bank corresponding to the related examination category, then the feature vector corresponding to the to-be-processed inquiry data is [2, 0, 2, 0].
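A minimal sketch of this keyword-frequency feature extraction is shown below. It is illustrative only: the category names, the lexicon contents, and the regex tokenizer are assumptions standing in for the word banks and the word-segmentation tool the embodiment would actually use.

```python
import re

# Assumed category order; the example order in the description is disease description,
# diagnosis result, drug use, related examination.
CATEGORY_ORDER = ["disease_description", "diagnosis_result", "drug_use", "related_examination"]

# Hypothetical per-category keyword banks; in practice these are built from
# historical inquiry data as described in step S220.
LEXICONS = {
    "disease_description": {"headache", "fever", "cough"},
    "diagnosis_result": {"cold", "flu", "gastritis"},
    "drug_use": {"ibuprofen", "amoxicillin"},
    "related_examination": {"ct", "ultrasound"},
}

def tokenize(text: str) -> list[str]:
    # Stand-in for the word-segmentation step (S210); a real system would use a
    # proper (e.g. Chinese) word-segmentation tool here.
    return re.findall(r"[a-z]+", text.lower())

def feature_vector(text: str) -> list[int]:
    # S220/S230: count keyword matches per preset category and arrange the
    # counts in the agreed category order.
    tokens = tokenize(text)
    return [sum(1 for t in tokens if t in LEXICONS[c]) for c in CATEGORY_ORDER]

print(feature_vector("patient reports fever and cough, prescribed ibuprofen"))
# prints [2, 0, 1, 0]
```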
With continuing reference to fig. 1, next, in step S120, the feature vectors are input into a preset K nearest neighbor model to determine the top N most frequently appearing target categories among the categories to which the K samples adjacent to the to-be-processed inquiry data belong.
The K nearest neighbor model, i.e., the KNN (K-nearest neighbor) algorithm model, classifies each piece of data to be classified according to its K nearest neighbors. In other words, the distance from the data to be classified to each sample point of known class is calculated, the distances are sorted, the first K sample points closest to the data to be classified are selected, the frequency of occurrence of the classes to which these K sample points belong is counted, and the class with the highest frequency of occurrence is used as the classification result of the data to be classified.
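For reference, a plain single-pass K nearest neighbor vote (before the secondary processing introduced by this disclosure) can be sketched as follows; the representation of known samples as (feature vector, label) pairs is an assumption of the sketch.

```python
import math
from collections import Counter

def euclidean(x: list[float], y: list[float]) -> float:
    # Distance between two feature vectors of equal length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(x: list[float], samples: list[tuple[list[float], str]], k: int) -> str:
    # samples: (feature_vector, category_label) pairs of known class.
    # Keep the k nearest samples and return the most frequent category among them.
    nearest = sorted(samples, key=lambda s: euclidean(x, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Example vote with three known samples and k = 1.
samples = [
    ([2, 0, 2, 0], "drug_use"),
    ([3, 0, 1, 0], "disease_description"),
    ([0, 4, 0, 1], "diagnosis_result"),
]
print(knn_classify([2, 0, 2, 1], samples, k=1))  # prints drug_use
```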
In the K nearest neighbor model, the selection of the K value has a crucial influence on classification accuracy. If the K value is too large, more interfering samples are included, which reduces the classification accuracy of the model; if the K value is too small, overfitting tends to occur.
According to the method and the device, the initial classification result of the K nearest neighbor model is secondarily classified, the K value is selected according to the accuracy rate after secondary classification, and then the preset K nearest neighbor model is determined, so that the classification accuracy of the K nearest neighbor model is improved.
For example, fig. 3 shows a flowchart of a method for determining a preset K-nearest neighbor model in an exemplary embodiment of the disclosure, and referring to fig. 3, the method may include steps S310 to S340.
In step S310, a plurality of initial K nearest neighbor models are obtained based on different preset initial K values.
For example, a plurality of different preset initial K values may be preconfigured to correspondingly obtain a plurality of initial K nearest neighbor models. If the preset initial K values are 7, 8, 9 and 10, respectively, four initial K nearest neighbor models are determined. The number of initial K nearest neighbor models and the preset initial K value of each initial K nearest neighbor model may be configured according to requirements, which is not particularly limited in the present exemplary embodiment.
Next, in step S320, a training data set is obtained, and the training data set is divided into a test data set and a sample data set.
For example, historical inquiry chat information may be obtained from a storage database of inquiry chat record information, and each piece of historical inquiry record information is labeled to determine a category corresponding to each piece of historical inquiry record information, so as to generate a category label corresponding to each piece of inquiry record information. The historical inquiry chat log information may then be processed in a similar manner as described above in steps S210-S230 to determine a feature vector corresponding to each historical inquiry chat log information. That is, the manner of determining the feature vector corresponding to the historical inquiry chat history information may refer to the above steps S210 to S230, and is not described herein again.
In other words, each training data in the training data set may include a feature vector and a label to which the feature vector corresponds. After the training data set is obtained, the training data set may be divided into a test data set and a sample data set. The division ratio of the test data set and the sample data set can be determined according to requirements, for example, 60% is used as the test data set, and 40% is used as the sample data set. The present exemplary embodiment is not particularly limited in this regard. The test data set is used to test the identification accuracy of each initial K-nearest neighbor model in subsequent steps, and the sample data set may be used as sample points in each initial K-nearest neighbor model.
Next, in step S330, the identification accuracy of each initial K nearest neighbor model is respectively determined according to the test data set and the sample data set.
For example, the specific implementation manner of step S330 may be understood as taking the data in the test data set as the data to be classified, and taking the data in the sample data set as the sample point in each initial K-nearest neighbor model. And classifying each test data by using the sample points in each initial K nearest neighbor model to determine the identification accuracy of each initial K nearest neighbor model.
For each initial K-nearest neighbor model, a method for determining the identification accuracy thereof may be as shown in fig. 4, and the method may include steps S410 to S450. Wherein:
in step S410, for each test data in the test data set, the test data is input into the initial K nearest neighbor model to determine the top N categories to be selected, which appear most frequently among the categories to which K sample data adjacent to the test data belong.
In an exemplary embodiment, N is greater than or equal to 2 and less than the number of preset categories, and N is a positive integer, for example, if it is necessary to determine the category to which the inquiry data to be processed belongs among the 4 preset categories, N is greater than or equal to 2 and less than 4. The sample points in the initial K-nearest neighbor model may include data in the sample data set determined in step S320 described above.
For example, a specific implementation manner of step S410 may be that, for each test data in the test data set, the test data is input into the initial K-nearest neighbor model, a distance between the test data and each sample point in the initial K-nearest neighbor model is calculated, and the top N top candidate categories with the highest frequency of occurrence in the categories to which the top K sample points closest to the test data belong are determined according to the distance.
The distance metric may include any one of a Euclidean distance, a Manhattan distance, a Chebyshev distance, a normalized Euclidean distance, a Mahalanobis distance, a Hamming distance, a Bhattacharyya distance, and the like, which is not particularly limited in the present exemplary embodiment.
For example, suppose the feature vector corresponding to the test data is x = [4, 5, 2, 1] and the feature vector corresponding to a certain sample point is y = [y_1, y_2, y_3, y_4]. Taking the Euclidean distance as the distance metric, the distance between the test data and the sample point is

d(x, y) = \sqrt{(4 - y_1)^2 + (5 - y_2)^2 + (2 - y_3)^2 + (1 - y_4)^2}
Taking an initial K nearest neighbor model with a preset initial K value of 8 as an example, the test data is input into the initial K nearest neighbor model, the distance between the test data and each sample point in the initial K nearest neighbor model is calculated, the distances corresponding to the sample points are sorted in descending or ascending order, and the first 8 sample points with the smallest distance to the test data are selected. The frequency of occurrence of the categories to which these 8 sample points belong is counted. Taking N = 2 as an example, the 2 categories occurring most frequently among the categories to which the 8 sample points belong are selected as the categories to be selected.
If, among the 8 sample points, 3 sample points belong to the disease description category (that is, the frequency of occurrence of the disease description category is 3), 4 sample points belong to the diagnosis result category (frequency of occurrence 4), 1 sample point belongs to the drug use category (frequency of occurrence 1), and 0 sample points belong to the related examination category (frequency of occurrence 0), then the diagnosis result category and the disease description category may be taken as the categories to be selected.
In step S420, in the feature vector corresponding to the test data, a feature value corresponding to the category to be selected is determined, so as to generate a target feature vector corresponding to the test data.
Continuing to take the above feature vector corresponding to the test data as [4,5,2,1] as an example, where 4 represents a feature value corresponding to a disease description category, 5 represents a feature value corresponding to a diagnosis result category, 2 represents a feature value corresponding to a drug use category, and 1 represents a feature value corresponding to a relevant examination category. After the category to be selected is determined, only the feature value corresponding to the category to be selected can be reserved in the feature vector corresponding to the test data, so that the target feature vector corresponding to the test data is generated.
Continuing with the above example that the candidate categories include the diagnosis result category and the disease description category, the target feature vector corresponding to the test data may be represented as [4, 5, 0, 0].
For example, the specific implementation manner of step S420 may be that, in the feature vector corresponding to the test data, the feature value corresponding to the to-be-selected category is retained to generate the target feature vector corresponding to the test data.
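A small sketch of this masking step, which retains only the feature values of the candidate categories and sets the others to 0; the index-based representation of the candidate categories is an assumption of the sketch.

```python
def mask_to_candidates(vector: list[float], candidate_idx: set[int]) -> list[float]:
    # Keep the feature values of the candidate categories; zero the rest so that
    # only the candidate categories contribute to later distance calculations.
    return [v if i in candidate_idx else 0 for i, v in enumerate(vector)]

# Candidate categories: disease description (index 0) and diagnosis result (index 1).
print(mask_to_candidates([4, 5, 2, 1], {0, 1}))  # prints [4, 5, 0, 0]
```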
Next, in step S430, a preset initial K value in the initial K nearest neighbor model is adjusted to obtain a target K value.
For example, a specific implementation manner of adjusting the preset initial K value in the initial K nearest neighbor model to obtain the target K value may be to reduce the preset initial K value in the initial K nearest neighbor model to obtain the target K value corresponding to the preset initial K value.
After the target K value is obtained, in step S440, the sample data set is used as a sample in the initial K nearest neighbor model, and an identification category corresponding to the test data is determined based on the target feature vector corresponding to the test data and the target K value.
For example, the preset initial K value may be adjusted to half of its original value to obtain the target K value; taking a preset initial K value of 8 as an example, a target K value of 4 is obtained. Then, the distances between the test data and the sample points are calculated from the feature values in the target feature vector of the test data and the feature values corresponding to the candidate categories in the sample points of the initial K nearest neighbor model, the distances are sorted, the first 4 sample points with the smallest distances are selected, and the category with the highest frequency of occurrence among the categories to which these 4 sample points belong is determined as the identification category corresponding to the test data, that is, the category identified for the test data by the initial K nearest neighbor model.
Continuing with the example in which the feature vector corresponding to the test data is [4, 5, 2, 1] and the candidate categories are the diagnosis result category and the disease description category, so that the target feature vector of the test data is [4, 5, 0, 0], and again taking the Euclidean distance as the distance metric, the distance between the test data and a sample point y = [y_1, y_2, y_3, y_4] computed in step S440 is

d = \sqrt{(4 - y_1)^2 + (5 - y_2)^2}

which can also be expressed as

d = \sqrt{(4 - y_1)^2 + (5 - y_2)^2 + (0 - 0)^2 + (0 - 0)^2}

That is, the feature values corresponding to the candidate categories may be selected directly from the test data and the sample points, i.e., only those feature values are retained before the distance between the test data and the sample point is calculated; alternatively, the feature values of the other categories in both the test data and the sample points may be set to 0, which formally deletes the features other than those corresponding to the candidate categories, and the distance is then calculated over the full vectors. The identification category corresponding to the test data is then determined based on the calculated distances and the target K value.
With continuing reference to fig. 4, in step S450, the identification accuracy of the initial K-nearest neighbor model is determined based on the identification category corresponding to each of the test data and the label corresponding to each of the test data.
For example, for each test datum, in the above steps S410 to S440, the identification category of the test datum is obtained based on the initial K-nearest neighbor model. Then, the identification category of each test data and the corresponding label thereof may be matched or compared, and whether the two are the same or not is judged, and if the two are the same, it may be determined that the identification of the test data is successful. And finally, determining the identification accuracy of the initial K nearest neighbor model according to the percentage of the successfully identified test data in all the test data.
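Putting steps S410 to S450 together, the identification accuracy of one initial K nearest neighbor model could be evaluated roughly as sketched below. The data format (feature vector plus category label), the helper names, and the use of the Euclidean distance are assumptions of the sketch, not requirements of the method.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def top_n_categories(x, samples, k, n):
    # S410: among the categories of the k samples nearest to x, keep the n most frequent.
    nearest = sorted(samples, key=lambda s: euclidean(x, s[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label for label, _ in counts.most_common(n)}

def mask(vector, candidates, category_order):
    # S420: keep only the feature values of the candidate categories, zero the rest.
    return [v if category_order[i] in candidates else 0 for i, v in enumerate(vector)]

def two_stage_classify(x, samples, k_initial, k_target, n, category_order):
    candidates = top_n_categories(x, samples, k_initial, n)            # first pass
    x_masked = mask(x, candidates, category_order)
    masked = [(mask(v, candidates, category_order), lbl) for v, lbl in samples]
    nearest = sorted(masked, key=lambda s: euclidean(x_masked, s[0]))[:k_target]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]     # second pass (S440)

def identification_accuracy(test_set, sample_set, k_initial, k_target, n, category_order):
    # S450: fraction of test items whose second-pass category equals their label.
    hits = sum(
        1
        for vec, label in test_set
        if two_stage_classify(vec, sample_set, k_initial, k_target, n, category_order) == label
    )
    return hits / len(test_set)
```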
Furthermore, for the same preset initial K value, a plurality of target K values may be determined respectively, and then the target K value with the highest accuracy after secondary identification of the plurality of target K values is selected as the final corresponding target K value of the preset initial K value.
For example, in the above step S430, the preset initial K value K may first be reduced to K/2 to obtain a first target K value; for instance, the preset initial K value 8 is reduced to 4, and 4 is taken as the first target K value corresponding to the preset initial K value 8. The recognition accuracy when the target K value is 4 is then determined through the above steps S440 and S450. Returning to step S430, the target K value is increased to K/2 + 1, i.e., 5, as the second target K value, and steps S440 and S450 are repeated to determine the recognition accuracy when the target K value is 5. Then, the identification accuracy corresponding to the first target K value is compared with the identification accuracy corresponding to the second target K value, and whether to continue to determine a third target K value is decided according to the magnitude relation between the two.
If the recognition accuracy of the first target K value is greater than that of the second target K value, the third target K value does not need to be determined; if the recognition accuracy of the first target K value is less than that of the second target K value, the target K value may be increased to K/2 + 2 to obtain a third target K value, the recognition accuracy of the third target K value is compared with that of the second target K value, and so on, until the recognition accuracy corresponding to the (M+1)th target K value is smaller than that of the Mth target K value, at which point the Mth target K value can be determined as the final target K value corresponding to the preset initial K value.
It should be noted that if the (M+1)th target K value is the preset initial K value minus 1, i.e., K-1, and the recognition accuracy corresponding to the (M+1)th target K value is greater than that of the Mth target K value, then continuing the adjustment would make the target K value equal to the preset initial K value; in this case the target K value does not need to be increased again, and K-1 is directly determined as the target K value corresponding to the preset initial K value.
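The search over target K values described above (start at roughly half the preset initial K value, increase by one while the recognition accuracy keeps improving, and never reach the initial K value) might be sketched as follows; `accuracy_of` stands for any caller-supplied evaluation function, for example the `identification_accuracy` helper sketched earlier with all other arguments fixed.

```python
def find_target_k(k_initial: int, accuracy_of) -> tuple[int, float]:
    # accuracy_of(k_target) -> identification accuracy when the second-stage
    # classification uses k_target (cf. steps S440 and S450).
    k = max(1, k_initial // 2)            # first target K value, about K/2
    best_k, best_acc = k, accuracy_of(k)
    while k + 1 < k_initial:              # the target K value never reaches the initial K
        acc = accuracy_of(k + 1)
        if acc <= best_acc:               # stop once accuracy no longer improves
            break
        k, best_k, best_acc = k + 1, k + 1, acc
    return best_k, best_acc
```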
After the identification accuracy of each initial K-nearest neighbor model is obtained through the above steps S410 to S450, next, referring to fig. 3, in step S340, the K-nearest neighbor model with the highest identification accuracy is selected from the multiple initial K-nearest neighbor models to obtain the preset K-nearest neighbor model.
For example, when comparing the recognition accuracy rates of the plurality of initial K nearest neighbor models, the recognition accuracy rate corresponding to the final target K value of each initial K nearest neighbor model may be used for the comparison, so that the initial K nearest neighbor model with the highest recognition accuracy is determined as the preset K nearest neighbor model.
It should be noted that the condition that the recognition accuracy of the finally determined preset K nearest neighbor model needs to satisfy may be set according to the service requirement; for example, if the service requires a recognition accuracy greater than 75%, the recognition accuracy of the finally determined preset K nearest neighbor model needs to be greater than or equal to 75%. When determining the preset K nearest neighbor model, a plurality of groups of test data sets and sample data sets may also be obtained according to different partition modes or partition strategies, and each group of test data set and sample data set is used to test the initial K nearest neighbor models multiple times, so as to determine the model with the highest identification accuracy as the preset K nearest neighbor model.
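Combining the pieces, selecting the preset K nearest neighbor model (step S340) then amounts to picking, over the candidate initial K values, the model whose tuned target K value yields the highest identification accuracy. The sketch below assumes the `find_target_k` helper from the previous sketch and a caller-supplied `accuracy_for(k_initial, k_target)` evaluation function.

```python
def select_preset_model(candidate_initial_ks, accuracy_for):
    # accuracy_for(k_initial, k_target) -> identification accuracy on the test data set.
    # Returns (best initial K value, its target K value, the accuracy achieved).
    best = None
    for k_init in candidate_initial_ks:
        k_target, acc = find_target_k(k_init, lambda kt, ki=k_init: accuracy_for(ki, kt))
        if best is None or acc > best[2]:
            best = (k_init, k_target, acc)
    return best

# Example: try the preset initial K values 7, 8, 9 and 10 mentioned above.
# best_k_init, best_k_target, best_acc = select_preset_model([7, 8, 9, 10], accuracy_for)
```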
After the preset K nearest neighbor model is determined, the preset K nearest neighbor model can be used for classifying and identifying the inquiry data to be processed.
For example, in step S120, the to-be-processed inquiry data may be first identified by using the initial K value of the preset K nearest neighbor model, so as to determine the target class from the preset classes.
For example, after the feature vector corresponding to the inquiry data to be processed is determined in step S110, the distance between the feature vector and the feature vector corresponding to each sample point in the preset K nearest neighbor model, such as the euclidean distance, may be calculated, and then the calculated distances may be sorted to determine the first K sample points with the smallest distance, where K is the initial K value of the preset K nearest neighbor model. And determining the top N target categories with the most frequent occurrence in the categories to which the K sample points belong.
Taking the initial K value as 8 and N as 2 as an example, 8 sample points with the minimum distance from the to-be-processed inquiry data are determined, and the top 2 categories with the highest frequency of occurrence are selected as target categories according to the categories to which the 8 sample points belong. For example, if the frequency of occurrence of the disease description category in the categories to which the 8 sample points belong is 3, the frequency of occurrence of the diagnosis result category is 3, the frequency of occurrence of the drug use category is 2, and the frequency of occurrence of the relevant examination category is 0, the disease description and the diagnosis result may be determined as the target category.
Next, in step S130, in the feature vector corresponding to the to-be-processed inquiry data, a feature value corresponding to the target category is determined, so as to generate a target feature vector corresponding to the to-be-processed inquiry data.
For example, after the target category is determined, the feature value corresponding to the target category may be retained in the feature vector corresponding to the to-be-processed inquiry data, and the feature values corresponding to other categories may be configured in an invalid state or set to 0, so as to perform distance calculation based on the target feature vector in step S140. Wherein the feature values configured in the invalid state do not participate in the distance calculation.
If the feature vector of the to-be-processed inquiry data is [6,3,4,1], where 6 and 4 are feature values corresponding to the target category, respectively, the target feature vector of the to-be-processed inquiry data may be [6,0,4,0], or feature values 3 and 1 in the target feature vector of the to-be-processed inquiry data are configured to be in an invalid state, so that they do not participate in the distance calculation in step S140.
Next, in step S140, adjusting a K value in the preset K nearest neighbor model, and inputting the target feature vector into the K value-adjusted preset K nearest neighbor model to classify the to-be-processed inquiry data.
In an exemplary embodiment, the specific implementation manner of adjusting the K value in the preset K-nearest neighbor model may be to reduce the K value in the K-nearest neighbor model. The amount of reduction of the K value may be determined according to user requirements, for example, the K value is reduced to half of the initial K value.
In another exemplary embodiment, the process of determining the preset K-nearest neighbor model described in fig. 3 above may be understood as a model training process, and based on this, a specific embodiment of adjusting the K value in the preset K-nearest neighbor model may also be: acquiring a target K value corresponding to the initial K value in the preset K nearest neighbor model; adjusting the K value in the preset K nearest neighbor model from an initial K value to the target K value; and the target K value is obtained by adjusting a preset initial K value corresponding to the preset K nearest neighbor model in the model training process.
In other words, after the predetermined K-nearest neighbor model is determined, the initial K value of the predetermined K-nearest neighbor model and the target K value corresponding to the initial K value are determined accordingly, in step S120, the to-be-processed inquiry data may be classified for the first time by using the initial K value to determine the target class, and then the target feature vector corresponding to the to-be-processed inquiry data is determined based on the target class. Then, in step S140, the to-be-processed inquiry data is classified for the second time by using the target feature vector and the target K value, so as to determine the category to which the to-be-processed inquiry data belongs.
For example, the specific implementation process of performing the second classification on the to-be-processed inquiry data according to the target feature vector and the target K value to determine the category to which it belongs may be as follows: the feature values corresponding to the target categories are selected from the feature vector of each sample point of the preset K nearest neighbor model to obtain a target feature vector corresponding to each sample point; the distance between the target feature vector of each sample point and the target feature vector of the to-be-processed inquiry data is then calculated, the distances are sorted, and the category with the highest frequency of occurrence among the categories to which the first K sample points with the smallest distances belong is determined as the category to which the to-be-processed inquiry data finally belongs. The specific implementation of determining the target feature vector corresponding to each sample point is identical to that of determining the target feature vector corresponding to the to-be-processed inquiry data, and is not repeated here.
For example, suppose the feature vector of the inquiry data to be processed is [6, 3, 4, 1], the feature vector corresponding to a certain sample point in the preset K nearest neighbor model is [10, 2, 5, 4], the order of the feature values of the preset categories in the feature vector is disease description, drug use, diagnosis result, related examination, and the target categories determined after the first classification are disease description and diagnosis result. Then, when the inquiry data to be processed is classified in step S140, the distance between the inquiry data to be processed and the sample point is calculated as

d = \sqrt{(6 - 10)^2 + (4 - 5)^2} = \sqrt{17}
Through the above steps S110 to S140, the secondary classification of the to-be-processed inquiry data can avoid inaccurate classification results caused by an excessively large K value. Specifically, for example, the category to which the sample point with the smallest distance to the to-be-processed inquiry data belongs may be category A, but if the classification is performed only once using the initial K value, the number of sample points belonging to category B among the first K sample points with the smallest distances to the to-be-processed inquiry data may be the largest because the initial K value is too large, and the to-be-processed inquiry data may thus be wrongly classified into category B. In the method, the K value is reduced on the basis of the first classification so as to perform secondary classification on the inquiry data to be processed; because the K value is reduced, fewer wrong sample points are selected as nearest samples, so the categories to which such wrong sample points belong are prevented from participating in the vote on the category of the inquiry data to be processed, large errors can be avoided, and the accuracy of classifying the inquiry data to be processed is improved.
In an exemplary application scenario, historical chat data can be obtained from a database that stores online medical inquiry chat records. After the historical chat data is segmented by a word segmentation algorithm, keyword libraries corresponding to the four preset categories of disease description, diagnosis result, drug use, and related examination are established according to the actual meaning of each segmented word. The feature vector of each piece of historical chat data is then determined from the occurrence frequency of the keywords of each preset category in that data. Meanwhile, a label is determined for each piece of historical chat data based on the category to which it actually belongs: the label may be assigned manually according to the actual meaning of the chat record, or determined directly from the keyword frequencies, for example by labeling the data with the preset category whose keywords occur most often in it. In this way the sample points of the K nearest neighbor model are generated.
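A minimal sketch of how such sample points might be generated is given below; the keyword libraries and the tokenizer are illustrative placeholders only, since a real system would use a proper word-segmentation algorithm and domain lexicons:

    # Hypothetical keyword libraries for the four preset categories.
    KEYWORDS = {
        "disease description": {"headache", "fever", "cough"},
        "diagnosis result": {"diagnosis", "cold", "flu"},
        "drug use": {"ibuprofen", "dosage", "tablet"},
        "related examination": {"hemogram", "ct", "xray"},
    }
    CATEGORIES = list(KEYWORDS)

    def tokenize(text):
        # Placeholder segmentation; a real system would use a word-segmentation algorithm.
        return text.lower().split()

    def to_feature_vector(chat_text):
        tokens = tokenize(chat_text)
        return [sum(t in KEYWORDS[c] for t in tokens) for c in CATEGORIES]

    def make_sample_point(chat_text):
        vec = to_feature_vector(chat_text)
        # Label the sample with the preset category whose keywords occur most often.
        label = CATEGORIES[vec.index(max(vec))]
        return vec, label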
Illustratively, chat data unrelated to the inquiry may be excluded before the sample points are generated. For example, whether a piece of chat data is useful can be decided by checking whether the number of matches between that chat data and the keywords in the libraries of the preset categories is greater than a preset value. If the sum of the occurrence frequencies of the keywords of all preset categories in a piece of chat data is greater than the preset value, the chat data is determined to be useful and is retained; otherwise it is determined to be useless, for example when the chat data contains only polite remarks from the patient to the doctor. Only the feature vectors and labels of the useful chat data need to be determined to generate the sample points of the K nearest neighbor model.
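Continuing the sketch above, the filtering of useless chat data might look like this; the preset value of 1 and the example chat strings are assumptions:

    USEFUL_THRESHOLD = 1  # assumed preset value

    def is_useful(chat_text):
        # Keep the chat data only if the total keyword hits across all preset
        # categories exceed the preset value.
        return sum(to_feature_vector(chat_text)) > USEFUL_THRESHOLD

    historical_chats = ["I have had a fever and cough for two days", "thank you doctor"]
    sample_points = [make_sample_point(c) for c in historical_chats if is_useful(c)]
    # Only the first chat is kept; "thank you doctor" matches no category keywords.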
When a new inquiry chat record is generated, the new chat record can be obtained from the database storing chat records after the chat ends. For each piece of chat data in the new record, the category corresponding to that chat data is determined based on steps S110 to S140, and the chat data is stored in groups by category. During a follow-up visit, the doctor can then query the required illness information directly from the corresponding group, which improves the efficiency of the follow-up visit.
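A minimal sketch of grouping the classified chat data for later retrieval during a follow-up visit is shown below; the (category, chat_text) pair structure is an assumption:

    from collections import defaultdict

    def group_by_category(classified_chats):
        # classified_chats: iterable of (category, chat_text) pairs.
        groups = defaultdict(list)
        for category, text in classified_chats:
            groups[category].append(text)
        return groups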
Each inquiry process is a process of continuous communication between a doctor and a patient and generates a plurality of pieces of inquiry data; each piece may be regarded as one piece of to-be-processed inquiry data and processed based on steps S110 to S140.
For example, before determining the category of a piece of inquiry data, it may also be checked whether the sum of the occurrence frequencies of the keywords of all preset categories in that data is greater than the preset value. If so, the inquiry data is classified; otherwise it is regarded as useless information and is not classified.
To further help doctors and patients improve the efficiency of follow-up visits, the to-be-processed inquiry data can be marked according to its classification result. During a follow-up visit, the marked inquiry data among the target patient's inquiry data can then be displayed quickly according to the user's needs.
For example, fig. 5 is a flow chart illustrating a method for labeling classified inquiry data in an exemplary embodiment of the disclosure. Referring to fig. 5, the method may include steps S510 to S540. Wherein:
in step S510, the classification result of the to-be-processed inquiry data is saved in a database.
For example, after the classification result of each piece of the to-be-processed inquiry data is determined, the category corresponding to the to-be-processed inquiry data may be stored in the database.
In step S520, when the to-be-processed inquiry data has the mark attribute, other inquiry data belonging to the same patient and having the same category as the to-be-processed inquiry data is selected from the database.
In step S530, the other inquiry data is displayed in the target client to prompt the user of the target client whether to mark the other inquiry data.
In step S540, in response to the marking operation for the other inquiry data, a mark is added to the other inquiry data in the database.
For example, each inquiry process is a process of continuous communication between a doctor and a patient and generates multiple pieces of inquiry data. After the inquiry ends, the doctor or the patient can add marks to certain key inquiry data of that inquiry process as needed. For example, when the doctor adds a mark to certain key inquiry data, the mark state corresponding to that inquiry data can be set to an effective state once the marked data is obtained, so that the inquiry data has the mark attribute.
Other inquiry data that belongs to the same patient and has the same category as the inquiry data with the mark attribute is then selected and sent to the client corresponding to the doctor, so as to recommend to the doctor other inquiry data that can be marked. Alternatively, all the inquiry data with determined categories among the multiple pieces of inquiry data generated in the inquiry process may be retrieved from the database, the inquiry data belonging to the same category as the inquiry data with the mark attribute may be selected from it, and the selected inquiry data may be sent to the doctor's client as the recommendation.
The recommendation information may include a prompt asking the doctor whether the recommended inquiry data needs to be marked. The doctor can then select some or all of the recommended inquiry data to be marked, and the mark attribute is added to the corresponding inquiry data in the database according to the doctor's selection. When the doctor conducts a follow-up visit later, the inquiry data with the mark attribute can be displayed directly according to the doctor's needs, which makes it convenient for the doctor to quickly review the patient's condition. For example, in response to the doctor triggering a certain control on the doctor's client, the inquiry data with the mark attribute in the current patient's historical inquiry data is displayed on that client.
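For illustration, the recommendation and marking steps might be sketched as follows; the record fields (patient_id, category, marked, id) are hypothetical and do not reflect the patent's actual storage schema:

    def recommend_markable(records, marked_record):
        # Suggest other records of the same patient and same category as candidates for marking.
        return [
            r for r in records
            if r is not marked_record
            and r["patient_id"] == marked_record["patient_id"]
            and r["category"] == marked_record["category"]
            and not r.get("marked", False)
        ]

    def apply_marks(records, selected_ids):
        # Add the mark attribute to the records the doctor chose to mark.
        for r in records:
            if r["id"] in selected_ids:
                r["marked"] = True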
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments are implemented as computer programs executed by a CPU. The computer program, when executed by the CPU, performs the functions defined by the method provided by the present invention. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Fig. 6 shows a schematic configuration diagram of a medical inquiry data processing device in an exemplary embodiment of the present disclosure. Referring to fig. 6, the apparatus 600 may include a feature extraction module 610, an initial classification module 620, a target feature determination module 630, and a secondary classification module 640. Wherein:
the feature extraction module 610 is configured to pre-process the to-be-processed inquiry data to determine a feature vector corresponding to the to-be-processed inquiry data;
an initial classification module 620, configured to input the feature vectors into a preset K nearest neighbor model, so as to determine top N target classes with the highest frequency of occurrence in classes to which K samples adjacent to the to-be-processed inquiry data belong;
a target feature determination module 630, configured to determine, in the feature vector corresponding to the to-be-processed inquiry data, a feature value corresponding to the target category, so as to generate a target feature vector corresponding to the to-be-processed inquiry data;
the secondary classification module 640 is configured to adjust a K value in the preset K nearest neighbor model, and input a target feature vector corresponding to the to-be-processed inquiry data into the preset K nearest neighbor model after the K value is adjusted, so as to classify the to-be-processed inquiry data.
In some exemplary embodiments of the present disclosure, based on the foregoing embodiments, the feature vector corresponding to the to-be-processed inquiry data is used to indicate the occurrence frequency of the keyword corresponding to each preset category in the to-be-processed inquiry data.
In some exemplary embodiments of the present disclosure, based on the foregoing embodiments, the feature extraction module 610 may further be configured to: perform word segmentation processing on the to-be-processed inquiry data; match the word segmentation result of the to-be-processed inquiry data with the keywords in the word bank corresponding to each preset category, to determine the occurrence frequency of the keywords corresponding to each preset category in the to-be-processed inquiry data; and obtain the feature vector corresponding to the to-be-processed inquiry data according to the occurrence frequencies.
In some exemplary embodiments of the present disclosure, based on the foregoing embodiments, the preset K-nearest neighbor model is predetermined by: obtaining a plurality of initial K nearest neighbor models based on different preset initial K values; acquiring a training data set, and dividing the training data set into a test data set and a sample data set; respectively determining the identification accuracy of each initial K nearest neighbor model according to the test data set and the sample data set; and selecting the K nearest neighbor model with the highest identification accuracy from the plurality of initial K nearest neighbor models to obtain the preset K nearest neighbor model. For each initial K nearest neighbor model, determining the identification accuracy of the initial K nearest neighbor model by executing the following processes: inputting the test data into the initial K nearest neighbor model aiming at each test data in the test data set so as to determine the top N categories to be selected, which have the highest frequency, in the categories of K sample data adjacent to the test data; determining a characteristic value corresponding to the category to be selected in the characteristic vector corresponding to the test data to generate a target characteristic vector corresponding to the test data; adjusting a preset initial K value in the initial K nearest neighbor model to obtain a target K value; determining an identification category corresponding to the test data based on a target feature vector of the test data and the target K value by taking the sample data set as a sample in the initial K nearest neighbor model; and determining the identification accuracy of the initial K nearest neighbor model based on the identification category corresponding to each piece of test data and the label corresponding to each piece of test data.
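The K-value selection procedure described above might be sketched as follows, reusing the classify_twice sketch given earlier; the halving rule for deriving the target K value is an assumption, since the description only requires that the initial K value be reduced:

    def accuracy_for_k(initial_k, test_set, sample_set, categories, top_n=2):
        # test_set and sample_set are lists of (feature_vector, label) pairs.
        samples = [vec for vec, _ in sample_set]
        labels = [lbl for _, lbl in sample_set]
        target_k = max(1, initial_k // 2)  # assumed rule for deriving the target K value
        correct = sum(
            classify_twice(vec, samples, labels, categories, initial_k, target_k, top_n) == lbl
            for vec, lbl in test_set
        )
        return correct / len(test_set)

    def pick_best_k(candidate_ks, test_set, sample_set, categories):
        # Select the initial K value whose model achieves the highest identification accuracy.
        return max(candidate_ks, key=lambda k: accuracy_for_k(k, test_set, sample_set, categories))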
In some exemplary embodiments of the present disclosure, based on the foregoing embodiments, the adjusting the preset initial K value in the initial K-nearest neighbor model to obtain a target K value includes: and reducing a preset initial K value in the initial K nearest neighbor model to obtain a target K value corresponding to the preset initial K value.
In some exemplary embodiments of the present disclosure, based on the foregoing embodiments, the secondary classification module 640 may be further configured to: acquire a target K value corresponding to the initial K value in the preset K nearest neighbor model, and adjust the K value in the preset K nearest neighbor model from the initial K value to the target K value, where the target K value is obtained by adjusting the preset initial K value corresponding to the preset K nearest neighbor model during model training.
In some exemplary embodiments of the present disclosure, based on the aforementioned embodiments, the preset categories include one or more of a description of a condition, a diagnosis result, a use of a drug, and a related examination.
In some exemplary embodiments of the present disclosure, based on the foregoing embodiments, the medical inquiry data processing apparatus 600 may further include a marking module specifically configured to: store the classification result of the to-be-processed inquiry data in a database; when the to-be-processed inquiry data has the mark attribute, select from the database other inquiry data that belongs to the same patient as the to-be-processed inquiry data and has the same category; display the other inquiry data in a target client to prompt a user of the target client whether to mark the other inquiry data; and, in response to the marking operation for the other inquiry data, add marks to the other inquiry data in the database.
The specific details of each unit in the medical inquiry data processing device are already described in detail in the corresponding medical inquiry data processing method, and therefore, the detailed description thereof is omitted here.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into and embodied by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer storage medium capable of implementing the above method. On which a program product capable of implementing the above-described method of the present specification is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 7, a program product 700 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
An electronic device 800 according to this embodiment of the disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 8, electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, a bus 830 connecting various system components (including the memory unit 820 and the processing unit 810), and a display unit 840.
Wherein the storage unit stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification. For example, the processing unit 810 may perform the various steps shown in fig. 1-5.
The storage unit 820 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 8201 and/or a cache memory unit 8202, and may further include a read-only memory unit (ROM) 8203.
The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 900 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A medical interrogation data processing method, comprising:
preprocessing inquiry data to be processed to determine a characteristic vector corresponding to the inquiry data to be processed;
inputting the feature vectors into a preset K nearest neighbor model to determine the top N target categories with the most frequent occurrence in the categories to which the K samples adjacent to the inquiry data to be processed belong;
determining a characteristic value corresponding to the target category in a characteristic vector corresponding to the to-be-processed inquiry data so as to generate a target characteristic vector corresponding to the to-be-processed inquiry data;
adjusting a K value in the preset K nearest neighbor model, and inputting a target feature vector corresponding to the inquiry data to be processed into the preset K nearest neighbor model after the K value is adjusted so as to classify the inquiry data to be processed.
2. The medical inquiry data processing method according to claim 1, wherein the feature vector corresponding to the to-be-processed inquiry data is used to indicate the occurrence frequency of the keyword corresponding to each preset category in the to-be-processed inquiry data;
the preprocessing is performed on the inquiry data to be processed to determine the characteristic vector corresponding to the inquiry data to be processed, and the preprocessing comprises the following steps:
performing word segmentation processing on the inquiry data to be processed;
matching the word segmentation result of the inquiry data to be processed with the keywords in the word bank corresponding to each preset category to determine the occurrence frequency of the keywords corresponding to each preset category in the inquiry data to be processed;
and obtaining a characteristic vector corresponding to the inquiry data to be processed according to the occurrence frequency.
3. The medical interrogation data processing method of claim 1, wherein the preset K nearest neighbor model is predetermined by:
obtaining a plurality of initial K nearest neighbor models based on different preset initial K values;
acquiring a training data set, and dividing the training data set into a test data set and a sample data set;
respectively determining the identification accuracy of each initial K nearest neighbor model according to the test data set and the sample data set;
selecting a K nearest neighbor model with the highest identification accuracy from the plurality of initial K nearest neighbor models to obtain the preset K nearest neighbor model;
for each initial K nearest neighbor model, determining the identification accuracy of the initial K nearest neighbor model by executing the following processes:
inputting the test data into the initial K nearest neighbor model aiming at each test data in the test data set so as to determine the top N categories to be selected, which have the highest frequency, in the categories of K sample data adjacent to the test data;
determining a characteristic value corresponding to the category to be selected in the characteristic vector corresponding to the test data to generate a target characteristic vector corresponding to the test data;
adjusting a preset initial K value in the initial K nearest neighbor model to obtain a target K value;
determining an identification category corresponding to the test data based on a target feature vector of the test data and the target K value by taking the sample data set as a sample in the initial K nearest neighbor model;
and determining the identification accuracy of the initial K nearest neighbor model based on the identification category corresponding to each piece of test data and the label corresponding to each piece of test data.
4. The medical interrogation data processing method of claim 3, wherein the adjusting the preset initial K value in the initial K nearest neighbor model to obtain a target K value comprises:
and reducing a preset initial K value in the initial K nearest neighbor model to obtain a target K value corresponding to the preset initial K value.
5. The medical interrogation data processing method of any of claims 1 to 3, wherein the adjusting the K value in the preset K nearest neighbor model comprises:
acquiring a target K value corresponding to the initial K value in the preset K nearest neighbor model;
adjusting the K value in the preset K nearest neighbor model from the initial K value to the target K value;
and the target K value is obtained by adjusting a preset initial K value corresponding to the preset K nearest neighbor model in the model training process.
6. The medical interrogation data processing method of claim 2, wherein the preset categories include one or more of a description of a condition, a diagnosis result, a medication use, a related examination.
7. The medical interrogation data processing method of claim 1, further comprising:
storing the classification result of the inquiry data to be processed in a database;
when the to-be-processed inquiry data have the mark attribute, selecting other inquiry data which belong to the same inquiry patient as the to-be-processed inquiry data and have the same category from a database;
displaying the other inquiry data in a target client to prompt a user of the target client whether to mark the other inquiry data;
in response to the tagging operation for the other interrogation data, tags are added to the other interrogation data in the database.
8. A medical interrogation data processing apparatus, comprising:
the characteristic extraction module is configured to preprocess the inquiry data to be processed so as to determine a characteristic vector corresponding to the inquiry data to be processed;
the initial classification module is configured to input the feature vectors into a preset K nearest neighbor model so as to determine the top N target classes with the highest frequency of occurrence in the classes to which the K samples adjacent to the to-be-processed inquiry data belong;
a target feature determination module configured to determine a feature value corresponding to the target category in a feature vector corresponding to the to-be-processed inquiry data to generate a target feature vector corresponding to the to-be-processed inquiry data;
and the secondary classification module is configured to adjust a K value in the preset K nearest neighbor model, and input a target feature vector corresponding to the to-be-processed inquiry data into the preset K nearest neighbor model after the K value is adjusted so as to classify the to-be-processed inquiry data.
9. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, implements the medical interrogation data processing method according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the medical interrogation data processing method of any of claims 1 to 7.
CN202111371847.7A 2021-11-18 2021-11-18 Medical inquiry data processing method and device, readable storage medium and electronic equipment Pending CN114068028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111371847.7A CN114068028A (en) 2021-11-18 2021-11-18 Medical inquiry data processing method and device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111371847.7A CN114068028A (en) 2021-11-18 2021-11-18 Medical inquiry data processing method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN114068028A true CN114068028A (en) 2022-02-18

Family

ID=80278145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111371847.7A Pending CN114068028A (en) 2021-11-18 2021-11-18 Medical inquiry data processing method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114068028A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561693A (en) * 2023-05-26 2023-08-08 工业富联(佛山)产业示范基地有限公司 Abnormality determination method for injection molding machine, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination