CN111177375A - Electronic document classification method and device - Google Patents

Publication number
CN111177375A
Authority
CN
China
Prior art keywords
document
classified
feature
electronic
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911295117.6A
Other languages
Chinese (zh)
Other versions
CN111177375B (en)
Inventor
杨宝山
强晟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yidu Cloud Beijing Technology Co Ltd
Original Assignee
Yidu Cloud Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yidu Cloud Beijing Technology Co Ltd
Priority to CN201911295117.6A
Publication of CN111177375A
Application granted
Publication of CN111177375B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of electronic document processing and provides an electronic document classification method and device. The method comprises: performing word segmentation on an electronic document to be classified to obtain features to be extracted; matching the features to be extracted according to a feature extraction model to obtain a feature vector corresponding to the electronic document to be classified; and processing the feature vector with a machine learning classification algorithm to classify the electronic document to which the feature vector corresponds. Because the electronic document is segmented into words, vectorized through feature extraction, and then classified by a machine learning classification algorithm, the handling of complex electronic documents is taken into account, the accuracy of document classification is improved, and the accuracy of subsequent electronic medical record structuring is improved in turn.

Description

Electronic document classification method and device
Technical Field
The invention belongs to the technical field of electronic document processing, and particularly relates to an electronic document classification method and device.
Background
A large proportion of medical data consists of CDA (Clinical Document Architecture) documents recorded in natural language, among which the Electronic Medical Record (EMR) is a very important kind. An electronic medical record document is the digital information, such as text, symbols, charts, graphs, data, and images, generated by medical staff using a medical information system in the course of medical activities; it is a record of medical activity that can be transmitted, reproduced, stored, and managed by means of information technology. With the continuous popularization of electronic medical record documents, a large amount of medical data is continuously accumulated in the form of electronic medical record documents.
In the era of big data, big data techniques are used to convert electronic medical record documents into a uniform data form, which breaks down the barriers created by data differences within and between hospitals and allows more valuable medical information to be mined. Classifying electronic medical record documents is an important link in producing or structuring them, and accurate classification helps improve the accuracy of subsequent electronic medical record structuring.
At present, when documents are classified, a supervised learning method is often adopted to train a classification model, however, the characteristics of electronic medical record documents are very complex, and the reliability of classification results obtained by simply adopting supervised learning is not high, so that the performance of the trained classification model is limited, and the accuracy of the classification of the electronic medical record documents is not high.
Disclosure of Invention
In view of this, embodiments of the present invention provide an electronic document classification method, an electronic document classification device, a terminal device, and a computer-readable storage medium, so as to solve the technical problem in the prior art that the accuracy of classifying electronic medical record documents is not high.
A first aspect of an embodiment of the present invention provides an electronic document classification method, including:
performing word segmentation on the electronic document to be classified to acquire features to be extracted;
matching the features to be extracted according to a feature extraction model to obtain a feature vector corresponding to the electronic document to be classified;
and processing the feature vector by adopting a machine learning classification algorithm so as to classify the electronic document to be classified corresponding to the feature vector.
A second aspect of an embodiment of the present invention provides an electronic document classification apparatus, including:
the word segmentation module is used for segmenting words of the electronic document to be classified so as to obtain features to be extracted;
the feature vector acquisition module is used for matching the features to be extracted according to a feature extraction model so as to acquire feature vectors corresponding to the electronic documents to be classified;
and the classification module is used for processing the feature vectors by adopting a machine learning classification algorithm so as to classify the electronic documents to be classified corresponding to the feature vectors.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps of the method when executing the computer program.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as described above.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects: the electronic document to be classified is segmented into words, a feature vector is obtained through feature extraction, and the feature vector is processed with a machine learning classification algorithm, thereby classifying the electronic document. This takes the handling of complex electronic documents fully into account and avoids relying on supervised learning alone, which effectively improves the accuracy of document classification and in turn helps improve the accuracy of subsequent electronic medical record structuring.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first flowchart illustrating an implementation of a method for classifying electronic documents according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a process of obtaining feature vectors in a method for classifying electronic documents according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating document classification in an electronic document classification method according to an embodiment of the present invention;
FIG. 4 is a flowchart of constructing the document classification model in the electronic document classification method according to an embodiment of the present invention;
FIG. 5 is a flowchart of obtaining the probability that a preset feature vector belongs to each document type in the electronic document classification method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a second implementation of the method for classifying electronic documents according to the embodiment of the present invention;
FIG. 7 is a first schematic diagram of an apparatus for classifying electronic documents according to an embodiment of the present invention;
FIG. 8 is a second schematic diagram of an electronic document classification apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
With the increasing popularity of electronic medical record documents, a large amount of medical data is stored in the form of electronic medical record documents, which usually include documents recorded in natural language. In the era of big data, big data techniques are used to convert electronic medical record documents into a uniform data form, so that the barriers created by data differences within and between hospitals are broken down and more valuable medical information can be mined.
Before steps such as structuring electronic medical record documents, subdividing documents by category is very important work: it reduces the difficulty of the subsequent structuring process and improves data quality. Document classification means sorting a large number of documents into categories such that documents in the same category share similar structural characteristics, the specific structural characteristics being set according to business needs. Electronic medical record documents in hospitals have different names and meanings, such as "admission record" and "discharge record", depending on the stage and purpose of diagnosis and treatment. In a hospital's data system, electronic medical record documents are rarely labeled with an explicit category, that is, they are not classified, so a technique for classifying electronic medical record documents is needed.
However, there is currently no document classification technique designed for electronic medical record documents. Other document classification techniques train a classification model by supervised learning (a process of adjusting the parameters of a classification model using samples of known categories until it reaches the required performance). Because the characteristics of electronic medical record documents are very complex, a classification model trained by supervised learning alone gives classification results of low reliability, so the accuracy of electronic medical record document classification is not high.
The embodiment provides a brand-new electronic document classification method, which can be applied to the classification of electronic medical record documents and the classification of other types of electronic documents, and can effectively improve the accuracy of the classification of the electronic documents.
FIG. 1 is a flowchart illustrating an electronic document classification method according to this embodiment. As shown in fig. 1, the electronic document classification method provided by the present embodiment includes:
Step S10: performing word segmentation on the electronic document to be classified to obtain the features to be extracted.
The electronic documents may be any type of electronic documents, and may be, for example, electronic medical record documents from various hospitals. The embodiment takes an electronic medical record document as an example for explanation. The electronic medical record document is generally the patient information recorded by doctors in each hospital, and can record the patient information according to a unified standard or record the patient information according to usual habits. The types of the obtained electronic medical record documents, such as "admission record" and "discharge record", are different according to different stages and purposes of patient treatment, so that the electronic medical record documents need to be classified. Before classification, the content recorded in the electronic medical record document needs to be identified so as to extract the characteristic information in the electronic medical record document.
The electronic medical record document can be segmented with a word segmentation algorithm, which yields a plurality of words; these words are the features to be extracted. During segmentation, punctuation-based segmentation logic is also applied to the document. The word segmentation algorithms that may be adopted include segmentation based on grammar and rules, segmentation based on understanding, and segmentation based on statistics.
Segmentation based on grammar and rules, also called mechanical segmentation, matches the Chinese character string to be analyzed against the entries of a sufficiently large machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds and a word is identified. String-matching segmentation can be divided by scanning direction into forward matching and reverse matching; by which length is matched preferentially into maximum matching and minimum matching; and by whether it is combined with part-of-speech tagging into pure segmentation methods and integrated methods that combine segmentation and tagging.
Common segmentation methods based on grammar and rules include forward maximum matching (scanning left to right), reverse maximum matching (scanning right to left), and minimum segmentation (minimizing the number of words cut from each sentence). Forward maximum matching takes a substring of bounded length from the front of the string and looks it up in the dictionary. If the lookup succeeds, the substring is cut off as a word and the next round of matching begins on the remainder, until the whole string has been processed; otherwise, one character is removed from the end of the substring and the lookup is repeated, and so on. Reverse maximum matching works in the same way but scans from the end of the string.
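The forward maximum matching procedure described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `forward_max_match`, the `max_len` window, and the toy dictionary are all assumptions made for the example, and unmatched single characters are emitted as one-character tokens so that the scan always advances.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` left to right, preferring the longest dictionary entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Start from the longest allowed window and drop one character
        # from the end each time the lookup fails.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real system would use a large machine dictionary.
toy_dictionary = {"入院", "记录", "入院记录", "出院", "出院记录", "患者"}
print(forward_max_match("患者入院记录", toy_dictionary))
```

With the window bounded by `max_len`, a longer entry such as "入院记录" wins over its shorter parts whenever it fits inside the window; a smaller window forces the shorter entries instead.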
The word segmentation method based on understanding achieves the effect of recognizing words by enabling a computer to simulate the understanding of sentences by people. The basic idea of the word segmentation method based on understanding is to perform syntactic and semantic analysis while segmenting words, and to process ambiguity phenomena by using syntactic information and semantic information.
Segmentation based on statistics: formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur with their neighbors therefore reflects how credible a candidate word is. The co-occurrence information of adjacent characters is computed by counting how often each combination appears in the corpus. This co-occurrence information reflects how tightly the characters are bound together, and when the tightness exceeds a certain threshold, the character group can be considered a possible word. Because this method only needs to count character-group frequencies in a corpus and does not need a segmentation dictionary, it is also called dictionary-free segmentation or statistical word extraction. In practice, a statistical segmentation system can use a basic segmentation dictionary for string-matching segmentation while using statistics to identify new words. Combining string-frequency statistics with string matching retains the speed and efficiency of matching-based segmentation and gains the ability of dictionary-free segmentation to recognize new words from context and to resolve ambiguity automatically.
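As a rough illustration of the statistical idea, the sketch below counts adjacent character pairs in a corpus and scores each pair by pointwise mutual information, keeping pairs whose score exceeds a threshold. The choice of pointwise mutual information as the closeness measure, the function names, the toy corpus, and the threshold are assumptions for the example; the text above only requires some co-occurrence statistic compared against a threshold.

```python
import math
from collections import Counter


def adjacent_pair_scores(corpus):
    """Score every adjacent character pair by pointwise mutual information."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for pair, count in pair_counts.items():
        p_pair = count / total_pairs
        p_first = char_counts[pair[0]] / total_chars
        p_second = char_counts[pair[1]] / total_chars
        # Higher values mean the two characters co-occur more than chance predicts.
        scores[pair] = math.log(p_pair / (p_first * p_second))
    return scores


def candidate_words(corpus, threshold):
    """Return the character pairs whose closeness exceeds the threshold."""
    return {pair for pair, score in adjacent_pair_scores(corpus).items() if score > threshold}


# Hypothetical toy corpus; real statistics need a far larger corpus.
toy_corpus = ["入院记录", "出院记录", "入院诊断"]
print(candidate_words(toy_corpus, threshold=2.0))
```

On a corpus this small the scores are only indicative; the point is the shape of the computation, namely counting, scoring, and thresholding without any dictionary.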
After word segmentation is performed by a word segmentation algorithm, the electronic medical record document can be split into a plurality of words, and each word is a feature to be extracted. It can be understood that not all the content of an electronic medical record document is useful for classifying the document, and therefore further processing is required for the features to be extracted.
Step S20: matching the features to be extracted according to a feature extraction model to obtain a feature vector corresponding to the electronic document to be classified.
Since the input of the machine learning classification algorithm requires a vectorized representation of the electronic document, it is necessary to construct a feature vector of the electronic document to be classified. It will be appreciated that each document type has its own rules for distribution of templates or keywords that may constitute a feature set for that document type. Therefore, a corresponding feature extraction model can be constructed for each document type, and then the feature extraction model is adopted to process the features to be extracted of the electronic documents to be classified so as to obtain feature vectors.
Referring to FIG. 2, in this embodiment, step S20 may include the following steps:
Step S201: matching the features to be extracted of the electronic document to be classified against the feature set vectors in the feature extraction model. The feature extraction model comprises at least one document type and the feature set vector corresponding to each document type, and is constructed according to preset electronic document knowledge.
As described above, each document type has its own template or keyword distribution rules. Electronic medical record documents are no exception and follow corresponding specifications. These templates and specifications constitute the preset electronic document knowledge, which may also be called "external knowledge", and a feature extraction model for electronic medical record documents can be built from it. For example, the first document type contains a plurality of features, which together form a feature set; to allow a vectorized representation, the features in the feature set are arranged in a fixed order to form a feature set vector, each component of which corresponds to one feature of that document type. The second and subsequent document types are handled in the same way. By combining all document types, the features of all document types can likewise form a feature set vector. All document types together with their feature set vectors constitute the feature extraction model.
Each electronic medical record document to be classified yields at least one feature to be extracted after word segmentation. Not every feature to be extracted has a matching component in a feature set vector, and not every document to be classified contains all the features of its document type. The features to be extracted are therefore matched against the components of the feature set vector of each document type, which gives the matching status between the features to be extracted and each feature set vector.
If a feature to be extracted matches a component in the feature set vector, then:
Step S202: in the feature vector, the value of that component is a first preset value.
If a component in the feature set vector is matched by none of the features to be extracted, then:
Step S203: in the feature vector, the value of that component is a second preset value.
In this embodiment, the first preset value may be 1, and the second preset value may be 0. Of course, in other embodiments, the actual values of the first preset value and the second preset value may be set according to the need, and are not limited to the above-mentioned situations. Through the process, the feature vector corresponding to the electronic document to be classified can be obtained, and therefore document vectorization is achieved.
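Steps S201 to S203 amount to building a binary indicator vector over the feature set vector. A minimal sketch follows, assuming the first preset value is 1 and the second is 0 as in this embodiment; the function name `build_feature_vector` and the example feature names are hypothetical and chosen only for illustration.

```python
def build_feature_vector(tokens, feature_set_vector, matched=1, unmatched=0):
    """Map segmented tokens onto the ordered components of a feature set vector."""
    token_set = set(tokens)
    # One component per feature: `matched` if some token equals it, else `unmatched`.
    return [matched if feature in token_set else unmatched for feature in feature_set_vector]


# Hypothetical feature set vector combining features of several document types.
feature_set_vector = ["主诉", "现病史", "入院诊断", "出院诊断", "出院医嘱"]
# Hypothetical output of the word segmentation step for one document.
tokens = ["患者", "主诉", "现病史", "入院诊断"]
print(build_feature_vector(tokens, feature_set_vector))
```

Tokens that have no counterpart in the feature set vector (here "患者") are simply ignored, which is how content that is not useful for classification drops out of the representation.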
Step S30: processing the feature vector with a machine learning classification algorithm to classify the electronic document to be classified to which the feature vector corresponds.
After the electronic document to be classified has been vectorized, its feature vector can be processed with a machine learning classification algorithm to classify the document. The algorithm can be chosen as needed; for example, a Gradient Boosting Decision Tree (GBDT) may be used to classify the electronic document to be classified. In other embodiments, other machine learning classification algorithms may be used, and no limitation is imposed here.
The gradient boosting decision tree uses an additive model (that is, a linear combination of basis functions) and classifies data by continually reducing the residuals produced during training. Training proceeds over multiple rounds of iteration; each round produces a weak classifier, and each weak classifier is trained on the residuals of the previous round. The weak classifiers are generally required to be simple enough, with low variance and high bias, and the training process continually improves the accuracy of the final classifier by reducing the bias.
Specifically, referring to FIG. 3, step S30 of this embodiment includes:
Step S301: inputting the feature vector corresponding to the electronic document to be classified into a document classification model to obtain the probability that the electronic document to be classified belongs to each document type. The document classification model is constructed according to the document types of electronic documents and comprises at least one classification regression tree, each document type corresponding to one classification regression tree.
Before the feature vectors of the electronic documents to be classified are processed, a document classification model needs to be constructed and trained until it meets the preset performance requirement. Referring to FIG. 4, in this embodiment, the method for constructing the document classification model may include the following steps:
Step S3011: constructing initial classification regression trees, each initial classification regression tree corresponding to one document type of the electronic document.
When the document classification model is constructed, classification regression trees are built according to the number of document types, generally one tree per document type. A newly built classification regression tree has not yet been trained and is therefore called an initial classification regression tree. In this embodiment, each classification regression tree may be a binary one, for example "admission record" versus other records, or "discharge record" versus other records.
Step S3012: training the initial classification regression trees with a preset feature vector to obtain the probability that the preset feature vector belongs to each document type.
The preset feature vector may be the feature vector of an electronic document whose document type is known. Training the initial classification regression trees with feature vectors of known document type makes it easy to judge whether the trained trees meet the performance requirement. After the preset feature vector has been processed by the initial classification regression trees, the probability of the preset feature vector with respect to each initial classification regression tree, that is, the probability that it belongs to each document type, is obtained.
Step S3013: obtaining, from the probability that the preset feature vector belongs to each document type, the residual of the preset feature vector for each document type.
Step S3014: judging whether the residual meets a preset condition. The preset condition can be set as needed, for example, whether the residual is smaller than a preset value.
If the residual meets the preset condition, the training of the initial classification regression tree has met the preset performance requirement. In this case:
Step S3015: determining the initial classification regression tree as a classification regression tree, so as to construct the document classification model.
If the residual does not meet the preset condition, the initial classification regression tree needs further training, and the process returns to step S3012.
The following is an example. Suppose there are 3 document types (K = 3) and the preset feature vector (denoted sample X) belongs to the second class. The classification result for sample X can then be represented by the three-dimensional vector [0,1,0], where 0 means the sample does not belong to the class and 1 means it does. Since sample X belongs to the second class, the component corresponding to the second class is 1 and the other components are 0.
With three classes, each round of training in effect trains 3 classification regression trees simultaneously. The first tree targets the first class and takes (X,0) as input; the second tree targets the second class and takes (X,1) as input; the third tree targets the third class and takes (X,0) as input.
After training on sample X, 3 classification regression trees are produced, whose predicted values for the three classes are f1(x), f2(x), and f3(x). In this round of training, the probabilities P1(x), P2(x), and P3(x) that sample X belongs to the first, second, and third class are:
[Equation images BDA0002320303850000091 to BDA0002320303850000093: expressions for P1(x), P2(x), and P3(x), not reproduced in this text]
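The three probability expressions are embedded as images in the original publication and do not survive in this text. A plausible reading, assuming the softmax form customarily used in multi-class gradient boosting (an assumption, not confirmed by the source), is:

```latex
P_k(x) = \frac{\exp\big(f_k(x)\big)}{\sum_{j=1}^{3} \exp\big(f_j(x)\big)}, \qquad k = 1, 2, 3
```

Under this form the three probabilities are positive and sum to 1, and a larger predicted value f_k(x) gives a larger probability for class k.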
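The residuals for the first class, the second class, and the third class can then be found as the label of each class minus the corresponding probability. Since sample X belongs to the second class, its labels are [0,1,0]:
y1=0-P1(x)
y2=1-P2(x)
y3=0-P3(x)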
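As a small numeric illustration of the example above, the following sketch computes class probabilities and residuals for sample X, taking each residual as the one-hot label component minus the predicted probability. The softmax conversion and the initial predicted values of 0.0 are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Convert per-class predicted values f_k(x) into probabilities (assumed form)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def residuals(labels, probs):
    """Residual per class: one-hot label minus predicted probability."""
    return [y - p for y, p in zip(labels, probs)]

labels = [0, 1, 0]        # sample X belongs to the second class
scores = [0.0, 0.0, 0.0]  # hypothetical predicted values f1(x), f2(x), f3(x)
probs = softmax(scores)
res = residuals(labels, probs)
print(probs)
print(res)
```

With equal predicted values, each class receives the same probability, so only the residual of the second class is positive.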
After the residuals are obtained, they can be compared against the preset condition. If the residuals do not meet the preset condition, the initial classification regression trees need further training; the three residuals are then used as the input for a second round of training of the classification regression trees, and the process is repeated. After m rounds of iteration, once the obtained classification regression trees meet the preset performance requirement, training ends, and the resulting document classification model can be used to process feature vectors.
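The iteration of steps S3012 to S3015 can be sketched as a loop that stops once every residual is small enough. This is a deliberately simplified illustration, not the patented training procedure: with a single sample X, fitting a tree to the residuals is replaced by adding the residual directly to the predicted value, and the softmax form, `threshold`, `learning_rate`, and `max_rounds` are all assumptions.

```python
import math

def softmax(scores):
    """Convert predicted values into probabilities (assumed form)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_until_converged(labels, threshold=0.05, learning_rate=1.0, max_rounds=1000):
    """Repeat training rounds until all residuals fall below the threshold."""
    scores = [0.0] * len(labels)  # one predicted value per document type
    for round_no in range(1, max_rounds + 1):
        probs = softmax(scores)                       # probability per type
        res = [y - p for y, p in zip(labels, probs)]  # residual per type
        if max(abs(r) for r in res) < threshold:      # preset condition met
            return scores, probs, round_no
        # Stand-in for fitting one tree per type to its residual:
        # move each predicted value in the direction of its residual.
        scores = [s + learning_rate * r for s, r in zip(scores, res)]
    return scores, softmax(scores), max_rounds

scores, probs, rounds = train_until_converged([0, 1, 0])
print(rounds, probs)
```

In a real model each round would fit one classification regression tree per document type over many training samples; the stopping test is the only part taken directly from the description.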
Step S302: Determine the document type of the electronic document to be classified according to the probability that the electronic document to be classified belongs to each document type.
The higher the probability that the electronic document to be classified belongs to a certain document type, the more likely it is to belong to that type. The document type with the highest probability is therefore determined as the document type of the electronic document to be classified.
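Step S302 amounts to selecting the type with the largest probability. A minimal sketch, in which the document type names and probability values are hypothetical:

```python
def pick_document_type(probabilities):
    """Return the document type with the highest probability."""
    return max(probabilities, key=probabilities.get)

# Hypothetical output of the document classification model for one document.
probs = {
    "admission_record": 0.12,
    "discharge_summary": 0.81,
    "surgery_record": 0.07,
}
doc_type = pick_document_type(probs)
print(doc_type)
```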
Referring to fig. 5, in the present embodiment, step S3012 may further include the following steps:
Step S30121: Select a component of the preset feature vector as a node in the initial classification regression tree.
Step S30122: Take the feature value of the component corresponding to the node in each preset feature vector as a candidate division point of the node, and obtain a loss value for each candidate.
Step S30123: Take the feature value whose loss value meets a preset condition (for example, the minimum loss value) as the division point of the node, and obtain the predicted value of the preset feature vector for each document type.
Step S30124: Obtain the probability that the preset feature vector belongs to each document type according to the predicted value of the preset feature vector for each document type.
For example, the preset feature vector includes a first component, a second component, a third component, and a fourth component, and the specific value of each component is a feature value, so the preset feature vector may be expressed as [first feature value, second feature value, third feature value, fourth feature value]. Suppose there are 3 document types, so that there are 3 initial classification regression trees, that there are also 3 preset feature vectors, and that the electronic document belongs to the second type.
Take the first preset feature vector as an example (the second and third preset feature vectors are processed similarly):
Construct a training sample for the first initial classification regression tree, with label 0;
Construct a training sample for the second initial classification regression tree, with label 1;
Construct a training sample for the third initial classification regression tree, with label 0.
Take the first component as a node and the first feature value of the first preset feature vector as the division point, calculate the value of the loss function, and record it as the first loss value.
Take the first component as a node and the first feature value of the second preset feature vector as the division point, calculate the value of the loss function, and record it as the second loss value.
Take the first component as a node and the first feature value of the third preset feature vector as the division point, calculate the value of the loss function, and record it as the third loss value.
Take the second component as a node and the second feature value of the first preset feature vector as the division point, calculate the value of the loss function, and record it as the fourth loss value.
By analogy, 3 × 4 = 12 loss values can be obtained. The division point with the minimum loss value is selected as the best division point; the predicted values of the preset feature vector for the first, second, and third document types are obtained from the best division point, and the probabilities that the preset feature vector belongs to the first, second, and third document types are obtained from those predicted values.
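The search over candidate division points described in steps S30121 to S30123 can be sketched as follows. The loss function is not specified in the text, so the sum of squared errors on both sides of the split is used here as an assumption; the feature vectors, targets, and function names are hypothetical.

```python
def squared_error(values):
    """Sum of squared deviations from the mean (assumed loss for one side)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(vectors, targets):
    """Try every component as a node and every observed feature value as a
    candidate division point; return (loss, component index, division point)."""
    best = None
    for component in range(len(vectors[0])):
        for vec in vectors:
            point = vec[component]  # candidate division point
            left = [t for v, t in zip(vectors, targets) if v[component] <= point]
            right = [t for v, t in zip(vectors, targets) if v[component] > point]
            loss = squared_error(left) + squared_error(right)
            if best is None or loss < best[0]:
                best = (loss, component, point)
    return best

# Three hypothetical preset feature vectors with four components each,
# giving 3 x 4 = 12 candidate division points.
vectors = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
targets = [0.0, 1.0, 0.0]  # hypothetical labels for the second-class tree
loss, component, point = best_split(vectors, targets)
print(loss, component, point)
```

In this toy data the second component (index 1) separates the sample labelled 1 from the other two, so it yields the smallest loss.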
Referring to fig. 6, step S40: Match the classified electronic document against preset document knowledge.
To further improve the accuracy of electronic document classification, the electronic document to be classified may be processed further after its document type has been obtained. For example, the templates and specifications of electronic medical record documents may constitute preset electronic document knowledge, which may also be referred to as "external knowledge". From the preset electronic medical record document knowledge, feature set vectors for the different document types can be constructed. After the electronic medical record document to be classified has been classified in the previous steps, its feature vector can be matched against the feature set vector of the assigned document type to determine whether the classification result is correct. The external knowledge may be extended or reduced as the situation requires, so as to adjust the matching degree.
If the matching degree between the classified electronic document and the preset document knowledge meets the preset requirement, the document classification is correct. In this case:
Step S50: Determine that the electronic document to be classified is correctly classified.
If the matching degree between the classified electronic document and the preset document knowledge does not meet the preset requirement (the requirement being, for example, that the matching degree is higher than a certain preset value), the document classification is incorrect. In this case:
Step S60: Check the electronic document to be classified. Manual verification can be performed at this point to correct the document type of the electronic document to be classified, so as to ensure the accuracy of document classification.
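Steps S40 to S60 can be sketched as a threshold test on a matching degree. The text does not define how the matching degree is computed, so the definition below (the share of the type's expected features that appear in the document) is an assumption, as are the 0/1 encoding, the `required` value, and the function names.

```python
def matching_degree(feature_vector, feature_set_vector):
    """Share of the document type's expected features present in the document."""
    expected = [i for i, flag in enumerate(feature_set_vector) if flag == 1]
    if not expected:
        return 0.0
    hits = sum(1 for i in expected if feature_vector[i] == 1)
    return hits / len(expected)

def verify_classification(feature_vector, feature_set_vector, required=0.6):
    """Return 'correct' if the matching degree meets the preset requirement,
    otherwise flag the document for manual checking."""
    degree = matching_degree(feature_vector, feature_set_vector)
    return "correct" if degree >= required else "needs_manual_check"

type_features = [1, 1, 1, 0]  # hypothetical feature set vector of the assigned type
print(verify_classification([1, 1, 0, 1], type_features))
print(verify_classification([0, 0, 0, 1], type_features))
```

Documents flagged by the second outcome would go to the manual verification of step S60.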
The electronic document classification method provided by this embodiment has the following beneficial effects:
(1) The electronic document to be classified is segmented by a word segmentation algorithm, feature vectors are obtained by feature extraction, and the feature vectors are processed by a machine learning classification algorithm to classify the document. This takes the processing of complex electronic documents fully into account and avoids classifying electronic documents by supervised learning alone, which effectively improves the accuracy of document classification and, in turn, the accuracy of subsequent electronic medical record structuring.
(2) The classification result of the document is verified using external knowledge, and inaccurately classified electronic documents are corrected by manual verification, which effectively guarantees the accuracy of document classification.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Referring to fig. 7, the present embodiment further provides an electronic document classifying device, which includes a word segmentation module 71, a feature vector obtaining module 72, and a classification module 73. The word segmentation module 71 is configured to segment words of the electronic document to be classified to obtain features to be extracted; the feature vector obtaining module 72 is configured to match the features to be extracted according to a feature extraction model to obtain a feature vector corresponding to the electronic document to be classified; the classification module 73 is configured to process the feature vectors by using a machine learning classification algorithm, so as to classify the electronic documents to be classified corresponding to the feature vectors.
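The three modules can be pictured as three functions chained together. The sketch below is illustrative only: whitespace splitting stands in for a real word segmentation algorithm, `toy_model` stands in for the trained document classification model, and the feature names, document type names, and the 1/0 preset values are all hypothetical.

```python
def segment(text):
    """Word segmentation module (stand-in: lower-case whitespace split)."""
    return text.lower().split()

def to_feature_vector(tokens, feature_set, matched=1, unmatched=0):
    """Feature vector module: one component per feature in the feature set."""
    token_set = set(tokens)
    return [matched if feature in token_set else unmatched for feature in feature_set]

def classify(feature_vector, model):
    """Classification module: pick the type with the highest probability."""
    probabilities = model(feature_vector)
    return max(probabilities, key=probabilities.get)

def toy_model(vector):
    """Hypothetical scorer, not the trained classification regression trees."""
    raw = {
        "admission_record": vector[0] + vector[3],
        "discharge_summary": vector[1] + vector[3],
        "surgery_record": vector[2],
    }
    total = sum(raw.values()) or 1
    return {name: value / total for name, value in raw.items()}

feature_set = ["admission", "discharge", "surgery", "diagnosis"]
tokens = segment("Discharge summary with final diagnosis")
vector = to_feature_vector(tokens, feature_set)
doc_type = classify(vector, toy_model)
print(vector, doc_type)
```

Keeping the modules as separate functions mirrors the device structure, so that any one of them can be replaced without touching the others.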
Referring to fig. 8, the electronic document classifying device further includes a matching module 74, a confirming module 75 and a verifying module 76. The matching module 74 matches the classified electronic documents to be classified with the preset document knowledge; the confirming module 75 is configured to determine that the electronic document to be classified is correctly classified when the matching degree of the classified electronic document to be classified and preset document knowledge meets a preset requirement; the checking module 76 is configured to check the electronic document to be classified when the matching degree of the classified electronic document to be classified and the preset document knowledge does not meet the preset requirement.
Fig. 9 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 9, the terminal device 8 of this embodiment includes: a processor 80, a memory 81, and a computer program 82, such as an electronic document classification program, stored in said memory 81 and operable on said processor 80. The processor 80, when executing the computer program 82, implements the steps in the various electronic document classification method embodiments described above, such as the steps S10 through S30 shown in fig. 1. Alternatively, the processor 80, when executing the computer program 82, implements the functions of the modules/units in the above-described device embodiments, such as the functions of the modules 71 to 73 shown in fig. 7.
Illustratively, the computer program 82 may be partitioned into one or more modules/units that are stored in the memory 81 and executed by the processor 80 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 82 in the terminal device 8.
The terminal device 8 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 80, a memory 81. Those skilled in the art will appreciate that fig. 9 is merely an example of a terminal device 8 and does not constitute a limitation of terminal device 8 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.
The Processor 80 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or an internal memory of the terminal device 8. The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used for storing the computer program and other programs and data required by the terminal device. The memory 81 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated modules/units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. An electronic document classification method, comprising:
performing word segmentation on the electronic document to be classified to acquire features to be extracted;
matching the features to be extracted according to a feature extraction model to obtain a feature vector corresponding to the electronic document to be classified;
and processing the characteristic vectors by adopting a machine learning classification algorithm so as to classify the electronic documents to be classified corresponding to the characteristic vectors.
2. The method for classifying electronic documents according to claim 1, wherein said segmenting the electronic documents to be classified to obtain the features to be extracted comprises:
and performing word segmentation on the electronic document to be classified according to a word segmentation algorithm to obtain the features to be extracted, wherein the word segmentation algorithm comprises a word segmentation method based on grammar and rules, a word segmentation method based on understanding and a word segmentation method based on statistics.
3. The method for classifying electronic documents according to claim 1, wherein said matching the features to be extracted according to a feature extraction model to obtain the feature vectors corresponding to the electronic documents to be classified comprises:
matching the to-be-extracted features corresponding to the to-be-classified electronic documents with feature set vectors in a feature extraction model; the feature extraction model comprises at least one document type and a feature set vector corresponding to the document type, and is constructed according to preset electronic document knowledge;
if the feature to be extracted is matched with one component in the feature set vector, the component value in the feature vector corresponding to the feature to be extracted is a first preset value;
if the feature to be extracted is not matched with the component in the feature set vector, the component value in the feature vector corresponding to the feature to be extracted is a second preset value.
4. The method for classifying electronic documents according to claim 1, wherein said processing the feature vectors by using a machine learning classification algorithm to classify the electronic documents to be classified corresponding to the feature vectors comprises:
inputting the feature vectors corresponding to the electronic documents to be classified into a document classification model to obtain the probability that the electronic documents to be classified belong to each document type; the document classification model is constructed according to document types of electronic documents, the document classification model comprises at least one classification regression tree, and each document type corresponds to one classification regression tree;
and determining the document type of the electronic document to be classified according to the probability that the electronic document to be classified belongs to each document type.
5. The method of classifying an electronic document according to claim 4, wherein the manner of constructing the document classification model includes:
constructing initial classification regression trees, wherein each initial classification regression tree corresponds to one document type of the electronic document;
training the initial classification regression tree by adopting a preset feature vector to obtain the probability that the preset feature vector belongs to each document type;
obtaining the residual error of the preset feature vector corresponding to each document type according to the probability that the preset feature vector belongs to each document type;
judging whether the residual error meets a preset condition or not;
if the residual error meets the preset condition, determining the initial classification regression tree as a classification regression tree to construct the document classification model;
and if the residual does not meet the preset condition, returning to the step of constructing the initial classification regression tree.
6. The method of classifying electronic documents according to claim 5, wherein said training said initial classification regression tree with a predetermined feature vector to obtain a probability that said predetermined feature vector belongs to each document type comprises:
selecting a component in the preset feature vector as a node in the initial classification regression tree according to the preset feature vector;
taking the characteristic value of the component corresponding to the node in each preset characteristic vector as a candidate division point of the node, and acquiring a loss value;
taking the characteristic value corresponding to the loss value meeting the preset condition as a division point of the node, and acquiring a predicted value of each document type of the preset characteristic vector;
and acquiring the probability of the preset feature vector belonging to each document type according to the predicted value of the preset feature vector belonging to each document type.
7. The method for classifying electronic documents according to claim 1, wherein said processing said feature vectors by using a machine learning classification algorithm to classify the electronic documents to be classified corresponding to said feature vectors further comprises:
matching the classified electronic documents to be classified with preset document knowledge;
if the matching degree of the classified electronic documents to be classified and preset document knowledge meets a preset requirement, determining that the electronic documents to be classified are correctly classified;
and if the matching degree of the classified electronic documents to be classified and preset document knowledge does not meet the preset requirement, verifying the electronic documents to be classified.
8. An electronic document classification apparatus, comprising:
the word segmentation module is used for segmenting words of the electronic document to be classified so as to obtain features to be extracted;
the feature vector acquisition module is used for matching the features to be extracted according to a feature extraction model so as to acquire feature vectors corresponding to the electronic documents to be classified;
and the classification module is used for processing the characteristic vectors by adopting a machine learning classification algorithm so as to classify the electronic documents to be classified corresponding to the characteristic vectors.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN201911295117.6A 2019-12-16 2019-12-16 Electronic document classification method and device Active CN111177375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911295117.6A CN111177375B (en) 2019-12-16 2019-12-16 Electronic document classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911295117.6A CN111177375B (en) 2019-12-16 2019-12-16 Electronic document classification method and device

Publications (2)

Publication Number Publication Date
CN111177375A true CN111177375A (en) 2020-05-19
CN111177375B CN111177375B (en) 2023-06-02

Family

ID=70656523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911295117.6A Active CN111177375B (en) 2019-12-16 2019-12-16 Electronic document classification method and device

Country Status (1)

Country Link
CN (1) CN111177375B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191134A (en) * 2021-05-31 2021-07-30 平安科技(深圳)有限公司 Document quality verification method, device, equipment and medium based on attention mechanism
CN113505579A (en) * 2021-06-03 2021-10-15 北京达佳互联信息技术有限公司 Document processing method and device, electronic equipment and storage medium
CN114880462A (en) * 2022-02-25 2022-08-09 北京百度网讯科技有限公司 Medical document analysis method, device, equipment and storage medium
CN116701303A (en) * 2023-07-06 2023-09-05 浙江档科信息技术有限公司 Electronic file classification method, system and readable storage medium based on deep learning

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030172357A1 (en) * 2002-03-11 2003-09-11 Kao Anne S.W. Knowledge management using text classification
KR20080045413A (en) * 2006-11-20 2008-05-23 한국전자통신연구원 Method for predicting phrase break using static/dynamic feature and text-to-speech system and method based on the same
US20140156567A1 (en) * 2012-12-04 2014-06-05 Msc Intellectual Properties B.V. System and method for automatic document classification in ediscovery, compliance and legacy information clean-up
CN106126734A (en) * 2016-07-04 2016-11-16 北京奇艺世纪科技有限公司 The sorting technique of document and device
CN107833603A (en) * 2017-11-13 2018-03-23 医渡云(北京)技术有限公司 Electronic medical record document sorting technique, device, electronic equipment and storage medium
WO2018219198A1 (en) * 2017-06-02 2018-12-06 腾讯科技(深圳)有限公司 Man-machine interaction method and apparatus, and man-machine interaction terminal
CN109388712A (en) * 2018-09-21 2019-02-26 平安科技(深圳)有限公司 A kind of trade classification method and terminal device based on machine learning
CN109684272A (en) * 2018-12-29 2019-04-26 国家电网有限公司 Document storage method, system and terminal device
CN109840541A (en) * 2018-12-05 2019-06-04 国网辽宁省电力有限公司信息通信分公司 A kind of network transformer Fault Classification based on XGBoost
CN109858248A (en) * 2018-12-26 2019-06-07 中国科学院信息工程研究所 Malice Word document detection method and device
CN110021439A (en) * 2019-03-07 2019-07-16 平安科技(深圳)有限公司 Medical data classification method, device and computer equipment based on machine learning
CN110209812A (en) * 2019-05-07 2019-09-06 北京地平线机器人技术研发有限公司 File classification method and device
CN110348580A (en) * 2019-06-18 2019-10-18 第四范式(北京)技术有限公司 Construct the method, apparatus and prediction technique, device of GBDT model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
中国环境监测总站: "《二、GDBT原理》", 《环境空气质量预报预警方法技术指南 第2版》 *
牟娜: "《数据挖掘技术在道路阻抗函数问题中的应用研究》", 《中国优秀硕士学位论文全文数据库-工程科技II辑》 *
董祥和;: "基于情感特征向量空间模型的中文商品评论倾向分类算法", 计算机应用与软件 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191134A (en) * 2021-05-31 2021-07-30 平安科技(深圳)有限公司 Document quality verification method, device, equipment and medium based on attention mechanism
CN113505579A (en) * 2021-06-03 2021-10-15 北京达佳互联信息技术有限公司 Document processing method and device, electronic equipment and storage medium
CN114880462A (en) * 2022-02-25 2022-08-09 北京百度网讯科技有限公司 Medical document analysis method, device, equipment and storage medium
CN116701303A (en) * 2023-07-06 2023-09-05 浙江档科信息技术有限公司 Electronic file classification method, system and readable storage medium based on deep learning
CN116701303B (en) * 2023-07-06 2024-03-12 浙江档科信息技术有限公司 Electronic file classification method, system and readable storage medium based on deep learning

Also Published As

Publication number Publication date
CN111177375B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN108399228B (en) Article classification method and device, computer equipment and storage medium
CN111581976B (en) Medical term standardization method, device, computer equipment and storage medium
CN111177375B (en) Electronic document classification method and device
CN110909725A (en) Method, device and equipment for recognizing text and storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN112084381A (en) Event extraction method, system, storage medium and equipment
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN111126065A (en) Information extraction method and device for natural language text
CN111460170B (en) Word recognition method, device, terminal equipment and storage medium
WO2022222300A1 (en) Open relationship extraction method and apparatus, electronic device, and storage medium
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
CN111309916B (en) Digest extracting method and apparatus, storage medium, and electronic apparatus
CN110210028A (en) For domain feature words extracting method, device, equipment and the medium of speech translation text
CN110929520A (en) Non-named entity object extraction method and device, electronic equipment and storage medium
CN114358001A (en) Method for standardizing diagnosis result, and related device, equipment and storage medium thereof
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN111241271B (en) Text emotion classification method and device and electronic equipment
Zhang et al. A structural SVM approach for reference parsing
Wong et al. iSentenizer-μ: Multilingual sentence boundary detection model
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN110263345A (en) Keyword extracting method, device and storage medium
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN115146025A (en) Question and answer sentence classification method, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant