CN111177375B - Electronic document classification method and device - Google Patents

Electronic document classification method and device

Info

Publication number
CN111177375B
Authority
CN
China
Prior art keywords
document
feature
classified
preset
electronic
Prior art date
Legal status
Active
Application number
CN201911295117.6A
Other languages
Chinese (zh)
Other versions
CN111177375A
Inventor
杨宝山
强晟
Current Assignee
Yidu Cloud Beijing Technology Co Ltd
Original Assignee
Yidu Cloud Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Yidu Cloud Beijing Technology Co Ltd filed Critical Yidu Cloud Beijing Technology Co Ltd
Priority to CN201911295117.6A
Publication of CN111177375A
Application granted
Publication of CN111177375B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H 10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of electronic document processing and provides an electronic document classification method and device. The method comprises the following steps: performing word segmentation on an electronic document to be classified to obtain features to be extracted; matching the features to be extracted against a feature extraction model to obtain a feature vector corresponding to the electronic document to be classified; and processing the feature vector with a machine learning classification algorithm to classify the electronic document to be classified corresponding to the feature vector. By segmenting the electronic document to be classified, obtaining a feature vector through feature extraction, and processing the feature vector with a machine learning classification algorithm, the invention classifies electronic documents in a way that fully accounts for the processing of complex electronic documents, effectively improving the accuracy of document classification and, in turn, the accuracy of subsequent electronic medical record structuring.

Description

Electronic document classification method and device
Technical Field
The invention belongs to the technical field of electronic document processing, and particularly relates to an electronic document classification method and device.
Background
A significant proportion of medical data consists of CDA (Clinical Document Architecture) documents recorded in natural language, among which the electronic medical record (Electronic Medical Record, EMR) is a particularly important type. An electronic medical record document is the digitized information, in the form of text, symbols, charts, graphs, data, and images, that medical staff generate with a medical information system during clinical activity; such records can be transmitted and reproduced, and are stored and managed by information technology. As electronic medical record documents become increasingly widespread, a large amount of medical data is continuously accumulating in this form.
In the era of big data, big data techniques can be used to convert electronic medical record documents into a unified data form, breaking down the barriers caused by data differences within and between hospitals so that more valuable medical information can be mined. Classifying electronic medical record documents is an important link in their production and structuring, and accurate classification helps improve the accuracy of subsequent electronic medical record structuring.
At present, document classification typically relies on supervised learning to train a classification model. However, because the characteristics of electronic medical record documents are quite complex, classification results obtained purely through supervised learning are not very reliable: the performance of the trained classification model is limited, and the accuracy of electronic medical record document classification is therefore low.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a method, an apparatus, a terminal device, and a computer readable storage medium for classifying electronic documents, so as to solve the technical problem that the accuracy of classifying electronic medical record documents in the prior art is not high.
A first aspect of an embodiment of the present invention provides an electronic document classification method, including:
performing word segmentation on an electronic document to be classified to obtain features to be extracted;
matching the features to be extracted according to a feature extraction model to obtain feature vectors corresponding to the electronic documents to be classified;
and processing the feature vector by adopting a machine learning classification algorithm to classify the electronic document to be classified corresponding to the feature vector.
A second aspect of an embodiment of the present invention provides an electronic document classification apparatus, including:
the word segmentation module is used for segmenting the electronic document to be classified to obtain features to be extracted;
the feature vector acquisition module is used for matching the features to be extracted according to a feature extraction model so as to acquire feature vectors corresponding to the electronic documents to be classified;
and the classification module is used for processing the feature vectors by adopting a machine learning classification algorithm so as to classify the electronic documents to be classified corresponding to the feature vectors.
A third aspect of the embodiments of the present invention provides a terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, characterized in that said processor implements the steps of the method as described above when said computer program is executed.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects: the electronic document to be classified is segmented into words, a feature vector is obtained through feature extraction, and the feature vector is processed with a machine learning classification algorithm to classify the document. This fully accounts for the processing of complex electronic documents, avoids classifying electronic documents purely through supervised learning, effectively improves the accuracy of document classification, and in turn improves the accuracy of subsequent electronic medical record structuring.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and a person skilled in the art could obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an implementation flow of an electronic document classification method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of obtaining feature vectors in the electronic document classification method according to the embodiment of the invention;
FIG. 3 is a schematic flow chart of document classification in the electronic document classification method according to the embodiment of the invention;
FIG. 4 is a schematic flow chart of constructing a document classification model in the electronic document classification method according to the embodiment of the invention;
FIG. 5 is a schematic flow chart of obtaining probability that a preset feature vector belongs to each document type in the electronic document classification method according to the embodiment of the invention;
FIG. 6 is a second schematic diagram of an implementation flow of the electronic document classification method according to the embodiment of the present invention;
FIG. 7 is a first schematic diagram of an electronic document classification apparatus according to an embodiment of the present invention;
FIG. 8 is a second schematic diagram of an electronic document classification apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a terminal device provided in an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation rather than limitation, specific details such as particular system architectures and techniques are set forth in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to illustrate the technical solution of the present invention, specific examples are described below.
With the increasing popularity of electronic medical record documents, a large amount of medical data is stored in the form of electronic medical record documents, which typically include documents recorded in natural language. Under the background of big data age, the technical means of big data is used for converting the data of the electronic medical record file, and the electronic medical record file is produced into a unified data form, so that the barriers of data difference in hospitals or among hospitals can be broken, and more valuable medical information can be mined.
Before steps such as structuring the electronic medical record documents, subdividing the documents into categories is an important task that reduces the difficulty of the subsequent structuring process and improves data quality. Document classification means sorting a large number of documents so that documents in the same class share similar structural characteristics, where the specific characteristics are set according to business requirements. Because patients are seen at different stages and for different purposes, a hospital's electronic medical record documents carry different names and meanings, such as "admission record" and "discharge record". Hospital data systems rarely mark these documents with a clear categorization, i.e. the electronic medical record documents are unclassified, so an electronic medical record document classification technique is needed to classify them.
At present, however, there is no document classification technology dedicated to electronic medical record documents. Existing document classification technologies train a classification model with supervised learning (a process of adjusting the parameters of a classification model with samples of known types until the required performance is reached). Because the characteristics of electronic medical record documents are very complex, classification results produced by a model trained purely with supervised learning are not very reliable, so the accuracy of electronic medical record classification remains low.
This embodiment provides a new electronic document classification method that can be applied to the classification of electronic medical record documents as well as other types of electronic documents, and can effectively improve the accuracy of electronic document classification.
Fig. 1 is a flowchart of a method for classifying electronic documents according to the present embodiment. As shown in fig. 1, the electronic document classification method provided in the present embodiment includes:
step S10: and segmenting the electronic documents to be classified to obtain the characteristics to be extracted.
The electronic document may be of any type, for example an electronic medical record document from any of various hospitals; this embodiment uses electronic medical record documents as the example. An electronic medical record document usually records a patient's treatment information as written by a hospital's doctors, who may follow a unified standard or their own customary practice. The resulting documents differ in type according to the stage and purpose of the patient's visit, such as "admission record" or "discharge record", so the electronic medical record documents need to be classified. Before classification, the content recorded in the document must be identified so that feature information can be extracted.
The electronic medical record document can be segmented with a word segmentation algorithm, yielding a number of segmented words, i.e. the features to be extracted. During word segmentation, punctuation-based segmentation logic is used to split the document, and the word segmentation algorithms that may be adopted include: grammar- and rule-based segmentation, understanding-based segmentation, and statistics-based segmentation.
Grammar- and rule-based word segmentation, also called mechanical word segmentation, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to some strategy; if a substring is found in the dictionary, the match succeeds and a word is identified. By scanning direction, string-matching segmentation can be divided into forward matching and reverse matching; by the length preferred during matching, into maximum matching and minimum matching; and by whether it is combined with part-of-speech tagging, into plain segmentation methods and integrated methods that combine segmentation and tagging.
Common grammar- and rule-based methods include forward maximum matching (scanning left to right), reverse maximum matching (scanning right to left), and minimum segmentation (minimizing the number of words cut from each sentence). Forward maximum matching takes a substring of limited length from the text and matches it against the words in the dictionary; if the match succeeds, it moves on to the next round until the whole string is processed, otherwise it removes one character from the end of the substring and tries again, repeating these steps. Reverse maximum matching works analogously.
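As an illustration only, the following is a minimal Python sketch of forward maximum matching; the function name, the toy dictionary, and the maximum word length of 5 are assumptions for the example rather than values from the patent.

```python
# A minimal sketch of forward maximum matching, assuming a plain Python set
# as the dictionary and a fixed maximum word length.
def forward_max_match(text, dictionary, max_len=5):
    words = []
    i = 0
    while i < len(text):
        # Start with the longest candidate substring allowed at position i.
        length = min(max_len, len(text) - i)
        while length > 1 and text[i:i + length] not in dictionary:
            length -= 1  # shrink from the end until a dictionary word is found
        words.append(text[i:i + length])  # single characters fall through as-is
        i += length
    return words

# Example: segmenting a short admission-record phrase with a toy dictionary.
dictionary = {"患者", "入院", "记录", "主诉"}
print(forward_max_match("患者入院记录", dictionary))  # ['患者', '入院', '记录']
```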
Understanding-based word segmentation recognizes words by having the computer simulate how a person understands a sentence. Its basic idea is to perform syntactic and semantic analysis alongside segmentation and to use the syntactic and semantic information to resolve ambiguity.
Statistics-based word segmentation: formally, words are stable combinations of characters, so the more often adjacent characters appear together, the more likely they form a word. The frequency or probability with which adjacent characters co-occur therefore reflects the credibility of the resulting word, and the co-occurrence frequencies of character combinations in a corpus are counted to compute their mutual information. Mutual information shows how tightly two Chinese characters are bound, and when it exceeds a threshold the character group may be taken to form a word. Because this method only counts character-group frequencies in the corpus and needs no dictionary, it is also called dictionary-free segmentation or statistical word extraction. In practical applications, a statistical segmentation system can use a basic segmentation dictionary for string matching while using statistics to recognize new words, combining string-frequency statistics with string matching; this keeps the speed and efficiency of dictionary matching while gaining the ability to recognize new words from context and resolve ambiguity automatically.
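The statistical idea can be sketched briefly as well. The snippet below, assuming Python and a small raw-text corpus, counts adjacent-character pairs and scores them with pointwise mutual information; the PMI threshold and the function name are illustrative choices, not values prescribed by the patent.

```python
# A minimal sketch of statistics-based candidate-word extraction: count
# adjacent-character pairs in a corpus and keep those whose mutual
# information exceeds a threshold.
import math
from collections import Counter

def candidate_words(corpus, threshold=3.0):
    chars = Counter()
    pairs = Counter()
    for line in corpus:
        chars.update(line)
        pairs.update(line[i:i + 2] for i in range(len(line) - 1))
    total_chars = sum(chars.values())
    total_pairs = sum(pairs.values())
    candidates = {}
    for pair, n in pairs.items():
        p_pair = n / total_pairs
        p_a = chars[pair[0]] / total_chars
        p_b = chars[pair[1]] / total_chars
        pmi = math.log(p_pair / (p_a * p_b))  # tightness of the character pair
        if pmi > threshold:
            candidates[pair] = pmi
    return candidates
```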
After word segmentation by a word segmentation algorithm, the electronic medical record document can be split into a plurality of words, and each word is a feature to be extracted. It will be appreciated that not all of the content in an electronic medical record document is useful for document classification, and thus further processing of the features to be extracted is required.
Step S20: match the features to be extracted against a feature extraction model to obtain the feature vector corresponding to the electronic document to be classified.
Since the input to the machine learning classification algorithm must be a vectorized representation of the electronic document, a feature vector of the electronic document to be classified has to be constructed. Each document type has its own templates or keywords, which can form a feature set for that document type. A corresponding feature extraction model can therefore be constructed for each document type, and the feature extraction model is then used to process the features to be extracted of the electronic document to be classified, yielding a feature vector.
Referring to fig. 2, in the present embodiment, step S20 may include the following steps:
step S201: matching the to-be-extracted features corresponding to the to-be-classified electronic documents with feature set vectors in a feature extraction model; the feature extraction model comprises at least one document type and a feature set vector corresponding to the document type, and is constructed according to preset electronic document knowledge.
As described above, each document type has its own template or keyword distribution rules, and electronic medical record documents follow corresponding specifications; these templates and specifications constitute the preset electronic document knowledge, also referred to as "external knowledge". From this preset knowledge a feature extraction model for electronic medical record documents can be constructed. For example, a first document type contains a number of features that form a feature set; to obtain a vectorized representation, the features in the set are ordered into a feature set vector, each component of which corresponds to one feature of that document type. The same is done for the second document type, and so on. By working through all document types in this way, the features of every document type are organized into feature set vectors, and all document types together with their feature set vectors constitute the feature extraction model.
For each electronic medical record document to be classified, word segmentation yields at least one feature to be extracted. Naturally, not every feature to be extracted will find a match in a feature set vector, nor does every document to be classified contain all the features of its document type; the match between the features to be extracted and a feature set vector is therefore determined by matching each feature to be extracted against the components of the feature set vector of each document type.
If the feature to be extracted is matched with a component in the feature set vector, then:
step S202: in the feature vector corresponding to the feature to be extracted, the component value is a first preset value.
If the feature to be extracted is not matched with the components in the feature set vector, then:
step S203: in the feature vector corresponding to the feature to be extracted, the component value is a second preset value.
In this embodiment, the first preset value may be 1, and the second preset value may be 0. Of course, in other embodiments, the actual values of the first preset value and the second preset value may be set as required, and are not limited to the above case. Through the process, the feature vector corresponding to the electronic document to be classified can be obtained, so that the vectorization of the document is realized.
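Purely as an illustration, a minimal Python sketch of steps S201 to S203 with the 1/0 preset values just described is shown below; the function name, the toy feature set vector, and the sample words are assumptions for the example, not part of the patent.

```python
# A minimal sketch of building the document's binary feature vector by
# matching segmented words against one document type's feature set vector.
def build_feature_vector(segmented_words, feature_set_vector,
                         first_preset=1, second_preset=0):
    present = set(segmented_words)  # features to be extracted from the document
    # Each component takes the first preset value if the corresponding feature
    # was found among the segmented words, otherwise the second preset value.
    return [first_preset if feature in present else second_preset
            for feature in feature_set_vector]

# Example with a hypothetical feature set vector for an "admission record" type.
feature_set_vector = ["主诉", "现病史", "既往史", "出院诊断"]
words = ["患者", "主诉", "发热", "现病史"]
print(build_feature_vector(words, feature_set_vector))  # [1, 1, 0, 0]
```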
Step S30: process the feature vector with a machine learning classification algorithm to classify the electronic document to be classified corresponding to the feature vector.
After the electronic document to be classified has been vectorized, a machine learning classification algorithm can be used to process the feature vector and thereby classify the document. The type of machine learning classification algorithm may be selected as desired; for example, the gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) algorithm may be employed to classify the electronic document to be classified. Of course, other types of machine learning classification algorithms may be employed in other embodiments, without limitation.
GBDT uses an additive model (i.e. a linear combination of basis functions) to continually reduce the residuals produced during training and thereby classify the data. In each training iteration a weak classifier is generated, trained on the residuals of the classifiers from the previous round. The weak classifiers are generally required to be simple, with low variance and high bias; training continually improves the accuracy of the final classifier by reducing the bias.
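The patent does not prescribe a particular GBDT implementation. As a sketch only, the snippet below uses scikit-learn's GradientBoostingClassifier (an assumed choice) on binary feature vectors of the kind built in step S20; the feature vectors, labels, and hyperparameters are illustrative.

```python
# A minimal sketch of classifying document feature vectors with an
# off-the-shelf GBDT implementation.
from sklearn.ensemble import GradientBoostingClassifier

# Binary feature vectors built as in steps S201-S203, with known document types.
X_train = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
y_train = ["admission record", "admission record",
           "discharge record", "discharge record"]

model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1)
model.fit(X_train, y_train)

# Probability that a new document belongs to each document type (step S301),
# and the type with the highest probability (step S302).
probs = model.predict_proba([[1, 1, 1, 0]])
print(dict(zip(model.classes_, probs[0])))
print(model.predict([[1, 1, 1, 0]]))
```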
Specifically, referring to fig. 3, step S30 of the present embodiment includes:
step S301: inputting the feature vector corresponding to the electronic document to be classified into a document classification model to obtain the probability that the electronic document to be classified belongs to each document type; the document classification model is constructed according to document types of the electronic document, and comprises at least one classification regression tree, wherein each document type corresponds to one classification regression tree.
Before the feature vector of the electronic document to be classified is processed, a document classification model must be built and trained until it meets the preset performance requirement. Referring to fig. 4, in this embodiment the method for constructing the document classification model may include the following steps:
step S3011: and constructing initial classification regression trees, wherein each initial classification regression tree corresponds to one document type of the electronic document.
In constructing a document classification model, classification regression trees need to be constructed according to the number of document types, typically one for each document type. The initially constructed classification regression tree has not been trained and is therefore noted as the initial classification regression tree. In this embodiment, the classification regression tree may be selected as a classification regression tree, such as an admission record and other records, an discharge record and other records, and the like.
Step S3012: train the initial classification regression trees with preset feature vectors to obtain the probability that each preset feature vector belongs to each document type.
A preset feature vector is the feature vector of an electronic document whose document type is known; training the initial classification regression trees with such vectors makes it possible to judge whether the trained trees meet the performance requirement. After a preset feature vector is processed by the initial classification regression trees, the probability corresponding to each tree, that is, the probability that the preset feature vector belongs to each document type, is obtained.
Step S3013: obtain the residual of the preset feature vector for each document type from the probability that the preset feature vector belongs to each document type.
Step S3014: judge whether the residual meets a preset condition. The preset condition may be set as desired, for example that the residual is less than a certain preset value.
If the residual meets the preset condition, it means that the training of the initial classification regression tree already meets the preset performance requirement, and at this time:
step S3015: and determining the initial classification regression tree as a classification regression tree to construct the document classification model.
If the residual does not meet the preset condition, it means that the initial classification regression tree still needs to be trained, and thus the step S3012 needs to be returned.
The following is an example. Suppose there are three document types (K = 3) and the preset feature vector (denoted sample X) belongs to the second class. The label of sample X can then be represented by the three-dimensional vector [0, 1, 0], where 0 means the sample does not belong to that class and 1 means it does; since sample X belongs to the second class, the second component is 1 and the others are 0.
With three classes, three classification regression trees are in effect trained simultaneously in each round. The first tree is for the first class of sample X and receives the input (X, 0); the second tree is for the second class and receives (X, 1); the third tree is for the third class and receives (X, 0).
After training on sample X, three classification regression trees are generated, whose predicted values for the classes of X are f1(x), f2(x), and f3(x) respectively. In this round of training, the probabilities that sample X belongs to the first, second, and third classes are then:
P1(x) = exp(f1(x)) / (exp(f1(x)) + exp(f2(x)) + exp(f3(x)))
P2(x) = exp(f2(x)) / (exp(f1(x)) + exp(f2(x)) + exp(f3(x)))
P3(x) = exp(f3(x)) / (exp(f1(x)) + exp(f2(x)) + exp(f3(x)))
The residuals for the first, second, and third classes are then:
y1 = 0 - P1(x)
y2 = 1 - P2(x)
y3 = 0 - P3(x)
after the residual error is obtained, the value of the residual error may be compared with a preset condition. If the residuals do not meet the preset condition, the initial classification regression tree still needs to be trained, at the moment, the three residuals are used as initial values to train the classification regression tree for the second time, and the process is repeated. After m rounds of iteration, the obtained classification regression tree meets the preset performance, which means that training is finished, and the obtained document classification model can be used for processing the feature vectors.
Step S302: determine the document type of the electronic document to be classified according to the probability that it belongs to each document type.
It will be appreciated that the higher the probability that an electronic document to be classified belongs to a certain document type, the more likely the electronic document to be classified belongs to that document type, and thus the document type with the highest probability is determined as the document type of the electronic document to be classified.
Referring to fig. 5, further, in this embodiment, step S3012 may include the following steps:
step S30121: and selecting a component in the preset feature vector as a node in the initial classification regression tree according to the preset feature vector.
Step S30122: and taking the characteristic value of the component corresponding to the node in each preset characteristic vector as a candidate dividing point of the node, and acquiring a loss value.
Step S30123: and taking a characteristic value corresponding to the loss value meeting a preset condition (for example, the loss value is the smallest) as a dividing point of the node, and acquiring a predicted value of the preset characteristic vector belonging to each document type.
Step S30124: and obtaining the probability that the preset feature vector belongs to each document type according to the predicted value that the preset feature vector belongs to each document type.
For example, suppose each preset feature vector has four components, a first to a fourth, and the specific value of each component is a feature value, so a preset feature vector may be written as [first feature value, second feature value, third feature value, fourth feature value]. Suppose further that there are 3 document types, so 3 initial classification regression trees are built, and that there are also 3 preset feature vectors, where the electronic document corresponding to the first preset feature vector belongs to the second document type.
Taking the first preset feature vector as an example (the second preset feature vector and the third preset feature vector are processed in a similar manner):
construct a training sample for the first initial classification regression tree, with label 0;
construct a training sample for the second initial classification regression tree, with label 1;
construct a training sample for the third initial classification regression tree, with label 0.
Take the first component as the node and the first feature value of the first preset feature vector as the dividing point, compute the value of the loss function, and record it as the first loss value.
Take the first component as the node and the first feature value of the second preset feature vector as the dividing point, compute the value of the loss function, and record it as the second loss value.
Take the first component as the node and the first feature value of the third preset feature vector as the dividing point, compute the value of the loss function, and record it as the third loss value.
Take the second component as the node and the second feature value of the first preset feature vector as the dividing point, compute the value of the loss function, and record it as the fourth loss value.
Continuing in this way, 3 x 4 = 12 loss values are obtained. The dividing point with the smallest loss value is selected as the optimal dividing point; from this optimal dividing point the predicted values of the preset feature vector for the first, second, and third document types are obtained, and from those predicted values the probabilities that the preset feature vector belongs to the first, second, and third document types are obtained.
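A minimal sketch of this dividing-point search follows, assuming Python, a squared-error loss, and illustrative samples; with three preset feature vectors of four components it evaluates the same 3 x 4 = 12 candidate dividing points described above.

```python
# A minimal sketch of the dividing-point search: for every component and every
# preset feature vector's value of that component, split the samples and score
# the split with a squared-error loss.
def best_dividing_point(samples, labels):
    def loss(group_labels):
        # squared error against the mean prediction within a group
        if not group_labels:
            return 0.0
        mean = sum(group_labels) / len(group_labels)
        return sum((y - mean) ** 2 for y in group_labels)

    best = None
    n_components = len(samples[0])
    for component in range(n_components):
        for vector in samples:
            threshold = vector[component]  # candidate dividing point
            left = [y for x, y in zip(samples, labels) if x[component] <= threshold]
            right = [y for x, y in zip(samples, labels) if x[component] > threshold]
            total = loss(left) + loss(right)
            if best is None or total < best[0]:
                best = (total, component, threshold)
    return best  # (smallest loss, node component, optimal dividing point)

# Three preset feature vectors with four components each; labels for one tree.
samples = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]
labels = [0, 1, 0]
print(best_dividing_point(samples, labels))
```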
Referring to fig. 6, step S40: match the classified electronic document against the preset document knowledge.
To further improve the accuracy of electronic document classification, the document type assigned to an electronic document may be further checked after it is obtained. For example, the templates and specifications of electronic medical record documents constitute the preset electronic document knowledge, also referred to as "external knowledge", from which feature set vectors for the different document types can be constructed. After an electronic medical record document has been classified through the steps above, its feature vector can be matched against the feature set vector of the assigned document type to determine whether the classification result is correct. The external knowledge may be adjusted, by adding or removing entries as appropriate, to tune the degree of matching.
If the matching degree between the classified electronic document and the preset document knowledge meets the preset requirement, the classification is correct; in that case:
Step S50: determine that the electronic document to be classified is correctly classified.
If the matching degree between the classified electronic document and the preset document knowledge does not meet the preset requirement (for example, the requirement that the matching degree be higher than a certain preset value), the classification is wrong; in that case:
step S60: and checking the electronic documents to be classified. At the moment, manual verification can be performed to correct the document type of the electronic document to be classified, so that the accuracy degree of document classification is ensured.
The electronic document classification method provided by this embodiment has the following beneficial effects:
(1) The electronic document to be classified is segmented with a word segmentation algorithm, a feature vector is obtained through feature extraction, and the feature vector is processed with a machine learning classification algorithm to classify the document. This fully accounts for the processing of complex electronic documents, avoids classifying electronic documents purely through supervised learning, effectively improves the accuracy of document classification, and in turn improves the accuracy of subsequent electronic medical record structuring.
(2) The document classification result is checked against external knowledge, and inaccurately classified electronic documents are corrected through manual verification, effectively ensuring the accuracy of document classification.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
Referring to fig. 7, the present embodiment further provides an electronic document classification apparatus, which includes a word segmentation module 71, a feature vector obtaining module 72, and a classification module 73. The word segmentation module 71 is used for segmenting the electronic document to be classified to obtain the feature to be extracted; the feature vector obtaining module 72 is configured to match the feature to be extracted according to a feature extraction model, so as to obtain a feature vector corresponding to the electronic document to be classified; the classification module 73 is configured to process the feature vector by using a machine learning classification algorithm, so as to classify the electronic document to be classified corresponding to the feature vector.
Referring to fig. 8, the electronic document classification apparatus further includes a matching module 74, a confirmation module 75, and a verification module 76. The matching module 74 is configured to match the classified electronic document to be classified against preset document knowledge; the confirmation module 75 is configured to determine that the electronic document to be classified is correctly classified when the matching degree between the classified electronic document and the preset document knowledge meets the preset requirement; and the verification module 76 is configured to verify the electronic document to be classified when that matching degree does not meet the preset requirement.
Fig. 9 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 9, the terminal device 8 of this embodiment includes: a processor 80, a memory 81 and a computer program 82, such as an electronic document classification program, stored in the memory 81 and executable on the processor 80. The processor 80, when executing the computer program 82, implements the steps of the respective electronic document classification method embodiments described above, such as steps S10 to S30 shown in fig. 1. Alternatively, the processor 80, when executing the computer program 82, performs the functions of the modules/units of the apparatus embodiments described above, such as the functions of the modules 71 to 73 shown in fig. 7.
By way of example, the computer program 82 may be partitioned into one or more modules/units that are stored in the memory 81 and executed by the processor 80 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which describe the execution of the computer program 82 in the terminal device 8.
The terminal device 8 may be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server, etc. The terminal device may include, but is not limited to, a processor 80, a memory 81. It will be appreciated by those skilled in the art that fig. 9 is merely an example of the terminal device 8 and does not constitute a limitation of the terminal device 8, and may include more or less components than illustrated, or may combine certain components, or different components, e.g., the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor 80 may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the terminal device 8, such as a hard disk or memory of the terminal device 8. The memory 81 may also be an external storage device of the terminal device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the terminal device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the terminal device 8. The memory 81 is used for storing the computer program as well as other programs and data required by the terminal device. The memory 81 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the computer readable medium contains content that can be appropriately scaled according to the requirements of jurisdictions in which such content is subject to legislation and patent practice, such as in certain jurisdictions in which such content is subject to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (9)

1. An electronic document classification method, comprising:
performing word segmentation on an electronic document to be classified to obtain features to be extracted;
matching the features to be extracted according to a feature extraction model to obtain feature vectors corresponding to the electronic documents to be classified; the feature extraction model comprises at least one document type and a feature set vector corresponding to the document type, and is constructed according to preset document knowledge, wherein the preset document knowledge comprises a template and/or a keyword distribution rule corresponding to the document;
processing the feature vector by adopting a machine learning classification algorithm to classify the electronic document to be classified corresponding to the feature vector;
the matching of the feature to be extracted according to the feature extraction model to obtain the feature vector corresponding to the electronic document to be classified includes:
matching the to-be-extracted features corresponding to the to-be-classified electronic documents with feature set vectors in a feature extraction model;
if the feature to be extracted is matched with a component in the feature set vector, a component value corresponding to the component in the feature vector corresponding to the feature to be extracted is a first preset value;
if the feature to be extracted is not matched with the component in the feature set vector, the component value corresponding to the component in the feature vector corresponding to the feature to be extracted is a second preset value.
2. The electronic document classification method of claim 1, wherein the word segmentation of the electronic document to be classified to obtain the feature to be extracted comprises:
and performing word segmentation on the electronic documents to be classified according to a word segmentation algorithm to obtain features to be extracted, wherein the word segmentation algorithm comprises a word segmentation method based on grammar and rules, an understanding-based word segmentation method and a statistics-based word segmentation method.
3. The method for classifying electronic documents according to claim 1, wherein the processing the feature vectors by using a machine learning classification algorithm to classify the electronic documents to be classified corresponding to the feature vectors includes:
inputting the feature vector corresponding to the electronic document to be classified into a document classification model to obtain the probability that the electronic document to be classified belongs to each document type; the document classification model is constructed according to document types of the electronic document, and comprises at least one classification regression tree, wherein each document type corresponds to one classification regression tree;
and determining the document type of the electronic document to be classified according to the probability that the electronic document to be classified belongs to each document type.
4. The electronic document classification method of claim 3, wherein the manner in which the document classification model is constructed comprises:
constructing initial classification regression trees, wherein each initial classification regression tree corresponds to one document type of an electronic document;
training the initial classification regression tree by adopting a preset feature vector to obtain the probability that the preset feature vector belongs to each document type;
obtaining residual errors of the preset feature vectors corresponding to the document types according to the probability that the preset feature vectors belong to the document types;
judging whether the residual error meets a preset condition or not;
if the residual meets the preset condition, determining the initial classification regression tree as a classification regression tree to construct the document classification model;
and if the residual error does not meet the preset condition, returning to the training step of the initial classification regression tree by adopting the preset feature vector.
5. The method of claim 4, wherein training the initial classification regression tree with a predetermined feature vector to obtain a probability that the predetermined feature vector belongs to each document type comprises:
selecting a component in the preset feature vector as a node in the initial classification regression tree according to the preset feature vector;
taking the characteristic value of the component corresponding to the node in each preset characteristic vector as a candidate dividing point of the node, and acquiring a loss value;
taking the characteristic value corresponding to the loss value meeting the preset condition as a dividing point of the node, and acquiring a predicted value of each document type of the preset characteristic vector;
and obtaining the probability that the preset feature vector belongs to each document type according to the predicted value that the preset feature vector belongs to each document type.
6. The method for classifying electronic documents according to claim 1, wherein after the step of classifying the electronic documents to be classified corresponding to the feature vectors by processing the feature vectors using a machine learning classification algorithm, the method further comprises:
matching the classified electronic documents to be classified with preset document knowledge;
if the matching degree of the classified electronic documents to be classified and the preset document knowledge meets the preset requirement, determining that the electronic documents to be classified are correctly classified;
and if the matching degree of the classified electronic documents to be classified and the preset document knowledge does not meet the preset requirement, checking the electronic documents to be classified.
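As one hypothetical reading of claim 6, the matching degree could be the fraction of the predicted document type's preset keywords that appear in the document, compared against a threshold; the `verify_classification` helper, the keyword rule and the 0.6 threshold below are assumptions for illustration only.

```python
def verify_classification(document_text, predicted_type, knowledge, threshold=0.6):
    """Matching degree = share of the predicted type's preset keywords found in
    the document; below the threshold the document is flagged for checking."""
    keywords = knowledge.get(predicted_type, [])
    if not keywords:
        return False                       # no knowledge to match -> flag for checking
    hits = sum(1 for kw in keywords if kw in document_text)
    matching_degree = hits / len(keywords)
    return matching_degree >= threshold    # True: classification treated as correct

# Toy knowledge base: keyword rules per document type (made up for illustration).
knowledge = {"discharge summary": ["admission", "diagnosis", "discharge"]}
print(verify_classification("admission note with a working diagnosis",
                            "discharge summary", knowledge))
```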
7. An electronic document classification apparatus, comprising:
the word segmentation module is used for segmenting the electronic documents to be classified to obtain features to be extracted;
the feature vector acquisition module is used for matching the features to be extracted according to a feature extraction model so as to acquire the feature vector corresponding to the electronic document to be classified; the feature extraction model comprises at least one document type and a feature set vector corresponding to each document type, and is constructed according to preset document knowledge, wherein the preset document knowledge comprises a template and/or a keyword distribution rule corresponding to the document; the matching of the features to be extracted according to the feature extraction model to acquire the feature vector corresponding to the electronic document to be classified includes:
matching the features to be extracted corresponding to the electronic document to be classified with the feature set vectors in the feature extraction model;
if a feature to be extracted matches a component of the feature set vector, setting the component value of the corresponding component in the feature vector to a first preset value;
and if the feature to be extracted does not match the component of the feature set vector, setting the component value of the corresponding component in the feature vector to a second preset value;
and the classification module is used for processing the feature vectors by adopting a machine learning classification algorithm so as to classify the electronic documents to be classified corresponding to the feature vectors.
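The binary matching described for the feature vector acquisition module can be sketched as follows, assuming the first preset value is 1 and the second is 0; the `build_feature_vector` helper and the toy tokens are illustrative, not taken from the patent.

```python
def build_feature_vector(tokens, feature_set_vector, matched=1, unmatched=0):
    """Each component of the feature-set vector found among the segmented tokens
    gets the first preset value; every other component gets the second."""
    token_set = set(tokens)
    return [matched if component in token_set else unmatched
            for component in feature_set_vector]

# Toy usage: tokens from word segmentation matched against a feature-set vector.
print(build_feature_vector(["complaint", "fever", "cough"],
                           ["complaint", "fever", "surgery"]))   # -> [1, 1, 0]
```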
8. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
CN201911295117.6A 2019-12-16 2019-12-16 Electronic document classification method and device Active CN111177375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911295117.6A CN111177375B (en) 2019-12-16 2019-12-16 Electronic document classification method and device

Publications (2)

Publication Number Publication Date
CN111177375A (en) 2020-05-19
CN111177375B (en) 2023-06-02

Family

ID=70656523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911295117.6A Active CN111177375B (en) 2019-12-16 2019-12-16 Electronic document classification method and device

Country Status (1)

Country Link
CN (1) CN111177375B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191134B (en) * 2021-05-31 2023-02-03 平安科技(深圳)有限公司 Document quality verification method, device, equipment and medium based on attention mechanism
CN113505579A (en) * 2021-06-03 2021-10-15 北京达佳互联信息技术有限公司 Document processing method and device, electronic equipment and storage medium
CN114880462A (en) * 2022-02-25 2022-08-09 北京百度网讯科技有限公司 Medical document analysis method, device, equipment and storage medium
CN116701303B (en) * 2023-07-06 2024-03-12 浙江档科信息技术有限公司 Electronic file classification method, system and readable storage medium based on deep learning

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US7673234B2 (en) * 2002-03-11 2010-03-02 The Boeing Company Knowledge management using text classification
KR100835374B1 (en) * 2006-11-20 2008-06-04 한국전자통신연구원 Method for predicting phrase break using static/dynamic feature and Text-to-Speech System and method based on the same
US9235812B2 (en) * 2012-12-04 2016-01-12 Msc Intellectual Properties B.V. System and method for automatic document classification in ediscovery, compliance and legacy information clean-up
CN106126734B (en) * 2016-07-04 2019-06-28 北京奇艺世纪科技有限公司 The classification method and device of document
CN107833603B (en) * 2017-11-13 2021-03-23 医渡云(北京)技术有限公司 Electronic medical record document classification method and device, electronic equipment and storage medium
CN109388712A (en) * 2018-09-21 2019-02-26 平安科技(深圳)有限公司 A kind of trade classification method and terminal device based on machine learning
CN109840541A (en) * 2018-12-05 2019-06-04 国网辽宁省电力有限公司信息通信分公司 A kind of network transformer Fault Classification based on XGBoost
CN110021439B (en) * 2019-03-07 2023-01-24 平安科技(深圳)有限公司 Medical data classification method and device based on machine learning and computer equipment
CN110209812B (en) * 2019-05-07 2022-04-22 北京地平线机器人技术研发有限公司 Text classification method and device
CN114819186A (en) * 2019-06-18 2022-07-29 第四范式(北京)技术有限公司 Method and device for constructing GBDT model, and prediction method and device

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
WO2018219198A1 (en) * 2017-06-02 2018-12-06 腾讯科技(深圳)有限公司 Man-machine interaction method and apparatus, and man-machine interaction terminal
CN109858248A (en) * 2018-12-26 2019-06-07 中国科学院信息工程研究所 Malice Word document detection method and device
CN109684272A (en) * 2018-12-29 2019-04-26 国家电网有限公司 Document storage method, system and terminal device

Non-Patent Citations (1)

Title
Sentiment-orientation classification algorithm for Chinese product reviews based on a sentiment feature vector space model; 董祥和; Computer Applications and Software (Issue 08); full text *

Also Published As

Publication number Publication date
CN111177375A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN111177375B (en) Electronic document classification method and device
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
WO2021072885A1 (en) Method and apparatus for recognizing text, device and storage medium
CN111126065B (en) Information extraction method and device for natural language text
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
US11113470B2 (en) Preserving and processing ambiguity in natural language
CN111460170B (en) Word recognition method, device, terminal equipment and storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN109902290B (en) Text information-based term extraction method, system and equipment
CN110210028A (en) For domain feature words extracting method, device, equipment and the medium of speech translation text
CN111309916B (en) Digest extracting method and apparatus, storage medium, and electronic apparatus
CN114416979A (en) Text query method, text query equipment and storage medium
CN114358001A (en) Method for standardizing diagnosis result, and related device, equipment and storage medium thereof
CN114398968B (en) Method and device for labeling similar customer-obtaining files based on file similarity
CN114970514A (en) Artificial intelligence based Chinese word segmentation method, device, computer equipment and medium
CN112528653B (en) Short text entity recognition method and system
CN113590811A (en) Text abstract generation method and device, electronic equipment and storage medium
Chen et al. Encoding implicit relation requirements for relation extraction: A joint inference approach
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN109446321B (en) Text classification method, text classification device, terminal and computer readable storage medium
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN115840808A (en) Scientific and technological project consultation method, device, server and computer-readable storage medium
CN111916169B (en) Traditional Chinese medicine electronic medical record structuring method and terminal
CN117573956B (en) Metadata management method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant