CN114117038A

CN114117038A - Document classification method, device and system and electronic equipment

Info

Publication number: CN114117038A
Application number: CN202111307563.1A
Authority: CN
Inventors: 宋瑞霞; 金友兵
Original assignee: Nanjing Joyocloud Information Technology Co ltd
Current assignee: Nanjing Joyocloud Information Technology Co ltd
Priority date: 2021-11-05
Filing date: 2021-11-05
Publication date: 2022-03-01

Abstract

The application discloses a document classification method, device, system and electronic equipment. The document classification method comprises: comparing the similarity between candidate words and preset classifications to obtain a comparison result, wherein the candidate words are extracted from the document and each preset classification comprises a preset classification label; and, when the comparison result shows that the similarity between the candidate words and any one of the preset classifications reaches a preset standard, determining that the document belongs to the preset classification whose similarity reaches the preset standard. The method and device can solve the problem that manual document classification is too inefficient.

Description

Document classification method, device and system and electronic equipment

Technical Field

The application relates to the technical field of computer application, in particular to a document classification method, a device, a system and electronic equipment.

Background

With the development of mass storage technology, the amount of data in individuals and enterprises is rapidly increasing. Electronic document storage and management has also become a significant problem.

A good classification scheme automatically groups similar documents, and multiple versions of the same document, into one class while clearly distinguishing different documents, which makes the documents easier for users to manage, search and read.

At present, documents are generally sorted and classified manually, which is inefficient. Existing automatic classification is mostly rule-based, and creating and maintaining the rules is itself troublesome.

Disclosure of Invention

The present application is proposed to solve the above-mentioned technical problems. The embodiment of the application provides a document classification method, a document classification device, a document classification system and electronic equipment, and can solve the problem that manual document classification efficiency is too low.

According to an aspect of the present application, there is provided a document classification method including: comparing the similarity of the candidate words and a preset classification to obtain a comparison result; wherein the candidate words are extracted from the document; the preset classification comprises a preset classification label; and when the comparison result shows that the similarity between the candidate word and any one of the preset classifications reaches a preset standard, determining that the document belongs to the preset classification with the similarity reaching the preset standard.

In one embodiment, the document classification method further includes: acquiring the preset classification labels, wherein the preset classification labels comprise a plurality of hierarchy levels and the number of levels is preset; and layering the candidate words according to the number of hierarchy levels of the preset classification labels to form hierarchy labels of the candidate words, wherein the hierarchy label of each level is characterized by the candidate words of that level. Comparing the similarity of the candidate words and the preset classification to obtain a comparison result then comprises: comparing the similarity between the hierarchy labels of the candidate words and the corresponding levels of the preset classification labels to obtain the comparison result.

In an embodiment, after comparing the similarity between the hierarchical label of the candidate word and the hierarchy corresponding to the preset classification label to obtain a comparison result, the document classification method further includes: and when the comparison result shows that the similarity of the hierarchy label of the candidate word and the hierarchy corresponding to the preset classification label reaches the preset similarity, determining that the document belongs to the corresponding hierarchy of the preset classification label.

In an embodiment, after comparing the similarity between the hierarchical label of the candidate word and the hierarchy corresponding to the preset classification label to obtain a comparison result, the document classification method further includes: and when the comparison result shows that the similarity between the hierarchy label of the candidate word and the hierarchy corresponding to the preset classification label is smaller than the preset similarity, creating a new hierarchy classification label according to the hierarchy label of the candidate word, and adding the new hierarchy classification label to the corresponding hierarchy of the preset classification label.

In one embodiment, the method for extracting the candidate word from the document comprises the following steps: and extracting a plurality of keywords as the candidate words according to the title of the document and the text content of the document.

In an embodiment, the layering the candidate words according to the hierarchical number of preset classification labels, and forming the hierarchical labels of the candidate words includes: and layering through the superior-inferior relation or the parallel relation between the candidate words to form a hierarchical label of the candidate words.

In an embodiment, the comparing the similarity between the hierarchical label of the candidate word and the hierarchy corresponding to the preset classification label to obtain a comparison result includes: and comparing the similarity of the candidate words in the hierarchy labels of the candidate words with the vocabulary in the hierarchy corresponding to the preset classification label to obtain a comparison result.

According to another aspect of the present application, there is provided a document classification apparatus including: the comparison module is used for comparing the similarity of the candidate words and the preset classification to obtain a comparison result; wherein the candidate words are extracted from the document; the preset classification comprises a preset classification label; and the determining module is used for determining that the document belongs to the preset classification with the similarity reaching the preset standard when the comparison result shows that the similarity of the candidate word and any one of the preset classifications reaches the preset standard.

According to another aspect of the present application, there is provided a document classification system including: a document input unit for receiving an input document; the document extractor is used for identifying the title and the text content of the document; the candidate word relation extractor is used for extracting the title of the document and the keywords of the text content as candidate words and layering the candidate words; the similarity calculator is used for calculating the similarity between the candidate words and a preset classification; the classification recorder is used for recording the hierarchy relation between the preset classification label and the preset classification label; a classification outputter for outputting classification information of the document; and a controller for executing the document classification method according to any one of the above embodiments.

According to another aspect of the present application, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is configured to execute the document classification method according to any of the embodiments.

According to the document classification method, device, system and electronic equipment of the present application, keywords are extracted from a document as candidate words and compared with the keywords of the currently stored preset classifications to compute a similarity, and the classification of the document is determined from that similarity. The calculation is simple and efficient and the classification is accurate, so documents are classified automatically, classification efficiency is improved, the likelihood of errors introduced by manual classification is reduced, and the problem of inefficient manual document classification is solved.

Drawings

The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.

Fig. 1 is a diagram of a system to which the present application is applicable.

FIG. 2 is a flowchart illustrating a document classification method according to an exemplary embodiment of the present application.

FIG. 3 is a flowchart illustrating a document classification method according to another exemplary embodiment of the present application.

FIG. 4 is a schematic diagram illustrating a document classification method according to an exemplary embodiment of the present application.

Fig. 5 is a schematic structural diagram of a document classification device according to an exemplary embodiment of the present application.

Fig. 6 is a schematic structural diagram of a document classification device according to another exemplary embodiment of the present application.

Fig. 7 is a block diagram of an electronic device provided in an exemplary embodiment of the present application.

Description of reference numerals: 31. document input unit; 32. document extractor; 33. candidate word relation extractor; 34. similarity calculator; 35. classification recorder; 36. classification outputter.

Detailed Description

Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.

Exemplary System

Fig. 1 is a system diagram to which the present application is applicable, and as shown in fig. 1, the present application can be applied to a document classification system 3 including: a document input unit 31 for receiving an input document; a document extractor 32 for performing title and text content recognition on the document; a candidate word relation extractor 33, configured to extract keywords of the title and the text content of the document as candidate words, and layer the candidate words; a similarity calculator 34, configured to calculate a similarity between the candidate word and a preset classification; the classification recorder 35 is used for recording the preset classification labels and the hierarchical relationship of the preset classification labels; a classification outputter 36 for outputting classification information of the document; and the controller is used for executing any document classification method provided by the application.

The document input unit 31 can receive input electronic documents in various formats, such as plain text, Word and PDF; the input documents must be in a format from which the document extractor 32 described below can extract content.

The document extractor 32 identifies the title and text content of the input electronic document; for a file whose content cannot be recognized, it identifies and extracts only the file name or file title.

The candidate word relation extractor 33 extracts a plurality of keywords from the title and text content of the document, extracts the relation between candidate words using the plurality of keywords as candidate words, and establishes a plurality of hierarchical tags according to the relation between the candidate words.

The similarity calculator 34 is configured to calculate the similarity between the candidate words and a preset classification. The calculation can be performed between the candidate words and the vocabulary of the preset classification, that is, as a similarity calculation between two groups of words.

The classification recorder 35 is configured to record a classification label of a preset classification and a hierarchical relationship of the label, that is, currently existing classification information, including a classification name, a hierarchical relationship, a classification feature (that is, a classified keyword), and the like.

The classification outputter 36 finally outputs the classification information of the input document, and may store the classification information separately in one database, or store the classification information as the attribute of the document, or store the document to a specified directory location by naming a directory name according to each layer of classification tags.
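The third storage option above, placing the document in a directory named after each level of classification label, can be sketched as follows. This is a minimal illustration only: the function name `target_path`, the root directory and the example labels are assumptions and do not come from the application.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def target_path(root: str, level_labels: list[str], filename: str) -> PurePosixPath:
    """Build root/<level 1 label>/<level 2 label>/.../<filename>.

    level_labels holds one classification label per hierarchy level,
    ordered from the first level downwards (hypothetical input format).
    """
    return PurePosixPath(root, *level_labels, filename)


# Hypothetical three-level classification result for one document.
print(target_path("/archive", ["finance", "tax", "vat"], "report.pdf"))
# /archive/finance/tax/vat/report.pdf
```

Storing the labels in a database or as document attributes, the other two options named in the text, would leave the file where it is and record only the label list.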

Exemplary method

Fig. 2 is a flowchart illustrating a document classification method according to an exemplary embodiment of the present application, where as shown in fig. 2, the document classification method includes:

Step 110: comparing the similarity of the candidate words and the preset classification to obtain a comparison result.

Wherein the candidate words are extracted from the document; the preset classification includes a preset classification label.

The preset classification consists of the currently recorded classification labels and the hierarchical relationship of those labels, and is stored in the classification recorder. The first-level classification labels may or may not be predefined. If the first-level labels are predefined, their features must also be specified, either as a group of feature words given in advance or as keywords extracted from a preset text description. The total number of hierarchy levels of the labels is predefined.

Keywords are extracted from each document to be classified as candidate words, and the candidate words are compared with the feature words representing each preset classification to obtain a similarity result.

Step 120: when the comparison result shows that the similarity between the candidate words and any one of the preset classifications reaches a preset standard, determining that the document belongs to the preset classification whose similarity reaches the preset standard.

The candidate words are generally a group of words, and each preset classification is also represented by a group of words, so the classification with the maximum similarity can be found with any algorithm that computes the similarity of two groups of words.

In this document classification method, keywords are extracted as candidate words and compared with the keywords of the preset classifications to compute a similarity, and the classification of the document is determined from that similarity. The calculation is simple and efficient and the classification is accurate, which improves the efficiency of automatic document classification and reduces the likelihood of errors introduced by manual classification.
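The application does not name a specific algorithm for the similarity of two groups of words. As a minimal sketch, assuming Jaccard similarity (size of the intersection divided by size of the union) as a stand-in, the classification with the maximum similarity could be selected as below. The function names and the example classes are illustrative assumptions.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity of two word groups as |a & b| / |a | b| (an assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_class(candidates: set[str], classes: dict[str, set[str]]) -> tuple[str, float]:
    """Return the preset classification most similar to the candidate words."""
    name = max(classes, key=lambda c: jaccard(candidates, classes[c]))
    return name, jaccard(candidates, classes[name])


# Hypothetical preset classifications, each described by a group of feature words.
preset = {
    "finance": {"invoice", "budget", "tax", "payment"},
    "engineering": {"design", "circuit", "firmware", "test"},
}
label, score = best_class({"invoice", "tax", "payment", "quarter"}, preset)
print(label, round(score, 2))  # finance 0.6
```

Step 120 would then compare the returned score with the preset standard before assigning the document to that classification.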

Fig. 3 is a flowchart illustrating a document classification method according to another exemplary embodiment of the present application, and as shown in fig. 3, the document classification method may further include:

Step 130: acquiring the preset classification labels.

The preset classification labels comprise a plurality of hierarchy levels, and the number of levels is preset.

The classification recorder records all current classification labels and the hierarchy of the labels. The number of levels M of the preset classification labels is predefined.

Step 140: layering the candidate words according to the number of levels of the preset classification labels to form the hierarchy labels of the candidate words, wherein the hierarchy label of each level is characterized by the candidate words of that level.

Candidate words are extracted from the document to be classified, and hierarchical relationships are established among all the candidate words, with as many levels as the preset classification labels have. Level 1 to M labels of the document are built from these relationships, so that the document can be classified at levels 1 to M based on the similarity between its hierarchy labels and the existing classification labels.

Correspondingly, the step 110 may be adjusted to:

Step 111: comparing the similarity of the hierarchy labels of the candidate words with the corresponding levels of the preset classification labels to obtain a comparison result.

Correspondingly, the document can be classified at levels 1 to M based on the similarity between its hierarchy labels and the existing classification labels. For each of the level 1 to M labels of the document, the similarity between the candidate words of that level and the same-level classification labels in the current preset classification is computed. If the similarity is greater than a specified threshold, the document is determined to belong to that classification label; if it is smaller than the threshold, a new classification label is added at that level. This completes the level 1 to M classification of the document. If the first-level classification labels are predefined and new ones cannot be added, the first-level classification with the maximum similarity can be selected directly without checking whether it exceeds the preset similarity; alternatively, one first-level classification can be designated as "other" and selected when no classification exceeds the preset similarity.

When the similarity of two groups of words is calculated, synonyms and near-synonyms in the groups can be identified with a general-purpose knowledge base, which improves the precision of the similarity calculation.

In an embodiment, after the step 111, the document classification method may further include: and when the comparison result shows that the similarity of the hierarchical label of the candidate word and the preset classification label reaches the preset similarity, determining that the document belongs to the corresponding hierarchy of the preset classification label.

For the level 1 to M labels of the document, the similarity between the candidate words of each level and the same-level classification labels in the current preset classification is computed; if the similarity reaches the preset similarity, the document is considered to belong to that classification label.

In an embodiment, after the step 111, the document classification method may further include: and when the comparison result shows that the similarity between the hierarchy label of the candidate word and the corresponding hierarchy of the preset classification label is smaller than the preset similarity, creating a new hierarchy classification label according to the hierarchy label of the candidate word, and adding the new hierarchy classification label to the corresponding hierarchy of the preset classification label.

For the level 1 to M labels of the document, the similarity between the candidate words of each level and the currently recorded classification labels of the same level is computed; if the similarity does not reach the preset similarity, a new classification label is added at that level. This completes the level 1 to M classification of the document.

The classification recorder records the names and features of all sub-classifications at each current level, and the preset similarity can typically be set to 90%. The candidate words of a given level of the document are matched against all sub-classifications in the classification recorder. If the maximum similarity is greater than the preset similarity, the document belongs to the classification with the maximum similarity; if the maximum similarity is smaller than the preset similarity, a new classification label is built from the candidate words of that level and stored in the classification recorder, and the document's label at that level is set to the new classification label. The name of each new classification label may be a combination of the related candidate words or a unique computed value, such as a hash value, and the features of the classification label are the related candidate words.
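The match-or-create rule for one level can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: Jaccard similarity stands in for the unspecified word-group similarity, a similarity exactly equal to the threshold is treated as a match, the new label is named with a truncated SHA-1 digest of its words, and the names `match_or_create` and `PRESET_SIMILARITY` are hypothetical.

```python
from __future__ import annotations

import hashlib

PRESET_SIMILARITY = 0.9  # the 90% value mentioned in the text


def jaccard(a: set[str], b: set[str]) -> float:
    """Assumed word-group similarity: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_or_create(level_words: set[str], level_classes: dict[str, set[str]]) -> str:
    """Return the label for one level, adding a new label when nothing matches.

    level_classes plays the role of one level of the classification recorder:
    it maps a label name to that label's feature words and is updated in place.
    """
    best_name, best_score = None, 0.0
    for name, features in level_classes.items():
        score = jaccard(level_words, features)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= PRESET_SIMILARITY:
        return best_name
    # No sufficiently similar label: name the new one by a hash of its words
    # and use the candidate words themselves as its features.
    joined = " ".join(sorted(level_words))
    new_name = "class-" + hashlib.sha1(joined.encode("utf-8")).hexdigest()[:8]
    level_classes[new_name] = set(level_words)
    return new_name
```

Because the new label's features are exactly the candidate words that created it, a later document with the same level words matches it with similarity 1.0 instead of creating a duplicate.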

In one embodiment, a method of extracting candidate words from a document may include: and extracting a plurality of keywords as candidate words according to the title of the document and the text content of the document.

When keywords are extracted from the document title and content, the title can be treated as part of the document content, or the title and the content can be given different weights so as to amplify the importance of the title in classification.

For example, keywords in the title may have a higher priority than keywords in the document content, or the title may be copied multiple times and then keyword extraction may be performed on the merged document content, which may increase the weight of the title.
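The title-copying approach can be sketched with a simple frequency-based extractor. The application leaves the keyword extraction algorithm open, so the word-count ranking, the small stop-word list, the `title_copies` parameter and the function name below are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from collections import Counter

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "to", "in", "is", "on"}


def extract_candidates(title: str, body: str, n: int, title_copies: int = 3) -> list[str]:
    """Return the n most frequent words after repeating the title title_copies times."""
    merged = " ".join([title] * title_copies + [body])
    words = [w for w in re.findall(r"[a-z]+", merged.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(n)]


title = "Quarterly Tax Report"
body = "The budget shows payment delays. Payment terms and budget lines changed."
print(extract_candidates(title, body, n=3))                  # title words dominate
print(extract_candidates(title, body, n=2, title_copies=1))  # body words dominate
```

With three copies of the title, each title word counts three times and outranks the body words that appear twice; with a single copy the repeated body words win, which is the weighting effect the text describes.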

In one embodiment, the step 140 can be adjusted to: and layering through the superior-inferior relation or the parallel relation between the candidate words to form the hierarchical label of the candidate words.

When the hierarchical relationship is established among the candidate words, synonyms and near-synonyms can be identified with a general-purpose knowledge base to form the superior-subordinate or parallel relationships between the candidate words, which improves the accuracy of the hierarchy. The similarity between the candidate words and the classification labels of the preset levels is judged with a word or vocabulary similarity calculation. The hierarchical relationship among a group of candidate words is obtained with a natural-language relation extraction algorithm: superior-subordinate and parallel relationships between the words are established to form 2-M hierarchical relationships, and each level must contain at least one candidate word. The candidate words of each level are then matched for similarity against the words of the corresponding currently preset level.
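Once superior-subordinate relations are available, assigning candidate words to levels reduces to measuring how far each word sits below the top of its chain. The sketch below assumes the relation extraction step has already produced a `parent_of` mapping from a word to its broader term; that mapping, the function name and the example words are hypothetical. The sketch does not enforce the requirement that every level contain at least one candidate word, so a real implementation would need a fallback for empty levels.

```python
from __future__ import annotations


def layer_candidates(words: list[str], parent_of: dict[str, str], m: int) -> list[list[str]]:
    """Split candidate words into m levels using superior-subordinate links.

    A word with no broader term goes to level 1; each step down the chain
    adds one level, capped at level m. Words in a parallel relationship
    (same broader term) therefore land on the same level.
    """
    def depth(word: str) -> int:
        steps, seen = 0, {word}
        while word in parent_of and parent_of[word] not in seen:
            word = parent_of[word]
            seen.add(word)  # guard against cyclic relations
            steps += 1
        return steps

    levels: list[list[str]] = [[] for _ in range(m)]
    for word in words:
        levels[min(depth(word), m - 1)].append(word)
    return levels


# Hypothetical relation-extraction output: "vat" is a kind of "tax", and so on.
relations = {"tax": "finance", "budget": "finance", "vat": "tax"}
print(layer_candidates(["finance", "tax", "budget", "vat"], relations, m=3))
# [['finance'], ['tax', 'budget'], ['vat']]
```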

In an embodiment, the step 111 can be adjusted to: and comparing the similarity of the candidate words in the hierarchy labels of the candidate words with the vocabulary in the hierarchy corresponding to the preset classification label to obtain a comparison result.

The similarity between the document's hierarchy labels and the classification labels can be calculated by comparing two groups of words: the candidate words are generally a group of words and each preset classification is also a group of words, so the preset classification with the maximum similarity can be found with an algorithm for the similarity of two groups of words.

FIG. 4 is a schematic diagram illustrating a document classification method according to an exemplary embodiment of the present application. As shown in FIG. 4, a list of first-level classifications of documents is predefined (step 48), and the features of each first-level classification, i.e., its labels, are preset. The first-level classification features can be obtained by designating a group of words as keywords in advance or by extracting keywords from a passage of descriptive text. The names and features of the first-level classifications are stored in the classification recorder (step 49). The number of label levels M is given in advance, i.e., there are M-1 label levels below the first level, but the labels and label hierarchy below the first level are generated automatically by the algorithm of the present invention.

The document is input into the document input unit (step 41). For each document to be classified, the document title and text content are extracted (step 42), and N keywords are then extracted from the title and text content as candidate words (step 44) using a keyword extraction algorithm (step 43). M is commonly 3 to 6 levels, and N is usually chosen as an integer multiple of M so that each level holds a similar number of keywords, which simplifies processing.

The method for extracting keywords from the title and content is not limited, and any common keyword extraction algorithm can be used. The title can be treated as part of the document content, or the title and the content can be given different weights so as to amplify the importance of the title in classification. For example, keywords in the title can be given higher priority than keywords in the document content, or the title can be copied several times before keyword extraction is performed on the merged content, so that the title carries more weight.

Further, M hierarchical relationships are established as specified above for all candidate words of the document (step 46), and M hierarchy labels of the document are built from these relationships, so that each hierarchy label of the document is characterized by some of the document's candidate words. The hierarchical relationship among a group of candidate words is obtained with a natural-language relation extraction algorithm (step 45): superior-subordinate and parallel relationships between the words are established to form the level 1 to M labels, and each level must contain at least one candidate word.

The candidate words in each hierarchy label are traversed (step 47), and the candidate words of each level are matched for similarity against the keywords of the currently preset classification labels of the corresponding level stored in the classification recorder (step 50). When the similarity of two groups of words is calculated, synonyms and near-synonyms in the groups can be identified with a general-purpose knowledge base, which improves the precision of the similarity calculation.

The maximum similarity between the candidate words of each level and the keywords of the currently preset classification labels of the corresponding level stored in the classification recorder is evaluated (step 51). If the maximum similarity between the candidate words of a level and the keywords of some label at the corresponding level exceeds the preset similarity, the document is determined to belong to that classification label, and that label in the classification recorder is selected as the document's label for the level (step 56). If the similarity is smaller than the specified threshold and the level is not the first level, a new classification is added at that level with the level's candidate words as its features or name. If the similarity is smaller than the specified threshold and the level is the first level, the first-level classification with the highest similarity is selected, or a first-level classification designated as "other" is selected.

The classification recorder records the names and features of all sub-classifications at each current level, and the preset similarity can typically be set to 90%. The candidate words of a given level of the document are matched against all sub-classifications of the corresponding level in the classification recorder. If the maximum similarity is greater than the preset similarity, the document belongs to the classification with the maximum similarity. If the maximum similarity is smaller than the preset similarity, it is first determined whether the level is the first level (step 52). For a level other than the first, a new classification label is created from the candidate words of that level (step 53) and stored at the corresponding level in the classification recorder (step 54), and the document's label at that level is set to the new classification label. The name of each new classification can be a combination of the related candidate words or a unique computed value, such as a hash value, and the features of the classification are the related candidate words.

For the first level, because new first-level classifications cannot be added, the first-level classification with the maximum similarity can be selected directly without checking whether it exceeds the preset similarity; alternatively, one first-level classification can be designated as "other" and selected when no classification exceeds the preset similarity (step 55).

Whether the traversal is complete is then checked (step 57). If not, step 47 is executed again; if so, the matching and classification of all levels is complete and the document classification result is output (step 58), which completes the level 1 to M classification of the document.

The level 1 to M labels of each document can be stored in a database together with the document name, stored as attributes of the document, or used to store the document at a designated directory location according to the classification name of each level.
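The traversal of FIG. 4 (steps 47 to 58) can be summarized in one loop. The sketch below is illustrative and rests on several assumptions: Jaccard similarity stands in for the unspecified similarity algorithm, a similarity equal to the threshold counts as a match, new labels are named by joining their candidate words, the first level falls back to the closest predefined classification rather than to an "other" class, and `classify_levels` and the recorder layout are hypothetical names.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Assumed word-group similarity: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_levels(levels: list[set[str]], recorder: list[dict[str, set[str]]],
                    threshold: float = 0.9) -> list[str]:
    """Assign one label per level; recorder[i] maps label name to feature words."""
    labels: list[str] = []
    for index, words in enumerate(levels):            # step 47: traverse the levels
        classes = recorder[index]
        best = max(classes, key=lambda name: jaccard(words, classes[name]), default=None)
        score = jaccard(words, classes[best]) if best is not None else 0.0  # steps 50-51
        if best is not None and score >= threshold:
            labels.append(best)                       # step 56: reuse existing label
        elif index == 0:                              # step 52: first level
            if best is None:
                raise ValueError("first-level classifications must be predefined")
            labels.append(best)                       # step 55: closest first-level class
        else:
            name = "+".join(sorted(words))            # step 53: new label from the words
            classes[name] = set(words)                # step 54: store it in the recorder
            labels.append(name)
    return labels                                     # step 58: classification result


recorder = [{"finance": {"tax", "budget"}, "engineering": {"circuit", "firmware"}}, {}, {}]
print(classify_levels([{"tax", "invoice"}, {"vat"}, {"vat", "return"}], recorder))
# ['finance', 'vat', 'return+vat']
```

Running the same document a second time returns the same labels without adding anything to the recorder, since the labels created on the first pass now match with similarity 1.0.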

Exemplary devices

Fig. 5 is a schematic structural diagram of a document sorting apparatus according to an exemplary embodiment of the present application, and as shown in fig. 5, the document sorting apparatus 8 includes: the comparison module 81 is used for comparing the similarity of the candidate words and the preset classification to obtain a comparison result; wherein the candidate words are extracted from the document; the preset classification comprises a preset classification label; and the determining module 82 is configured to determine that the document belongs to a preset classification with similarity reaching a preset standard when the comparison result indicates that the similarity of the candidate word and any one of the preset classifications reaches the preset standard.

The document classification device provided by the application can compare candidate words and preset keywords in a preset classification through the comparison module 81, carries out similarity calculation, judges the classification of documents according to the similarity through the determination module 82, is simple in calculation mode, high in calculation efficiency, accurate in classification of documents, improves the automatic classification efficiency of the documents, and reduces the probability of problems caused by manual classification.

Fig. 6 is a schematic structural diagram of a document classification device according to another exemplary embodiment of the present application. As shown in fig. 6, the document classification device 8 may further include: an obtaining module 83, configured to obtain a preset classification label; and a forming module 84, configured to layer the candidate words according to the number of hierarchy levels of the preset classification label to form hierarchy labels of the candidate words, where the hierarchy label of each candidate word is characterized by the candidate word of that hierarchy level. Correspondingly, the comparison module 81 may further include a comparison unit 811, configured to compare the similarity between the hierarchy labels of the candidate words and the corresponding hierarchy level of the preset classification label to obtain a comparison result.

In an embodiment, as shown in fig. 6, the document classification device 8 may be further configured to: when the comparison result shows that the similarity between the hierarchy label of the candidate word and the preset classification label reaches the preset similarity, determine that the document belongs to the corresponding hierarchy level of the preset classification label.

In an embodiment, as shown in fig. 6, the document classification device 8 may be further configured to: when the comparison result shows that the similarity between the hierarchy label of the candidate word and the corresponding hierarchy level of the preset classification label is smaller than the preset similarity, create a new hierarchy classification label according to the hierarchy label of the candidate word, and add the new hierarchy classification label to the corresponding hierarchy level of the preset classification label.
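The two outcomes at a given hierarchy level, matching an existing label or creating a new one, can be sketched as a single function. The name `match_or_create`, the flat list used to hold the labels of one level, and the injected `similarity` callable are all assumptions made for this example.

```python
from typing import Callable, List


def match_or_create(level_label: str,
                    preset_level: List[str],
                    similarity: Callable[[str, str], float],
                    preset_similarity: float) -> str:
    """Return the matching preset label, or add `level_label` as a new one."""
    # Score the candidate's label against every preset label at this level.
    scored = [(similarity(level_label, preset), preset) for preset in preset_level]
    if scored:
        best_score, best_label = max(scored, key=lambda pair: pair[0])
        # Similarity reaches the preset similarity: use the existing label.
        if best_score >= preset_similarity:
            return best_label
    # Otherwise create a new classification label at this level.
    preset_level.append(level_label)
    return level_label


# Hypothetical stand-in similarity: 1.0 for identical strings, else 0.0.
def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


level_2 = ["contracts", "invoices"]
matched = match_or_create("invoices", level_2, exact, 0.8)  # existing label
created = match_or_create("payroll", level_2, exact, 0.8)   # new label appended
```

Mutating the list in place keeps the newly created label available to documents classified later, which is the effect the embodiment describes.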

In an embodiment, as shown in fig. 6, the document classification device 8 may be further configured to: extract a plurality of keywords as candidate words according to the title of the document and the text content of the document.

In one embodiment, as shown in fig. 6, the forming module 84 may be further configured to: layer the candidate words according to the superior-inferior relation or the parallel relation between them to form the hierarchy labels of the candidate words.

In an embodiment, as shown in fig. 6, the comparison unit 811 may be further configured to: compare the similarity between the candidate words in the hierarchy labels of the candidate words and the vocabulary in the corresponding hierarchy level of the preset classification label to obtain a comparison result.
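One way such a word-to-vocabulary comparison could be computed is sketched below. The application does not name a specific measure in this passage, so both choices here are assumptions: character-level similarity from Python's standard `difflib.SequenceMatcher` for a pair of words, and the mean of each candidate word's best match as the score for the level.

```python
from difflib import SequenceMatcher
from typing import List


def word_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def level_similarity(candidate_words: List[str], vocabulary: List[str]) -> float:
    """Mean, over the candidate words, of each word's best match in the vocabulary."""
    if not candidate_words or not vocabulary:
        return 0.0
    best = [max(word_similarity(w, v) for v in vocabulary) for w in candidate_words]
    return sum(best) / len(best)
```

A semantic measure, such as one based on word vectors, could be substituted for `word_similarity` without changing `level_similarity`.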

Exemplary electronic device

Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 7. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.

FIG. 7 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.

As shown in fig. 7, the electronic device 10 includes one or more processors 11 and memory 12.

The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.

Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 11 to implement the document classification methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.

In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).

When the electronic device is a stand-alone device, the input device 13 may be a communication network connector for receiving the acquired input signals from the first device and the second device.

The input device 13 may also include, for example, a keyboard, a mouse, and the like.

The output device 14 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.

Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 7, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.

The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims

1. A method of classifying a document, comprising:

comparing the similarity of the candidate words and a preset classification to obtain a comparison result; wherein the candidate words are extracted from the document; the preset classification comprises a preset classification label; and

when the comparison result shows that the similarity between the candidate word and any one of the preset classifications reaches a preset standard, determining that the document belongs to the preset classification with the similarity reaching the preset standard.

2. The document classification method according to claim 1, further comprising:

acquiring a preset classification label; the preset classification label comprises a plurality of hierarchy levels, and the hierarchy levels are preset;

layering the candidate words according to the number of hierarchy levels of the preset classification label to form the hierarchy labels of the candidate words, wherein the hierarchy label of each candidate word is characterized by the candidate word of that hierarchy level;

wherein, comparing the similarity of the candidate words and the preset classification to obtain a comparison result comprises:

comparing the similarity of the hierarchy label of the candidate word with the corresponding hierarchy of the preset classification label to obtain a comparison result.

3. The document classification method according to claim 2, wherein after comparing the similarity between the hierarchical label of the candidate word and the corresponding hierarchical level of the preset classification label to obtain a comparison result, the method further comprises:

when the comparison result shows that the similarity of the hierarchy label of the candidate word and the hierarchy corresponding to the preset classification label reaches the preset similarity, determining that the document belongs to the corresponding hierarchy of the preset classification label.

4. The document classification method according to claim 2, wherein after comparing the similarity between the hierarchical label of the candidate word and the corresponding hierarchical level of the preset classification label to obtain a comparison result, the method further comprises:

when the comparison result shows that the similarity between the hierarchy label of the candidate word and the hierarchy corresponding to the preset classification label is smaller than the preset similarity, creating a new hierarchy classification label according to the hierarchy label of the candidate word, and adding the new hierarchy classification label to the corresponding hierarchy of the preset classification label.

5. The method of classifying a document according to claim 1, wherein the method of extracting the candidate word from the document comprises:

extracting a plurality of keywords as the candidate words according to the title of the document and the text content of the document.

6. The document classification method according to claim 2, wherein the step of layering the candidate words according to the hierarchical number of preset classification tags comprises:

layering the candidate words according to the superior-inferior relation or the parallel relation between the candidate words to form a hierarchy label of the candidate words.

7. The document classification method according to claim 2, wherein the comparing the similarity between the hierarchical label of the candidate word and the corresponding hierarchical level of the preset classification label to obtain a comparison result comprises:

comparing the similarity of the candidate words in the hierarchy labels of the candidate words with the vocabulary in the hierarchy corresponding to the preset classification label to obtain a comparison result.

8. A document classification device, comprising:

a comparison module for comparing the similarity of the candidate words and the preset classification to obtain a comparison result; wherein the candidate words are extracted from the document; the preset classification comprises a preset classification label; and

a determining module for determining that the document belongs to the preset classification with the similarity reaching a preset standard when the comparison result shows that the similarity of the candidate word and any one of the preset classifications reaches the preset standard.

9. A document classification system, comprising:

a document input unit for receiving an input document;

a document extractor for identifying the title and the text content of the document;

a candidate word relation extractor for extracting keywords from the title and the text content of the document as candidate words and layering the candidate words;

a similarity calculator for calculating the similarity between the candidate words and a preset classification;

a classification recorder for recording the preset classification labels and the hierarchy relation among the preset classification labels;

a classification outputter for outputting classification information of the document; and

a controller for performing the document classification method of any of claims 1-7 above.

10. An electronic device, the electronic device comprising:

a processor;

a memory for storing instructions executable by the processor;

wherein the processor is configured to perform the document classification method of any of the preceding claims 1-7.