CN111930944A

CN111930944A - File label classification method and device

Info

Publication number: CN111930944A
Application number: CN202010806917.6A
Authority: CN
Inventors: 虞樱
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2020-08-12
Filing date: 2020-08-12
Publication date: 2020-11-13
Anticipated expiration: 2040-08-12
Also published as: CN111930944B

Abstract

The invention provides a file label classification method and device. The method comprises the following steps: extracting a subject term of each node in a label system tree, wherein the label system tree comprises the subject terms of multi-level nodes of a plurality of files; adding the subject terms of the upward multi-level nodes of each leaf node in the label system tree into a category group to form a subject term set corresponding to each category group; acquiring a feature word set corresponding to each category group, and adding the feature words in that feature word set into the subject term set corresponding to the category group; creating an inverted index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted index file tree; training a text classification model with the training set to obtain a trained text classification model; and inputting a file to be classified into the trained text classification model to predict its label classification. The invention can classify file labels quickly and accurately.

Description

File label classification method and device

Technical Field

The invention relates to the technical field of computers, in particular to a file label classification method and device.

Background

File label classification is a widespread requirement in document management. For example, internal audits at a commercial bank take the various regulatory documents officially issued within the banking system as the audit basis. Large banks have many staff, a fine-grained division of labor among departments, and a wide variety of businesses, so the body of regulatory documents keeps growing. Whether an auditor can quickly and accurately locate the required system files in such a large system library has become a key factor in audit efficiency.

Current file label classification methods include manual methods, shallow machine learning model methods, deep machine learning model methods, and the like.

The manual method classifies and labels, one by one and by hand, the more than 100,000 files in each file library. This is time-consuming and labor-intensive, has a high error rate, and requires staff with professional skills. Moreover, the files in each library are updated frequently and the label system keeps evolving, so manual labeling cannot meet the needs of business development.

The shallow machine learning model approach predicts new data objects from historical data associated with the labels to be predicted. It requires a large amount of training data, which in turn requires a great deal of manual labeling that depends on highly specialized knowledge. In addition, the system documents of financial institutions are usually wholly or partly confidential, which makes such labeling very difficult in practice. These shallow machine learning algorithms also suffer from data sparsity and require features to be defined and selected manually, which is very expensive.

The deep learning model method is a machine learning algorithm based on a linear model. Its advantages are a simple model, fast convergence, and no need for feature selection, and it suits data that is large in volume but has few features. The deep machine learning model solves the sparse-feature problem of the traditional shallow machine learning methods and classifies text end to end with a deep neural network. Its greatest advantage is simplicity: no particularly complex pipeline needs to be designed, and no particularly deep understanding of the problem is required, because the data is fed directly into the model. Its major disadvantage is data redundancy: a large amount of data or information irrelevant to the problem can seriously affect the final result, or make the model very large, hard to optimize, and wasteful of resources.

In summary, a fast and accurate file tag classification method is lacking at present.

Disclosure of Invention

The embodiment of the invention provides a file label classification method, which is used for rapidly and accurately classifying file labels and comprises the following steps:

extracting a subject term of each node in a label system tree, wherein the label system tree comprises the subject terms of multi-level nodes of a plurality of files;

adding the subject term of the upward multi-level node of each leaf node in the label system tree into a category group to form a subject term set corresponding to each category group;

obtaining a feature word set corresponding to each category group, and adding feature words in the feature word set corresponding to each category group into a subject word set corresponding to the category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under a leaf node corresponding to the category group;

creating an inverted index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted index file tree;

training a text classification model by adopting the training set to obtain a trained text classification model;

and inputting the files to be classified into the trained text classification model, and predicting the label classification of the files to be classified.

The embodiment of the invention provides a file label classification device, which is used for rapidly and accurately classifying file labels and comprises the following components:

the subject term extraction module is used for extracting the subject term of each node in a label system tree, wherein the label system tree comprises the subject terms of multi-level nodes of a plurality of files;

the first subject term generation module is used for adding the subject terms of the upward multi-level nodes of each leaf node in the label system tree into a category group to form a subject term set corresponding to each category group;

the second subject word generation module is used for obtaining a feature word set corresponding to each category group, and adding the feature words in the feature word set corresponding to each category group into the subject word set corresponding to the category group, wherein the feature word set corresponding to each category group comprises the feature words of a plurality of files under a leaf node corresponding to the category group;

the training set construction module is used for creating an inverted index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted index file tree;

the model training module is used for training the text classification model by adopting the training set to obtain a trained text classification model;

and the prediction module is used for inputting the files to be classified into the trained text classification model and predicting the label classification of the files to be classified.

The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the file label classification method when executing the computer program.

The embodiment of the invention also provides a computer readable storage medium, and the computer readable storage medium stores a computer program for executing the file label classification method.

In the embodiment of the invention, the subject term of each node in a label system tree is extracted, wherein the label system tree comprises the subject terms of multi-level nodes of a plurality of files; adding the subject term of the upward multi-level node of each leaf node in the label system tree into a category group to form a subject term set corresponding to each category group; obtaining a feature word set corresponding to each category group, and adding feature words in the feature word set corresponding to each category group into a subject word set corresponding to the category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under a leaf node corresponding to the category group; creating an inverted sorting index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted sorting index file tree; training a text classification model by adopting the training set to obtain a trained text classification model; and inputting the files to be classified into the trained text classification model, and predicting the label classification of the files to be classified. In the embodiment, the subject term set corresponding to the category group is automatically obtained, so that the manual extraction of the label subject terms is avoided, and the efficiency is high; and creating an inverted sorting index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted sorting index file tree, namely automatically constructing the training set, so that the efficiency and the accuracy of constructing the training set are improved, and the label classification of the file to be classified is predicted more accurately finally.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:

FIG. 1 is a flowchart of a file tag classification method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a tag hierarchy tree in an embodiment of the present invention;

FIG. 3 is a schematic diagram of an inverted index file tree according to an embodiment of the present invention;

FIG. 4 is a detailed flowchart of a file tag classification method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a file label classification device according to an embodiment of the present invention;

FIG. 6 is a diagram of a computer device in an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.

In the description of the present specification, the terms "comprising," "including," "having," "containing," and the like are used in an open-ended fashion, i.e., to mean including, but not limited to. Reference to the description of the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The sequence of steps involved in the embodiments is for illustrative purposes to illustrate the implementation of the present application, and the sequence of steps is not limited and can be adjusted as needed.

Fig. 1 is a flowchart of a file tag classification method in an embodiment of the present invention, and as shown in fig. 1, the method includes:

step 101, extracting a subject term of each node in a label system tree, wherein the label system tree comprises the subject terms of multi-level nodes of a plurality of files;

step 102, adding the subject term of the upward multi-level node of each leaf node in the label system tree into a category group to form a subject term set corresponding to each category group;

step 103, obtaining a feature word set corresponding to each category group, and adding feature words in the feature word set corresponding to each category group into a subject word set corresponding to the category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under a leaf node corresponding to the category group;

step 104, creating an inverted index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted index file tree;

step 105, training a text classification model by using the training set to obtain a trained text classification model;

and step 106, inputting the file to be classified into the trained text classification model, and predicting the label classification of the file to be classified.

In the embodiment of the invention, the subject term set corresponding to each category group is obtained automatically, which avoids manual extraction of label subject terms and is efficient. An inverted index file tree is created according to the subject term set and the files corresponding to each category group, and the training set is constructed from that tree, that is, the training set is built automatically. This improves the efficiency and accuracy of training set construction, so that the label classification of the file to be classified is ultimately predicted more accurately.

In a specific implementation, fig. 2 is a schematic diagram of a label system tree in the embodiment of the present invention. A leaf node, for example L511111, contains a plurality of files. If the label system tree is directed to an internal audit system, these files are system files, such as audit problem files, audit data files, audit basis files, and audit program files. In step 101, the subject term of each node, that is, the label subject term, is extracted.

In step 102, the subject term of the upward multi-level node of each leaf node in the tag hierarchy tree is added into a category group to form a subject term set corresponding to each category group, so that each leaf node has a subject term set.
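
As a rough illustration of step 102, the sketch below walks from every leaf node up to the root and collects the subject terms found on the way. The `LabelNode` class, the `build_category_groups` function, the node identifiers, and the sample terms are all hypothetical and are not taken from the embodiment. The sketch also assumes that a leaf's own subject term belongs to its category group, which the text does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class LabelNode:
    """One node of the label system tree (hypothetical structure)."""
    node_id: str
    subject_term: str
    # Exclude the parent link from repr/eq to avoid infinite recursion.
    parent: Optional["LabelNode"] = field(default=None, repr=False, compare=False)
    children: List["LabelNode"] = field(default_factory=list)

    def add_child(self, node_id: str, subject_term: str) -> "LabelNode":
        child = LabelNode(node_id, subject_term, parent=self)
        self.children.append(child)
        return child


def build_category_groups(root: LabelNode) -> Dict[str, Set[str]]:
    """Map each leaf node id to the subject terms on its path to the root."""
    groups: Dict[str, Set[str]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
            continue
        # Leaf node: walk upward and collect the subject term of every level.
        terms: Set[str] = set()
        current: Optional[LabelNode] = node
        while current is not None:
            terms.add(current.subject_term)
            current = current.parent
        groups[node.node_id] = terms
    return groups


# Tiny illustrative tree with two leaf nodes.
root = LabelNode("L1", "audit")
basis = root.add_child("L11", "audit basis")
basis.add_child("L111", "credit business")
basis.add_child("L112", "settlement business")
category_groups = build_category_groups(root)
```

Each leaf ends up with exactly one subject term set, which matches the statement that every leaf node has a category group.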

In one embodiment, obtaining the feature word set corresponding to each category group includes:

performing word segmentation processing on the content in the file under each leaf node in the label system tree to obtain a word set corresponding to each category group, wherein the word set corresponding to each category group comprises word sets of a plurality of files under the leaf node corresponding to the category group;

and selecting, from the word set corresponding to each category group, a first set number of words that appear with high frequency in the files as the feature word set corresponding to that category group, and deleting the feature words that appear in the feature word sets of all category groups.

In the foregoing embodiment, there are multiple files under each leaf node, and each leaf node has a category group. Word segmentation is performed on the content of the files under each leaf node; the content of each file forms a word vector, and the category group of each leaf node therefore also corresponds to a word set. In an embodiment, after performing word segmentation processing on the content in the file under each leaf node in the label system tree and obtaining the word set corresponding to each category group, the method further includes:

and deleting the stop words, the high-frequency words and the low-frequency words in the word set corresponding to each category group.

In the above embodiment, deleting the stop words (e.g., "even", "but", and the like), the high-frequency words (e.g., "in", "yes", and the like), and the low-frequency words (rarely used words) from the word set makes the remaining words better reflect the characteristics of the files, and also reduces the number of words and hence the processing load of subsequent steps.
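
The filtering described above might look like the following minimal sketch. The function name `filter_words` and the thresholds `max_ratio` and `min_count` are assumptions chosen for illustration; the embodiment does not specify how "high-frequency" and "low-frequency" are measured.

```python
from collections import Counter
from typing import Iterable, List, Set


def filter_words(
    words: Iterable[str],
    stop_words: Set[str],
    max_ratio: float = 0.4,
    min_count: int = 2,
) -> List[str]:
    """Drop stop words, overly frequent words, and rare words from a token list."""
    tokens = [w for w in words if w not in stop_words]
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        # Assumed thresholds: share of all tokens, and absolute count.
        too_frequent = counts[w] / total > max_ratio
        too_rare = counts[w] < min_count
        if not too_frequent and not too_rare:
            kept.append(w)
    return kept


sample = ["the", "loan", "loan", "audit", "audit",
          "in", "in", "in", "in", "in", "but", "rare"]
filtered = filter_words(sample, stop_words={"the", "but"})
```

In this toy input, "in" is removed as too frequent and "rare" as too infrequent, leaving only the tokens that say something about the file.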

In an embodiment, selecting a first set number of words that appear in the document with high frequency from the word set corresponding to each category group as a feature word set corresponding to each category group includes:

and selecting words with a first set number and high frequency in the document from the word set corresponding to each category group by adopting a TF-IDF algorithm to serve as the characteristic word set corresponding to each category group.

The TF-IDF algorithm is a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between a document and a user query. Selecting a first set number of high-frequency words with the TF-IDF algorithm therefore means calculating the TF-IDF value of each word of the file, sorting the words in descending order of that value, and taking the first several words.
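
To make the ranking concrete, here is a minimal sketch that uses one common TF-IDF variant: term frequency as a share of the file's tokens, multiplied by the natural logarithm of (number of files / number of files containing the word). The function name `top_tfidf_words` and the sample files are hypothetical, and the embodiment does not say which TF-IDF variant it uses.

```python
import math
from collections import Counter
from typing import Dict, List


def top_tfidf_words(docs: Dict[str, List[str]], doc_id: str, k: int) -> List[str]:
    """Return the k words of one file with the highest TF-IDF value."""
    n_docs = len(docs)
    # Document frequency: in how many files does each word occur?
    df: Counter = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    tokens = docs[doc_id]
    tf = Counter(tokens)
    scores = {
        w: (c / len(tokens)) * math.log(n_docs / df[w])
        for w, c in tf.items()
    }
    # Descending TF-IDF; ties broken alphabetically to keep the result stable.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


docs = {
    "file1": ["loan", "loan", "approval", "bank"],
    "file2": ["settlement", "clearing", "bank"],
    "file3": ["audit", "procedure", "bank"],
}
feature_words = top_tfidf_words(docs, "file1", k=2)
```

The word "bank" occurs in every sample file, so its inverse document frequency is zero and it never ranks among the feature words of "file1".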

In step 103, the feature words in the feature word set corresponding to each category group are added to the subject word set corresponding to the category group, so that the labeled subject word set includes not only the subject words currently extracted from the label system tree but also the feature words newly screened from the file, thereby realizing the correction of the subject words labeled in the current label system tree and further improving the accuracy of the subsequently constructed training set.
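
A possible sketch of this merge is shown below. It also applies the deletion of feature words shared by every category group, on the assumption that such words cannot distinguish one group from another. The function name `merge_feature_words` and the sample data are hypothetical.

```python
from typing import Dict, Set


def merge_feature_words(
    subject_terms: Dict[str, Set[str]],
    feature_words: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Add each group's feature words to its subject term set."""
    # Feature words present in every group are dropped first (assumed rule:
    # only applied when there are at least two groups to compare).
    if len(feature_words) > 1:
        shared = set.intersection(*feature_words.values())
    else:
        shared = set()
    return {
        group: terms | (feature_words.get(group, set()) - shared)
        for group, terms in subject_terms.items()
    }


subject_terms = {
    "L111": {"audit", "credit business"},
    "L112": {"audit", "settlement business"},
}
group_features = {
    "L111": {"loan", "bank"},
    "L112": {"clearing", "bank"},
}
merged = merge_feature_words(subject_terms, group_features)
```

The original subject term sets are left unmodified; the function returns new sets so the label system tree can still be inspected as it was.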

In one embodiment, creating an inverted index file tree according to the subject term set and the plurality of files corresponding to each category group includes:

merging the subject terms corresponding to all the category groups;

selecting a second set number of files from the plurality of files;

and creating an inverted index file tree according to the subject term and the frequency of the subject term appearing in each selected file.

Fig. 3 is a schematic diagram of an inverted index file tree in an embodiment of the present invention. First, the subject terms corresponding to all category groups are merged to form a set of subject term 1 to subject term N. Then a portion of the files, for example 10% of them, is selected from a very large file library, such as the system file library of an internal audit system. Finally, for each subject term, the selected files are sorted by how frequently the term appears in them. Taking subject term 1 as an example, the files in order of frequency are file 1, file 5, and file N; for subject term 2 they are file 1, file 2, and file K. The result is the inverted index file tree.
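
The structure in fig. 3 can be approximated by a dictionary that maps each subject term to a list of (file, frequency) pairs sorted by descending frequency. This is only a sketch: the function name `build_inverted_index` is hypothetical, files are assumed to be already segmented into tokens, and each subject term is assumed to be a single token.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_inverted_index(
    subject_terms: Iterable[str],
    files: Dict[str, List[str]],
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each subject term to the files containing it, most frequent first."""
    counts = {name: Counter(tokens) for name, tokens in files.items()}
    index: Dict[str, List[Tuple[str, int]]] = {}
    for term in sorted(set(subject_terms)):
        postings = [(name, c[term]) for name, c in counts.items() if c[term] > 0]
        # Descending frequency; ties broken by file name for a stable order.
        postings.sort(key=lambda p: (-p[1], p[0]))
        index[term] = postings
    return index


selected_files = {
    "file1": ["loan", "loan", "audit"],
    "file2": ["audit", "audit", "audit", "loan"],
    "file3": ["clearing"],
}
index = build_inverted_index(["loan", "audit"], selected_files)
```

Files that never mention a subject term simply do not appear in that term's list, which keeps the index small when only a sample of the library is processed.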

In one embodiment, constructing a training set based on the inverted index file tree includes:

calculating the weight of each file in the inverted index file tree according to the frequency and the position of the subject term appearing in each file based on the weight of each subject term;

and selecting a third set number of files from the second set number of files based on the weight of each file to construct a training set.

In the above embodiment, the weight of each subject term is first determined from empirical data; for example, the weight of subject term 1 is set to 20%, the weight of subject term 2 to 10%, and so on. If subject term 1 appears most frequently in file 1, and appears in the title and at the head of paragraphs, then file 1 should be given a very high weight. Finally, a third set number of files is selected from the second set number of files according to the file weights to construct the training set; for example, the files whose weights rank in the top 30% can be selected.
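
One way to combine subject term weight, frequency, and position into a file weight is sketched below. The embodiment gives no formula, so the additive scoring, the `POSITION_FACTOR` values, and the function names `file_weight` and `select_training_files` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

# Assumed multipliers: occurrences in prominent positions count for more.
POSITION_FACTOR = {"title": 3.0, "paragraph_head": 2.0, "body": 1.0}


def file_weight(
    occurrences: List[Tuple[str, str]],
    term_weights: Dict[str, float],
) -> float:
    """Sum term weight x position factor over every (term, position) occurrence."""
    return sum(
        term_weights.get(term, 0.0) * POSITION_FACTOR[position]
        for term, position in occurrences
    )


def select_training_files(
    file_occurrences: Dict[str, List[Tuple[str, str]]],
    term_weights: Dict[str, float],
    top_ratio: float = 0.3,
) -> List[str]:
    """Keep the files whose weight ranks in the top share given by top_ratio."""
    weights = {
        name: file_weight(occ, term_weights)
        for name, occ in file_occurrences.items()
    }
    ranked = sorted(weights, key=lambda n: (-weights[n], n))
    keep = max(1, math.ceil(len(ranked) * top_ratio))
    return ranked[:keep]


term_weights = {"loan": 0.2, "audit": 0.1}
file_occurrences = {
    "file1": [("loan", "title"), ("loan", "paragraph_head"), ("loan", "body")],
    "file2": [("audit", "body"), ("loan", "body")],
    "file3": [("audit", "body")],
    "file4": [],
}
training_files = select_training_files(file_occurrences, term_weights)
```

Because every occurrence contributes separately, a term that appears often contributes more, so frequency is reflected without a separate counting step.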

In one embodiment, the text classification model is a TextCNN model.

TextCNN is a machine learning model. The text classification model, for example a TextCNN model, is trained with the training set to obtain the trained text classification model. Finally, the file to be classified is input into the trained text classification model to predict its label classification.

Based on the above embodiment, a detailed flowchart of a file tag classification method is given below, as shown in fig. 4, including:

step 401, extracting a subject term of each node in a label system tree, wherein the label system tree comprises subject terms of multi-level nodes of a plurality of files;

step 402, adding the subject term of the upward multi-level node of each leaf node in the label system tree into a category group to form a subject term set corresponding to each category group;

step 403, performing word segmentation processing on the content in the file under each leaf node in the tag system tree to obtain a word set corresponding to each category group;

step 404, deleting the stop words, the high-frequency words and the low-frequency words in the word set corresponding to each category group;

step 405, selecting a first set number of words which appear in the file with high frequency from the word set corresponding to each category group by adopting a TF-IDF algorithm, taking the words as the feature word set corresponding to each category group, and deleting the feature words which appear in the feature word sets of all category groups;

step 406, adding the feature words in the feature word set corresponding to each category group into the subject word set corresponding to the category group;

step 407, merging the subject terms corresponding to all the category groups;

step 408, selecting a second set number of files from the plurality of files;

step 409, creating an inverted index file tree according to the subject term and the frequency of the subject term appearing in each selected file;

step 410, based on the weight of each subject term, calculating the weight of each file in the inverted index file tree according to the frequency and the position of the subject term appearing in each file;

step 411, selecting a third set number of files from the second set number of files based on the weight of each file, and constructing a training set;

step 412, training a text classification model by using the training set to obtain a trained text classification model;

step 413, inputting the file to be classified into the trained text classification model, and predicting the label classification of the file to be classified.

It is to be understood, of course, that other embodiments are possible and that modifications are intended to fall within the scope of the invention.

In summary, in the method provided in the embodiment of the present invention, a topic word of each node in a label system tree is extracted, where the label system tree includes topic words of multi-level nodes of multiple files; adding the subject term of the upward multi-level node of each leaf node in the label system tree into a category group to form a subject term set corresponding to each category group; obtaining a feature word set corresponding to each category group, and adding feature words in the feature word set corresponding to each category group into a subject word set corresponding to the category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under a leaf node corresponding to the category group; creating an inverted sorting index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted sorting index file tree; training a text classification model by adopting the training set to obtain a trained text classification model; and inputting the files to be classified into the trained text classification model, and predicting the label classification of the files to be classified. In the embodiment, the subject term set corresponding to the category group is automatically obtained, so that the manual extraction of the label subject terms is avoided, and the efficiency is high; and creating an inverted sorting index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted sorting index file tree, namely automatically constructing the training set, so that the efficiency and the accuracy of constructing the training set are improved, and the label classification of the file to be classified is predicted more accurately finally.

The embodiment of the invention also provides a file label classification device, the principle of which is similar to that of a file label classification method, and details are not repeated here.

Fig. 5 is a schematic diagram of a file label classification device according to an embodiment of the present invention, and as shown in fig. 5, the device includes:

a topic word extraction module 501, configured to extract a topic word of each node in a label system tree, where the label system tree includes topic words of multiple levels of nodes in multiple files;

a first topic word generation module 502, configured to add a topic word of an upward multi-level node of each leaf node in a label system tree into one category group, so as to form a topic word set corresponding to each category group;

a second topic word generation module 503, configured to obtain a feature word set corresponding to each category group, and add a feature word in the feature word set corresponding to each category group to the topic word set corresponding to the category group, where the feature word set corresponding to each category group includes feature words of multiple files under a leaf node corresponding to the category group;

a training set constructing module 504, configured to create an inverted index file tree according to the topic word set and the multiple files corresponding to each category group, and construct a training set based on the inverted index file tree;

a model training module 505, configured to train a text classification model using the training set to obtain a trained text classification model;

and the predicting module 506 is used for inputting the file to be classified into the trained text classification model and predicting the label classification of the file to be classified.

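In an embodiment, the second topic word generation module 503 is specifically configured to:

perform word segmentation processing on the content in the file under each leaf node in the label system tree to obtain a word set corresponding to each category group, where the word set corresponding to each category group includes word sets of multiple files under the leaf node corresponding to the category group;

and select, from the word set corresponding to each category group, a first set number of words that appear with high frequency in the files as the feature word set corresponding to that category group, and delete the feature words that appear in the feature word sets of all category groups.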

In one embodiment, the apparatus further comprises a preprocessing module 507 for:

after the content in the file under each leaf node in the label system tree is subjected to word segmentation processing to obtain a word set corresponding to each category group, stop words, high-frequency words and low-frequency words in the word set corresponding to each category group are deleted.

In one embodiment, the training set construction module 504 is specifically configured to:

merging the subject terms corresponding to all the category groups;

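selecting a second set number of files from the plurality of files;

and creating an inverted index file tree according to the subject term and the frequency of the subject term appearing in each selected file.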

In one embodiment, the text classification model is a TextCNN model.

In summary, in the apparatus provided in the embodiment of the present invention, a topic word of each node in a label system tree is extracted, where the label system tree includes topic words of multiple levels of nodes of multiple files; adding the subject term of the upward multi-level node of each leaf node in the label system tree into a category group to form a subject term set corresponding to each category group; obtaining a feature word set corresponding to each category group, and adding feature words in the feature word set corresponding to each category group into a subject word set corresponding to the category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under a leaf node corresponding to the category group; creating an inverted sorting index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted sorting index file tree; training a text classification model by adopting the training set to obtain a trained text classification model; and inputting the files to be classified into the trained text classification model, and predicting the label classification of the files to be classified. In the embodiment, the subject term set corresponding to the category group is automatically obtained, so that the manual extraction of the label subject terms is avoided, and the efficiency is high; and creating an inverted sorting index file tree according to the subject term set and the files corresponding to each category group, and constructing a training set based on the inverted sorting index file tree, namely automatically constructing the training set, so that the efficiency and the accuracy of constructing the training set are improved, and the label classification of the file to be classified is predicted more accurately finally.

An embodiment of the present application further provides a computer device. Fig. 6 is a schematic diagram of the computer device in the embodiment of the present invention. The computer device is capable of implementing all steps of the file label classification method in the above embodiment, and specifically includes the following:

a processor 601, a memory 602, a communication interface 603, and a bus 604;

the processor 601, the memory 602 and the communication interface 603 complete mutual communication through the bus 604; the communication interface 603 is used for implementing information transmission among related devices such as server-side devices, detection devices, user-side devices and the like;

the processor 601 is used to call the computer program in the memory 602, and when executing the computer program, the processor implements all the steps of the file label classification method in the above embodiment.

An embodiment of the present application further provides a computer-readable storage medium capable of implementing all the steps of the file label classification method in the above embodiment. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, all the steps of the file label classification method in the above embodiment are implemented.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned embodiments further illustrate the objects, technical solutions and advantages of the present invention in detail. It should be understood that they are only exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention shall be included in the scope of the present invention.

Claims

1. A file label classification method is characterized by comprising the following steps:

2. The method for classifying file labels according to claim 1, wherein obtaining the feature word set corresponding to each category group comprises:

3. The method for classifying file labels according to claim 2, wherein after performing word segmentation processing on the contents in the file under each leaf node in the label system tree to obtain a word set corresponding to each category group, the method further comprises:

4. The method for classifying file labels according to claim 2, wherein selecting a first set number of words that appear in the file with high frequency from the word set corresponding to each category group as the feature word set corresponding to each category group comprises:

5. The file label classification method according to claim 1, wherein creating the inverted index file tree according to the subject term set and the plurality of files corresponding to each category group comprises:

merging the subject terms corresponding to all the category groups;

selecting a second set number of files from the plurality of files;

6. The file label classification method of claim 1, wherein constructing a training set based on the inverted sorted index file tree comprises:

7. The file label classification method of claim 1, wherein the text classification model is a TextCNN model.

8. A file label classification apparatus, comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 7.