CN111930944B - File label classification method and device - Google Patents

Info

Publication number
CN111930944B
CN111930944B CN202010806917.6A CN202010806917A CN111930944B CN 111930944 B CN111930944 B CN 111930944B CN 202010806917 A CN202010806917 A CN 202010806917A CN 111930944 B CN111930944 B CN 111930944B
Authority
CN
China
Prior art keywords
files
subject
word
category group
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010806917.6A
Other languages
Chinese (zh)
Other versions
CN111930944A (en)
Inventor
虞樱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202010806917.6A
Publication of CN111930944A
Application granted
Publication of CN111930944B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a file label classification method and device. The method comprises: extracting the subject word of each node in a label system tree, wherein the label system tree comprises the subject words of multi-level nodes of a plurality of files; adding the subject words of the multi-level nodes above each leaf node in the label system tree into a category group to form a subject word set corresponding to each category group; obtaining a feature word set corresponding to each category group, and adding the feature words in that feature word set into the subject word set corresponding to the category group; creating an inverted index file tree according to the subject word set corresponding to each category group and the plurality of files, and constructing a training set based on the inverted index file tree; training a text classification model with the training set to obtain a trained text classification model; and inputting a file to be classified into the trained text classification model to predict the label classification of the file. The method and device can classify file labels quickly and accurately.

Description

File label classification method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for classifying file labels.
Background
File label classification is widely needed in file management. For example, in internal auditing at a commercial bank, the regulatory files formally issued within the bank serve as the basis for audits. Large banks have many staff, finely divided departments and a wide variety of businesses, so the body of rules and regulations keeps growing. Enabling auditors to locate the required regulatory files quickly and accurately in a large regulation library has become a key factor in improving audit efficiency.
At present, file label classification methods include manual methods, shallow machine learning model methods, deep machine learning model methods, and the like.
The manual method relies on people labelling, one by one, the more than 100,000 files in the various file libraries. It is time-consuming, labor-intensive and error-prone, and it requires staff with professional skills. Moreover, both the files in these libraries and the label system itself are updated frequently, so manual labelling cannot meet the needs of business development.
The shallow machine learning model method predicts the target for new data from historical data related to the label being predicted. It requires a large amount of training data, and therefore extensive manual annotation that relies on highly specialized knowledge. In addition, the regulatory files of financial institutions are usually confidential or partly confidential, which makes the approach very difficult in practice. These shallow machine learning algorithms also suffer from data sparsity and require features to be defined and selected manually, which is very costly.
The deep learning model method is a machine learning algorithm based on a linear model; its advantages are a simple model, fast convergence and no need for feature selection. It is suited to large data sets with few features. The deep machine learning model addresses the feature sparsity problem of traditional shallow machine learning methods and classifies text end to end directly with a deep neural network. Its major drawback is data redundancy: large amounts of data or information unrelated to the problem can seriously affect the final result, or lead to a model that is very large, difficult to optimize and wasteful of resources.
In view of the above, a fast and accurate method for classifying file labels is currently lacking.
Disclosure of Invention
An embodiment of the application provides a file label classification method for classifying file labels quickly and accurately, which comprises the following steps:
extracting the subject word of each node in a label system tree, wherein the label system tree comprises the subject words of multi-level nodes of a plurality of files;
adding the subject words of the multi-level nodes above each leaf node in the label system tree into a category group to form a subject word set corresponding to each category group;
obtaining a feature word set corresponding to each category group, and adding the feature words in the feature word set corresponding to each category group into the subject word set corresponding to that category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under the leaf node corresponding to the category group;
creating an inverted index file tree according to the subject word set corresponding to each category group and the plurality of files, and constructing a training set based on the inverted index file tree;
training a text classification model with the training set to obtain a trained text classification model;
and inputting a file to be classified into the trained text classification model, and predicting the label classification of the file to be classified.
An embodiment of the application provides a file label classification device for classifying file labels quickly and accurately, which comprises the following modules:
a subject word extraction module, configured to extract the subject word of each node in a label system tree, wherein the label system tree comprises the subject words of multi-level nodes of a plurality of files;
a first subject word generation module, configured to add the subject words of the multi-level nodes above each leaf node in the label system tree into a category group to form a subject word set corresponding to each category group;
a second subject word generation module, configured to obtain a feature word set corresponding to each category group and add the feature words in the feature word set corresponding to each category group into the subject word set corresponding to that category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under the leaf node corresponding to the category group;
a training set construction module, configured to create an inverted index file tree according to the subject word set corresponding to each category group and the plurality of files, and to construct a training set based on the inverted index file tree;
a model training module, configured to train a text classification model with the training set to obtain a trained text classification model;
and a prediction module, configured to input a file to be classified into the trained text classification model and predict the label classification of the file to be classified.
An embodiment of the application also provides a computer device, which comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the file label classification method when executing the computer program.
An embodiment of the application also provides a computer-readable storage medium, which stores a computer program for executing the file label classification method.
In the embodiment of the application, the subject word of each node in a label system tree is extracted, wherein the label system tree comprises the subject words of multi-level nodes of a plurality of files; the subject words of the multi-level nodes above each leaf node in the label system tree are added into a category group to form a subject word set corresponding to each category group; a feature word set corresponding to each category group is obtained, and the feature words in it are added into the subject word set corresponding to that category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under the leaf node corresponding to the category group; an inverted index file tree is created according to the subject word set corresponding to each category group and the plurality of files, and a training set is constructed based on the inverted index file tree; a text classification model is trained with the training set to obtain a trained text classification model; and a file to be classified is input into the trained text classification model to predict its label classification. In this embodiment, the subject word set corresponding to each category group is obtained automatically, which avoids manual extraction of label subject words and is efficient. Because the inverted index file tree is created from the subject word sets and the files, and the training set is constructed from that tree, the training set is built automatically, which improves the efficiency and accuracy of training set construction and ultimately makes the predicted label classification of the file to be classified more accurate.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
FIG. 1 is a flowchart of a file label classification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a label system tree according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an inverted index file tree according to an embodiment of the present application;
FIG. 4 is a detailed flowchart of a file label classification method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a file label classification device according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present application and their descriptions herein are for the purpose of explaining the present application, but are not to be construed as limiting the application.
In the description of the present specification, the terms "comprising," "including," "having," "containing," and the like are open-ended terms, meaning including, but not limited to. The description of the reference terms "one embodiment," "a particular embodiment," "some embodiments," "for example," etc., means that a particular feature, structure, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The order of steps involved in the embodiments is illustrative of the practice of the application, and is not limited and may be suitably modified as desired.
FIG. 1 is a flowchart of a file label classification method according to an embodiment of the present application. As shown in FIG. 1, the method includes:
step 101, extracting the subject word of each node in a label system tree, wherein the label system tree comprises the subject words of multi-level nodes of a plurality of files;
step 102, adding the subject words of the multi-level nodes above each leaf node in the label system tree into a category group to form a subject word set corresponding to each category group;
step 103, obtaining a feature word set corresponding to each category group, and adding the feature words in the feature word set corresponding to each category group into the subject word set corresponding to that category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under the leaf node corresponding to the category group;
step 104, creating an inverted index file tree according to the subject word set corresponding to each category group and the plurality of files, and constructing a training set based on the inverted index file tree;
step 105, training a text classification model with the training set to obtain a trained text classification model;
step 106, inputting a file to be classified into the trained text classification model, and predicting the label classification of the file to be classified.
In the embodiment of the application, the subject word set corresponding to each category group is obtained automatically, which avoids manual extraction of label subject words and is efficient. Because the inverted index file tree is created from the subject word set corresponding to each category group and the files, and the training set is constructed from that tree, the training set is built automatically, which improves the efficiency and accuracy of training set construction and ultimately makes the predicted label classification of the file to be classified more accurate.
In a specific implementation, FIG. 2 is a schematic diagram of a label system tree in an embodiment of the present application. A leaf node, such as L511111, contains a plurality of files, such as problem files and program files. For example, if the label system tree is for an internal audit system, the files are regulatory files, such as audit problem files, audit data files, audit basis files and audit program files. In step 101, the subject word of each node, that is, the label subject word, is extracted.
In step 102, the subject words of the multi-level nodes above each leaf node in the label system tree are added to a category group to form a subject word set corresponding to each category group, so that each leaf node has its own subject word set.
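Step 102 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `LabelNode` class, the function names and the sample subject words are all hypothetical, and the sketch assumes that "multi-level nodes above each leaf node" means every ancestor up to the root, with the leaf's own subject word included in its category group.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set


@dataclass
class LabelNode:
    """One node of a hypothetical label system tree; each node carries one subject word."""
    subject_word: str
    children: List["LabelNode"] = field(default_factory=list)
    # The parent link is excluded from repr/compare to avoid infinite recursion.
    parent: Optional["LabelNode"] = field(default=None, repr=False, compare=False)

    def add_child(self, subject_word: str) -> "LabelNode":
        child = LabelNode(subject_word, parent=self)
        self.children.append(child)
        return child


def iter_leaves(node: LabelNode) -> Iterator[LabelNode]:
    """Yield every leaf node below (or equal to) the given node."""
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from iter_leaves(child)


def subject_word_sets(root: LabelNode) -> Dict[str, Set[str]]:
    """Build one category group per leaf: the subject words on the path leaf -> root.

    Assumption: leaf subject words are unique, so they can serve as group keys.
    """
    groups: Dict[str, Set[str]] = {}
    for leaf in iter_leaves(root):
        words: Set[str] = set()
        node: Optional[LabelNode] = leaf
        while node is not None:
            words.add(node.subject_word)
            node = node.parent
        groups[leaf.subject_word] = words
    return groups


# Illustrative tree: audit -> audit basis -> credit business, audit -> audit program
root = LabelNode("audit")
basis = root.add_child("audit basis")
basis.add_child("credit business")
root.add_child("audit program")
groups = subject_word_sets(root)
```

Walking from each leaf to the root keeps the sketch independent of tree depth; a real system might limit the number of levels collected.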
In an embodiment, obtaining a feature word set corresponding to each category group includes:
performing word segmentation on the content of the files under each leaf node in the label system tree to obtain a word set corresponding to each category group, wherein the word set corresponding to each category group comprises the word sets of the plurality of files under the leaf node corresponding to the category group;
and selecting, from the word set corresponding to each category group, a first set number of words that occur with high frequency in the files as the feature word set corresponding to that category group, and deleting the feature words that occur in the feature word set of every category group.
In the above embodiment, each leaf node has a plurality of files beneath it and corresponds to one category group. The contents of the files under each leaf node are segmented into words; the content of each file can form a word vector, and the category group of each leaf node corresponds to one word set. In one embodiment, after word segmentation is performed on the content of the files under each leaf node in the label system tree to obtain the word set corresponding to each category group, the method further includes:
deleting the stop words, high-frequency words and low-frequency words in the word set corresponding to each category group.
In the above embodiment, stop words (such as "even" and "but"), high-frequency words (such as "in" and "yes") and low-frequency words (rarely used words) are deleted from the word set, so that the remaining words better reflect the characteristics of the files. The number of words is also reduced, which lowers the processing load of the subsequent steps.
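The segmentation and filtering described above might look like the following sketch. Several things here are assumptions rather than details from the application: whitespace splitting stands in for a real word segmenter, the stop-word list is illustrative, and the thresholds `max_share` (a word is "high-frequency" if it makes up more than this share of all tokens) and `min_count` (a word is "low-frequency" below this count) are hypothetical parameters.

```python
from collections import Counter
from typing import Iterable, List, Set


def segment(text: str) -> List[str]:
    """Stand-in word segmenter: lowercases and splits on whitespace."""
    return text.lower().split()


def build_word_set(
    file_texts: Iterable[str],
    stop_words: Set[str],
    max_share: float = 0.3,
    min_count: int = 2,
) -> Set[str]:
    """Return the filtered word set for the files of one category group."""
    counts: Counter = Counter()
    for text in file_texts:
        counts.update(segment(text))
    total_tokens = sum(counts.values())

    kept: Set[str] = set()
    for word, count in counts.items():
        if word in stop_words:
            continue  # stop word
        if count < min_count:
            continue  # low-frequency (rare) word
        if count / total_tokens > max_share:
            continue  # high-frequency word that carries little information
        kept.add(word)
    return kept


files_under_leaf = [
    "in audit loan in report in",
    "in audit credit in but in",
    "in loan audit in unusual in",
]
word_set = build_word_set(files_under_leaf, stop_words={"even", "but"})
```

In this example "in" is dropped as a high-frequency word, "but" as a stop word, and words that occur only once as low-frequency words.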
In an embodiment, selecting, from a word set corresponding to each category group, a first set number of words that occur in a file with high frequency as a feature word set corresponding to each category group, including:
selecting, with a TF-IDF algorithm, a first set number of words that occur with high frequency in the files from the word set corresponding to each category group, and taking them as the feature word set corresponding to that category group.
TF-IDF is a statistical method for evaluating how important a word is to one file in a file collection or corpus. The importance of a word increases in proportion to the number of times it appears in the file, and decreases with the frequency at which it appears across the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a file and a user query. Selecting a first set number of high-frequency words with the TF-IDF algorithm therefore means calculating the TF-IDF value of each word in the file, sorting the words in descending order of that value, and selecting the top-ranked words.
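The ranking described above can be sketched as below. This is one common TF-IDF variant, chosen here as an assumption: term frequency is the word's share of the tokens, and inverse document frequency is log(N / df). The sketch also assumes, for simplicity, that each category group's tokens are treated as a single document; the function name and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def top_feature_words(group_tokens: Dict[str, List[str]], k: int) -> Dict[str, List[str]]:
    """Return the k highest-scoring TF-IDF words for each category group."""
    n_groups = len(group_tokens)

    # Document frequency: in how many groups does each word appear?
    doc_freq: Counter = Counter()
    for tokens in group_tokens.values():
        doc_freq.update(set(tokens))

    result: Dict[str, List[str]] = {}
    for group, tokens in group_tokens.items():
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_groups / doc_freq[word])
            for word, count in counts.items()
        }
        # Descending score; ties broken alphabetically for a stable order.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[group] = ranked[:k]
    return result


features = top_feature_words(
    {
        "credit": ["loan", "loan", "collateral", "audit"],
        "treasury": ["bond", "bond", "yield", "audit"],
    },
    k=2,
)
```

With this variant a word that appears in every group, such as "audit" here, scores zero and falls to the bottom of the ranking.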
In step 103, the feature words in the feature word set corresponding to each category group are added to the subject word set corresponding to that category group. The subject word set of a label then contains not only the subject words extracted from the label system tree but also the feature words screened from the files. This corrects the label subject words in the label system tree and further improves the accuracy of the training set constructed later.
In one embodiment, creating an inverted index file tree from a set of subject words and a plurality of files corresponding to each category group includes:
merging the subject words corresponding to all the category groups;
selecting a second set number of files from the plurality of files;
creating an inverted index file tree according to the subject word and the frequency of occurrence of the subject word in each selected file.
FIG. 3 is a schematic diagram of an inverted index file tree in an embodiment of the present application. First, the subject words corresponding to all category groups are merged to form a set of subject words 1 to N. Then, a portion of the files, for example a set number equal to 10% of the files, is selected from a large file library, such as the regulatory file library of an internal audit system. Next, for each subject word, the files are sorted by how frequently the subject word occurs in them. For example, ordered by occurrence frequency, subject word 1 maps to files 1, 5 and N, and subject word 2 maps to files 1, 2 and K. The result is the inverted index file tree.
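A rough sketch of this construction follows. The names `sample_files` and `build_inverted_index` are hypothetical, files are assumed to be already segmented into token lists, and random sampling is an assumption, since the application does not say how the subset of files is chosen.

```python
import random
from collections import Counter
from typing import Dict, List, Set, Tuple


def sample_files(names: List[str], ratio: float = 0.1, seed: int = 0) -> List[str]:
    """Pick a fraction of the file names (assumed here to be a random sample)."""
    rng = random.Random(seed)
    k = max(1, round(len(names) * ratio))
    return rng.sample(sorted(names), k)


def build_inverted_index(
    subject_word_sets: Dict[str, Set[str]],
    files: Dict[str, List[str]],
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each merged subject word to (file name, frequency), most frequent first."""
    merged: Set[str] = set().union(*subject_word_sets.values())
    index: Dict[str, List[Tuple[str, int]]] = {word: [] for word in sorted(merged)}
    for name, tokens in files.items():
        counts = Counter(tokens)
        for word in merged:
            if counts[word] > 0:
                index[word].append((name, counts[word]))
    for postings in index.values():
        # Descending frequency; ties broken by file name for a stable order.
        postings.sort(key=lambda posting: (-posting[1], posting[0]))
    return index


index = build_inverted_index(
    {"credit": {"loan", "audit"}, "treasury": {"bond", "audit"}},
    {
        "file1": ["loan", "loan", "audit"],
        "file2": ["bond", "audit", "audit"],
        "file3": ["loan", "bond", "bond", "bond"],
    },
)
```

Each entry of `index` corresponds to one branch of the tree in FIG. 3: a subject word followed by the files in which it occurs, ordered by frequency.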
In one embodiment, constructing a training set based on the inverted index file tree includes:
calculating the weight of each file in the inverted index file tree based on the weight of each subject word and according to the frequency and position at which the subject word appears in each file;
and selecting a third set number of files from the second set number of files based on the weight of each file, and constructing a training set.
In the above embodiment, the weight of each subject word is first determined, for example from empirical data: the weight of subject word 1 may be set to 20%, the weight of subject word 2 to 10%, and so on. If subject word 1 occurs most frequently in file 1, and occurs in the title and at the beginning of paragraphs, the weight of file 1 is determined to be very high. Finally, a third set number of files is selected from the second set number of files according to the file weights to construct the training set; for example, the top 30% of files by weight may be selected.
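The application does not give a formula for the file weight, so the following is only one possible sketch. It assumes the weight of a file is the sum, over subject words, of word weight times frequency times a position factor, where the hypothetical `title_boost` parameter doubles the contribution of a word that appears in the title. The posting layout (file name, frequency, in-title flag) is also an assumption.

```python
import math
from typing import Dict, List, Tuple

# Posting: (file name, frequency of the subject word, whether it appears in the title)
Posting = Tuple[str, int, bool]


def file_weights(
    index: Dict[str, List[Posting]],
    word_weights: Dict[str, float],
    title_boost: float = 2.0,
) -> Dict[str, float]:
    """Accumulate a weight per file from subject-word weight, frequency and position."""
    weights: Dict[str, float] = {}
    for word, postings in index.items():
        word_weight = word_weights.get(word, 0.0)
        for name, freq, in_title in postings:
            factor = title_boost if in_title else 1.0
            weights[name] = weights.get(name, 0.0) + word_weight * freq * factor
    return weights


def select_training_files(weights: Dict[str, float], top_ratio: float = 0.3) -> List[str]:
    """Keep the top share of files by weight (at least one file when any exist)."""
    ranked = sorted(weights, key=lambda name: (-weights[name], name))
    k = max(1, math.ceil(len(ranked) * top_ratio))
    return ranked[:k]


weights = file_weights(
    {
        "loan": [("file1", 3, True), ("file2", 1, False)],
        "bond": [("file3", 2, False), ("file2", 1, False)],
    },
    word_weights={"loan": 0.2, "bond": 0.1},
)
training_files = select_training_files(weights, top_ratio=0.3)
```

Here file1 ranks first because its subject word is both frequent and in the title, which mirrors the file 1 example in the paragraph above.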
In an embodiment, the text classification model is a TextCNN model.
TextCNN is a machine learning model that applies a convolutional neural network to text classification. The training set can be used to train the text classification model, such as a TextCNN model, to obtain a trained text classification model. Finally, the file to be classified is input into the trained text classification model to predict its label classification.
Based on the above embodiment, a detailed flowchart of the file label classification method is given below. As shown in FIG. 4, it includes:
step 401, extracting the subject word of each node in a label system tree, wherein the label system tree comprises the subject words of multi-level nodes of a plurality of files;
step 402, adding the subject words of the multi-level nodes above each leaf node in the label system tree into a category group to form a subject word set corresponding to each category group;
step 403, performing word segmentation on the content of the files under each leaf node in the label system tree to obtain a word set corresponding to each category group;
step 404, deleting the stop words, high-frequency words and low-frequency words in the word set corresponding to each category group;
step 405, selecting, with a TF-IDF algorithm, a first set number of words that occur with high frequency in the files from the word set corresponding to each category group as the feature word set corresponding to that category group, and deleting the feature words that occur in the feature word set of every category group;
step 406, adding the feature words in the feature word set corresponding to each category group into the subject word set corresponding to that category group;
step 407, merging the subject words corresponding to all the category groups;
step 408, selecting a second set number of files from the plurality of files;
step 409, creating an inverted index file tree according to the subject words and the frequency at which each subject word occurs in each selected file;
step 410, calculating the weight of each file in the inverted index file tree based on the weight of each subject word and according to the frequency and position at which the subject word appears in each file;
step 411, selecting a third set number of files from the second set number of files based on the weight of each file, and constructing a training set;
step 412, training a text classification model with the training set to obtain a trained text classification model;
step 413, inputting a file to be classified into the trained text classification model, and predicting the label classification of the file to be classified.
Of course, it is to be understood that other embodiments are possible, and that related variations fall within the scope of the application.
In summary, in the method provided by the embodiment of the present application, the subject word of each node in a label system tree is extracted, wherein the label system tree comprises the subject words of multi-level nodes of a plurality of files; the subject words of the multi-level nodes above each leaf node in the label system tree are added into a category group to form a subject word set corresponding to each category group; a feature word set corresponding to each category group is obtained, and the feature words in it are added into the subject word set corresponding to that category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under the leaf node corresponding to the category group; an inverted index file tree is created according to the subject word set corresponding to each category group and the plurality of files, and a training set is constructed based on the inverted index file tree; a text classification model is trained with the training set to obtain a trained text classification model; and a file to be classified is input into the trained text classification model to predict its label classification.
In this embodiment, the subject word set corresponding to each category group is obtained automatically, which avoids manual extraction of label subject words and is efficient. Because the inverted index file tree is created from the subject word set corresponding to each category group and the files, and the training set is constructed from that tree, the training set is built automatically, which improves the efficiency and accuracy of training set construction and ultimately makes the predicted label classification of the file to be classified more accurate.
An embodiment of the application also provides a file label classification device. Its principle is similar to that of the file label classification method, so the details are not repeated here.
Fig. 5 is a schematic diagram of a device for classifying file labels according to an embodiment of the present application, as shown in fig. 5, the device includes:
a subject term extraction module 501, configured to extract the subject term of each node in a label system tree, where the label system tree includes the subject terms of multi-level nodes of a plurality of files;
a first subject term generation module 502, configured to add the subject terms of the upward multi-level nodes of each leaf node in the label system tree to a category group, forming a subject term set corresponding to each category group;
a second subject term generation module 503, configured to obtain a feature word set corresponding to each category group and add the feature words in that set to the subject term set corresponding to the category group, where the feature word set corresponding to each category group includes feature words of the plurality of files under the leaf node corresponding to the category group;
a training set construction module 504, configured to create an inverted index file tree according to the subject term set and the plurality of files corresponding to each category group, and to construct a training set based on the inverted index file tree;
a model training module 505, configured to train the text classification model with the training set to obtain a trained text classification model;
a prediction module 506, configured to input the file to be classified into the trained text classification model and predict the label classification of the file to be classified.
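The work of modules 501 and 502 can be illustrated with a minimal sketch. It assumes a hypothetical `Node` class for the label system tree and assumes that each leaf defines one category group whose subject term set contains the leaf's own term plus the terms of all its ancestors up to the root; the source does not specify whether the leaf or the root is included, so both choices are assumptions. The example labels are invented for illustration.

```python
from typing import Dict, List, Optional


class Node:
    """A node of a hypothetical label system tree; `topic` is its subject term."""

    def __init__(self, topic: str, parent: Optional["Node"] = None):
        self.topic = topic
        self.parent = parent
        self.children: List["Node"] = []
        if parent is not None:
            parent.children.append(self)


def leaves(root: Node) -> List[Node]:
    """Collect the leaf nodes of the tree in depth-first order."""
    if not root.children:
        return [root]
    found: List[Node] = []
    for child in root.children:
        found.extend(leaves(child))
    return found


def subject_term_sets(root: Node) -> Dict[str, List[str]]:
    """Map each leaf (one category group) to the terms on its path to the root."""
    groups: Dict[str, List[str]] = {}
    for leaf in leaves(root):
        terms: List[str] = []
        node: Optional[Node] = leaf
        while node is not None:  # walk upward through the ancestor levels
            terms.append(node.topic)
            node = node.parent
        groups[leaf.topic] = terms
    return groups


# Invented example tree: finance -> loans -> {mortgage, credit card}, finance -> deposits
root = Node("finance")
loans = Node("loans", root)
Node("mortgage", loans)
Node("credit card", loans)
Node("deposits", root)
groups = subject_term_sets(root)
```

Keying the groups by the leaf term keeps the sketch short; a real label system would likely need unique node identifiers, since two leaves could share a term.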
In one embodiment, the second subject term generation module 503 is specifically configured to:
performing word segmentation on the content of the files under each leaf node in the label system tree to obtain a word set corresponding to each category group, where the word set corresponding to each category group includes the words of the plurality of files under the leaf node corresponding to that category group;
selecting, from the word set corresponding to each category group, a first set number of words that occur with high frequency in the files as the feature word set corresponding to that category group, and deleting any feature word that occurs in the feature word set of every category group.
In an embodiment, the apparatus further comprises a preprocessing module 507 for:
after word segmentation is performed on the content of the files under each leaf node in the label system tree to obtain the word set corresponding to each category group, deleting the stop words, high-frequency words and low-frequency words from the word set corresponding to each category group.
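The preprocessing step might look like the following sketch. The function name `filter_words` and the count thresholds `min_count` and `max_count` are assumptions made for illustration; the source does not say how high-frequency and low-frequency words are identified.

```python
from collections import Counter
from typing import Iterable, List, Set


def filter_words(
    words: Iterable[str],
    stop_words: Set[str],
    min_count: int,
    max_count: int,
) -> List[str]:
    """Drop stop words and words whose count falls outside [min_count, max_count]."""
    words = list(words)
    counts = Counter(words)
    return [
        w
        for w in words
        if w not in stop_words and min_count <= counts[w] <= max_count
    ]


# Invented word set for one category group
raw = ["the", "loan", "loan", "rate", "the", "the", "bank", "loan", "rate", "x"]
kept = filter_words(raw, stop_words={"the"}, min_count=2, max_count=2)
```

Here "loan" is dropped as too frequent, "bank" and "x" as too rare, and "the" as a stop word, so only the two occurrences of "rate" remain.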
In one embodiment, the second subject term generation module 503 is specifically configured to:
selecting, by means of a TF-IDF algorithm, a first set number of words that occur with high frequency in the files from the word set corresponding to each category group, and taking those words as the feature word set corresponding to that category group.
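A minimal sketch of TF-IDF based feature word selection follows. It assumes that all words of one category group are treated as a single document and uses the plain formulation tf = count / total and idf = log(N / df), where N is the number of category groups; the source does not state which TF-IDF variant is used. The function name `top_feature_words` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def top_feature_words(group_words: Dict[str, List[str]], k: int) -> Dict[str, List[str]]:
    """Return the k highest-scoring TF-IDF words for each category group."""
    n_groups = len(group_words)
    # Document frequency: in how many category groups does each word occur?
    df: Counter = Counter()
    for words in group_words.values():
        df.update(set(words))

    result: Dict[str, List[str]] = {}
    for group, words in group_words.items():
        counts = Counter(words)
        total = len(words)
        scores = {
            w: (c / total) * math.log(n_groups / df[w]) for w, c in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[group] = ranked[:k]
    return result


# Invented word sets for two category groups
group_words = {
    "mortgage": ["house", "loan", "house", "rate"],
    "credit card": ["card", "loan", "limit", "card"],
}
features = top_feature_words(group_words, k=2)
```

With this formulation a word that occurs in every category group, such as "loan" above, receives a score of zero and is ranked last, which is one way to keep words shared by all groups out of the feature word sets.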
In one embodiment, the training set construction module 504 is specifically configured to:
merging the subject words corresponding to all the category groups;
selecting a second set number of files from the plurality of files;
creating an inverted index file tree according to the subject word and the frequency of occurrence of the subject word in each selected file.
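The three steps above can be sketched as follows. The sketch makes several assumptions: the index is a flat mapping from subject word to per-file occurrence counts rather than the tree structure named in the source, files are given as already segmented token lists, and the "second set number" of files is taken as the first `n_files` entries because the source does not say how files are selected. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_inverted_index(
    subject_sets: Dict[str, List[str]],
    files: Dict[str, List[str]],
    n_files: int,
) -> Dict[str, Dict[str, int]]:
    """Map every subject word to {file id: number of occurrences in that file}."""
    # Step 1: merge the subject words of all category groups.
    vocabulary = set().union(*subject_sets.values())
    # Step 2: select a set number of files (assumption: simply the first n_files).
    selected = dict(list(files.items())[:n_files])
    # Step 3: record how often each subject word occurs in each selected file.
    index: Dict[str, Dict[str, int]] = {w: {} for w in sorted(vocabulary)}
    for file_id, tokens in selected.items():
        counts = Counter(tokens)
        for word in vocabulary:
            if counts[word] > 0:
                index[word][file_id] = counts[word]
    return index


# Invented subject word sets and segmented files
subject_sets = {
    "mortgage": ["mortgage", "loans", "house"],
    "credit card": ["credit", "loans", "card"],
}
files = {
    "a.txt": ["house", "loans", "house"],
    "b.txt": ["card", "loans"],
    "c.txt": ["house"],
}
index = build_inverted_index(subject_sets, files, n_files=2)
```

Looking up a subject word in `index` then returns the selected files that contain it together with the occurrence count, which is the information the later weighting step needs.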
In one embodiment, the training set construction module 504 is specifically configured to:
calculating, based on the weight of each subject word, the weight of each file in the inverted index file tree according to the frequency and position at which the subject words appear in that file;
selecting a third set number of files from the second set number of files based on the weight of each file to construct the training set.
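One possible reading of this weighting is sketched below. The scoring rule is an assumption: each occurrence of a subject word contributes that word's weight, and occurrences in the file title are multiplied by a hypothetical `title_boost` factor to reflect position. The source states only that frequency and position are used, not the formula, and the names and data here are invented.

```python
from typing import Dict, List, Tuple


def file_weight(
    title: List[str],
    body: List[str],
    word_weights: Dict[str, float],
    title_boost: float = 2.0,
) -> float:
    """Sum subject word weights over all occurrences, boosting title occurrences."""
    score = 0.0
    for token in title:
        score += word_weights.get(token, 0.0) * title_boost
    for token in body:
        score += word_weights.get(token, 0.0)
    return score


def select_training_files(
    files: Dict[str, Tuple[List[str], List[str]]],
    word_weights: Dict[str, float],
    n: int,
) -> List[str]:
    """Return the ids of the n highest-weighted files (ties broken by file id)."""
    ranked = sorted(
        files,
        key=lambda fid: (-file_weight(files[fid][0], files[fid][1], word_weights), fid),
    )
    return ranked[:n]


# Invented subject word weights and (title tokens, body tokens) per file
word_weights = {"loans": 1.0, "house": 2.0}
files = {
    "a": (["house"], ["loans", "house"]),
    "b": ([], ["loans"]),
    "c": (["loans"], ["loans"]),
}
chosen = select_training_files(files, word_weights, n=2)
```

In this example file "a" scores 7.0, "c" scores 3.0 and "b" scores 1.0, so "a" and "c" would be kept for the training set.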
In an embodiment, the text classification model is a TextCNN model.
In summary, in the device provided by the embodiment of the present application: the subject term of each node in the label system tree is extracted, where the label system tree includes the subject terms of the multi-level nodes of a plurality of files; the subject terms of the upward multi-level nodes of each leaf node in the label system tree are added to a category group, forming a subject term set corresponding to each category group; a feature word set corresponding to each category group is obtained, and the feature words in it are added to the subject term set corresponding to that category group, where the feature word set corresponding to each category group includes feature words of the plurality of files under the leaf node corresponding to that category group; an inverted index file tree is created according to the subject term set and the plurality of files corresponding to each category group, and a training set is constructed based on the inverted index file tree; the text classification model is trained with the training set to obtain a trained text classification model; and the file to be classified is input into the trained text classification model to predict its label classification.
In this embodiment, the subject term set corresponding to each category group is obtained automatically, which avoids manual extraction of label subject terms and improves efficiency. Because the inverted index file tree is created from the subject term set and the files of each category group, and the training set is constructed from that tree, the training set is also built automatically, which improves the efficiency and accuracy of training set construction and, in turn, makes the predicted label classification of the files to be classified more accurate.
An embodiment of the present application further provides a computer device. Fig. 6 is a schematic diagram of the computer device in the embodiment of the present application. The computer device can implement all the steps of the file label classification method in the foregoing embodiment, and specifically includes the following:
a processor (processor) 601, a memory (memory) 602, a communication interface (Communications Interface) 603, and a bus 604;
wherein the processor 601, the memory 602, and the communication interface 603 complete communication with each other through the bus 604; the communication interface 603 is configured to implement information transmission between related devices such as a server device, a detection device, and a user device;
the processor 601 is configured to invoke a computer program in the memory 602, and when executing the computer program the processor implements all the steps of the file label classification method in the above embodiment.
The embodiment of the present application further provides a computer readable storage medium that can implement all the steps of the file label classification method in the above embodiment. The computer readable storage medium stores a computer program which, when executed by a processor, implements all the steps of the file label classification method in the above embodiment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description of the embodiments is provided to illustrate the general principles of the application and is not intended to limit the scope of the application to the particular embodiments described. Any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the application are intended to be included within the scope of the application.

Claims (9)

1. A method for classifying a file tag, comprising:
extracting the subject term of each node in a label system tree, wherein the label system tree comprises the subject terms of multi-level nodes of a plurality of files;
adding the subject terms of the upward multilevel nodes of each leaf node in the label system tree into a class group to form a subject term set corresponding to each class group;
obtaining a feature word set corresponding to each category group, and adding the feature words in the feature word set corresponding to each category group into the subject term set corresponding to the category group, wherein the feature word set corresponding to each category group comprises feature words of a plurality of files under the leaf node corresponding to the category group;
creating an inverted index file tree according to the subject term set and the plurality of files corresponding to each category group, and constructing a training set based on the inverted index file tree;
training the text classification model by adopting the training set to obtain a trained text classification model;
inputting the files to be classified into a trained text classification model, and predicting label classification of the files to be classified;
wherein creating an inverted index file tree according to the subject word set and the plurality of files corresponding to each category group comprises:
merging the subject words corresponding to all the category groups;
selecting a second set number of files from the plurality of files;
creating an inverted index file tree according to the subject word and the frequency of occurrence of the subject word in each selected file.
2. The method for classifying a document tag according to claim 1, wherein obtaining a feature word set corresponding to each category group comprises:
word segmentation is carried out on the content in the file under each leaf node in the label system tree, a word set corresponding to each category group is obtained, and the word set corresponding to each category group comprises word sets of a plurality of files under the leaf node corresponding to the category group;
and selecting, from the word set corresponding to each category group, a first set number of words that occur with high frequency in the files as the feature word set corresponding to that category group, and deleting any feature word that occurs in the feature word set of every category group.
3. The method for classifying file tags according to claim 2, wherein after performing word segmentation on the content in the file under each leaf node in the tag system tree to obtain the word set corresponding to each class group, the method further comprises:
and deleting the stop words, the high-frequency words and the low-frequency words in the word set corresponding to each category group.
4. The method of classifying a document tag of claim 2, wherein selecting a first set number of words that occur frequently in the document from the set of words corresponding to each category group as the set of feature words corresponding to each category group comprises:
and selecting a first set number of words which occur in the file at high frequency from word sets corresponding to each category group by adopting a TF-IDF algorithm, and taking the first set number of words as a characteristic word set corresponding to each category group.
5. The file tag classification method of claim 1, wherein constructing a training set based on the inverted index file tree comprises:
calculating, based on the weight of each subject word, the weight of each file in the inverted index file tree according to the frequency and position at which the subject words appear in that file;
and selecting a third set number of files from the second set number of files based on the weight of each file, and constructing a training set.
6. The document tag classification method of claim 1, wherein the text classification model is a TextCNN model.
7. A file label classification device, comprising:
the subject term extraction module is used for extracting the subject term of each node in a label system tree, wherein the label system tree comprises the subject terms of multi-level nodes of a plurality of files;
the first subject term generation module is used for adding the subject term of the upward multilevel node of each leaf node in the label system tree into a class group to form a subject term set corresponding to each class group;
the second subject term generation module is used for obtaining a characteristic term set corresponding to each category group, adding the characteristic term in the characteristic term set corresponding to each category group into the subject term set corresponding to the category group, wherein the characteristic term set corresponding to each category group comprises characteristic terms of a plurality of files under the leaf nodes corresponding to the category group;
the training set construction module is used for creating an inverted index file tree according to the subject term set and the plurality of files corresponding to each category group and constructing a training set based on the inverted index file tree;
the model training module is used for training the text classification model by adopting the training set to obtain a trained text classification model;
the prediction module is used for inputting the files to be classified into the trained text classification model and predicting the label classification of the files to be classified;
wherein creating an inverted index file tree according to the subject word set and the plurality of files corresponding to each category group comprises:
merging the subject words corresponding to all the category groups;
selecting a second set number of files from the plurality of files;
creating an inverted index file tree according to the subject word and the frequency of occurrence of the subject word in each selected file.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 6 when executing the computer program.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program for executing the method of any one of claims 1 to 6.
CN202010806917.6A 2020-08-12 2020-08-12 File label classification method and device Active CN111930944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010806917.6A CN111930944B (en) 2020-08-12 2020-08-12 File label classification method and device

Publications (2)

Publication Number Publication Date
CN111930944A CN111930944A (en) 2020-11-13
CN111930944B true CN111930944B (en) 2023-08-22

Family

ID=73311209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010806917.6A Active CN111930944B (en) 2020-08-12 2020-08-12 File label classification method and device

Country Status (1)

Country Link
CN (1) CN111930944B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant