CN117668333A - File classification method, system, equipment and readable storage medium - Google Patents

Info

Publication number: CN117668333A
Authority: CN (China)
Application number: CN202410138691.5A
Other languages: Chinese (zh)
Inventors: 王舸; 岑忠培; 赵雪峰
Assignee (current and original): Beijing Quante Evolution Technology Co ltd
Application filed by: Beijing Quante Evolution Technology Co ltd
Priority to: CN202410138691.5A
Legal status: Pending
Prior art keywords: files, classification, classified, sorted, sorter
Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval; Database Structures and File System Structures Therefor (AREA)

Abstract

The invention discloses a file classification method, system, device and readable storage medium. The method comprises: determining a corresponding sorter group based on the attributes of the files to be classified, wherein the attributes comprise picture, document and table; invoking the sorter group to classify the files to be classified, obtaining classified files and unclassified files; and invoking a classification correction model to perform secondary confirmation on the classified files, obtaining secondarily confirmed classified files as the file classification result. The method accurately classifies file data of various attributes, improving the range and efficiency of file classification.

Description

File classification method, system, equipment and readable storage medium
Technical Field
The present invention relates to the field of language processing, and in particular, to a method, a system, an apparatus, and a readable storage medium for classifying files.
Background
Current document classification schemes suffer from a common problem: they can usually cope only with a single scene or a specific modality, and lack a comprehensive, well-architected system. Existing schemes therefore appear limited when processing diverse and complex data. The lack of an integrated taxonomy may force applications involving multiple modalities and scenarios to employ several different classifiers at once, increasing the complexity and management burden of the system.
Therefore, a comprehensive and unified file classification scheme is urgently needed, one that can effectively handle various data types and application scenarios and improve the overall accuracy and universality of classification.
Disclosure of Invention
The invention mainly aims to provide a file classification method, a system, equipment and a readable storage medium, which are used for accurately classifying file data with various attributes and improving the range and efficiency of file classification.
In order to achieve the above purpose, the embodiment of the present application provides the following technical solutions:
according to a first aspect of embodiments of the present application, there is provided a method for classifying files, the method including:
determining a corresponding sorter group based on attributes of files to be sorted, wherein the attributes comprise pictures, documents and tables;
invoking the sorter group to sort the files to be sorted to obtain sorted files and unclassified files;
and calling a classification correction model to carry out secondary confirmation on the classified files to obtain classified files after secondary confirmation, and taking the classified files as file classification results.
Optionally, the sorter group includes a primary sorter, a secondary sorter, and a tertiary sorter;
the step of invoking the sorter group to classify the files to be classified to obtain classified files and unclassified files comprises:
calling the primary sorter to classify the files to be classified, and counting the successfully classified files as classified files;
invoking the secondary sorter to classify the files that the primary sorter failed to classify, and counting the successfully classified files as classified files;
and invoking the tertiary sorter to classify the files that the secondary sorter failed to classify, counting the successfully classified files as classified files and the remaining unclassifiable files as unclassified files.
Optionally, determining the corresponding sorter group based on the attribute of the file to be sorted includes:
if the attribute of the file to be classified is picture, the corresponding sorter group is an image classification model, which identifies objects or scenes in the image and classifies them;
if the attribute of the file to be classified is document, the corresponding sorter group is a text classification model, which performs semantic analysis on the topic of the file and classifies it;
and if the attribute of the file to be classified is table, the corresponding sorter group is a statistical model, which performs probability estimation on the table data and classifies it.
Optionally, the training process of each model in the sorter group includes:
performing feature extraction on the data source samples aiming at the data source samples of each attribute to obtain a sample feature set;
marking the sample feature set to obtain a marked data set;
training based on the labeling data set to obtain a trained model.
Optionally, the calling the classification correction model performs secondary confirmation on the classified file to obtain a classified file after secondary confirmation, including:
calling a large language model to screen files which do not belong to the corresponding classification from the classified files;
and calling the large language model to classify the files which do not belong to the corresponding classification, so as to obtain classified files after secondary confirmation.
Optionally, the calling the large language model to screen files not belonging to the corresponding classification from the classified files includes:
for any class of the classified files, invoking the large language model to calculate the distance between the text vector corresponding to each file and the center of that class;
acquiring a plurality of files in order from nearest to farthest;
calling the large language model to derive a type definition from the acquired files;
traversing each of the plurality of files to find the files belonging to the type definition, and determining the remaining files as files not belonging to the corresponding classification.
Optionally, before the determining the corresponding sorter group based on the attribute of the file to be sorted, the method further includes:
and performing multi-mode processing and format alignment processing based on the attributes of the files to be classified.
According to a second aspect of embodiments of the present application, there is provided a document classification system, the system comprising:
the sorter determination module is used for determining a corresponding sorter group based on the attribute of the file to be sorted, wherein the attribute comprises a picture, a document and a table;
the primary classification module is used for calling the sorter group to classify the files to be classified to obtain classified files and unclassified files;
and the secondary classification module is used for calling a classification correction model to carry out secondary confirmation on the classified files to obtain classified files after secondary confirmation, and the classified files are used as file classification results.
According to a third aspect of embodiments of the present application, there is provided an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the computer program to perform the method of the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon computer readable instructions executable by a processor to implement the method of the first aspect described above.
In summary, the embodiments of the present application provide a file classification method, system, device, and readable storage medium: a corresponding sorter group is determined based on the attributes of the files to be classified, wherein the attributes comprise picture, document and table; the sorter group is invoked to classify the files to be classified, obtaining classified files and unclassified files; and a classification correction model is invoked to perform secondary confirmation on the classified files, obtaining secondarily confirmed classified files as the file classification result. In this way, file data of various attributes is accurately classified, improving the range and efficiency of file classification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the structures shown in these drawings without inventive effort for a person skilled in the art.
The structures, proportions, sizes, etc. shown in the present specification are shown only for the purposes of illustration and description, and are not intended to limit the scope of the invention, which is defined by the claims, so that any structural modifications, changes in proportions, or adjustments of sizes, which do not affect the efficacy or the achievement of the present invention, should fall within the scope of the invention.
Fig. 1 is a schematic flow chart of a file classification method according to an embodiment of the present application;
FIG. 2 is a flowchart of a training phase of a multi-level classification matrix according to an embodiment of the present application;
FIG. 3 is a flow chart of a sorting phase of a multi-level classification matrix provided in an embodiment of the present application;
FIG. 4 is a block diagram of a file classification system according to an embodiment of the present application;
fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 6 shows a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that all directional indicators (such as up, down, left, right, front, rear, etc.) in the embodiments of the present invention are merely used to explain the relative positional relationship, movement, etc. between the components in a particular posture (as shown in the drawings), and if the particular posture changes, the directional indicator changes accordingly.
Furthermore, descriptions such as those referred to as "first," "second," and the like, are provided for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implying an order of magnitude of the indicated technical features in the present disclosure. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the present invention, unless specifically stated and limited otherwise, the terms "connected," "affixed," and the like are to be construed broadly, and for example, "affixed" may be a fixed connection, a removable connection, or an integral body; can be mechanically or electrically connected; either directly or indirectly, through intermediaries, or both, may be in communication with each other or in interaction with each other, unless expressly defined otherwise. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In addition, the technical solutions of the embodiments of the present invention may be combined with each other, but it is necessary to be based on the fact that those skilled in the art can implement the technical solutions, and when the technical solutions are contradictory or cannot be implemented, the combination of the technical solutions should be considered as not existing, and not falling within the scope of protection claimed by the present invention.
Although existing file classification schemes have achieved some success in solving single-scene or single-modality problems, they still have notable defects that limit their wide use in practical applications.
1. Existing schemes often have difficulty accommodating the multi-modal nature of the data. With the progress of technology, data sources become more diverse, covering various modes such as text, image, audio, etc. Traditional file classification schemes may fail in processing these mixed-modality data because they tend to focus on the processing of specific data types and fail to effectively capture and fuse the characteristics of the multi-modality data.
2. Training samples have a tremendous impact on the training results of a model and are, by default, assumed to be sufficiently reliable and accurate. However, this assumption does not always hold in practice: training data often contains a large number of label errors, so sufficient caution must be maintained with the training samples to ensure that model training benefits from high-quality, accurate data. In practice, therefore, it must be appreciated that the quality of the training samples may be unsatisfactory, and corresponding measures must be taken to handle possible label errors so as to improve the robustness and generalization performance of the model. This careful treatment of the training data is critical to ensuring the model's reliability in the face of real-world complexity.
3. Lack of uniform standards and specifications is also a limiting factor. The requirements of different fields and different application scenes on file classification are quite different, so that a large number of customized schemes are caused, and the whole is difficult to coordinate and unify. This phenomenon of fragmentation limits interoperability between different systems and also increases the difficulty of system upgrades and maintenance.
4. Current solutions face considerable computational and storage pressure when handling large-scale, high-dimensional data. As data sizes grow, conventional classification algorithms and models may suffer reduced efficiency, and with high-dimensional data, conventional methods may fail to fully mine the features of the data, degrading classification performance.
5. Handling of uncertainty and dynamics is lacking. In practical applications, uncertainty and dynamic change in the data are common, yet existing schemes often cannot cope flexibly with these changes, resulting in unstable and insufficiently reliable classification results.
Therefore, in order to solve these drawbacks, development of a more comprehensive and adaptive file classification scheme is needed to cope with diverse data and complex application scenarios.
Fig. 1 shows a file classification method provided in an embodiment of the present application, where the method includes:
step 101: determining a corresponding sorter group based on attributes of files to be sorted, wherein the attributes comprise pictures, documents and tables;
step 102: invoking the sorter group to sort the files to be sorted to obtain sorted files and unclassified files;
step 103: and calling a classification correction model to carry out secondary confirmation on the classified files to obtain classified files after secondary confirmation, and taking the classified files as file classification results.
In a possible implementation manner, before step 101, the method further includes:
and performing multi-mode processing and format alignment processing based on the attributes of the files to be classified.
In a possible implementation manner, in step 101, determining a corresponding sorter group based on the attribute of the file to be sorted includes:
if the attribute of the file to be classified is picture, the corresponding sorter group is an image classification model, which identifies objects or scenes in the image and classifies them;
if the attribute of the file to be classified is document, the corresponding sorter group is a text classification model, which performs semantic analysis on the topic of the file and classifies it;
and if the attribute of the file to be classified is table, the corresponding sorter group is a statistical model, which performs probability estimation on the table data and classifies it.
In one possible embodiment, the sorter group includes a primary sorter, a secondary sorter, and a tertiary sorter.
In a possible implementation manner, in step 102, the calling the sorter group to sort the files to be sorted to obtain sorted files and unclassified files includes:
calling the primary sorter to classify the files to be classified, and counting the successfully classified files as classified files; invoking the secondary sorter to classify the files that the primary sorter failed to classify, and counting the successfully classified files as classified files; and invoking the tertiary sorter to classify the files that the secondary sorter failed to classify, counting the successfully classified files as classified files and the remaining unclassifiable files as unclassified files.
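The cascade above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stand-in sorters (an exact matcher, a keyword matcher, and a placeholder for an LLM-backed stage) and the file names are hypothetical.

```python
def cascade_classify(files, sorters):
    """Run files through a cascade of sorters; each stage only sees
    the files the previous stage failed to classify."""
    classified, remaining = [], list(files)
    for sorter in sorters:          # primary, secondary, tertiary ...
        still_unclassified = []
        for f in remaining:
            label = sorter(f)       # returns a label, or None on failure
            if label is not None:
                classified.append((f, label))
            else:
                still_unclassified.append(f)
        remaining = still_unclassified
    return classified, remaining    # remaining = unclassified files

# toy stand-in sorters: an exact matcher, a looser keyword matcher,
# and a placeholder for a heavier (e.g. LLM-backed) stage
primary = lambda f: "invoice" if f == "invoice_2024.pdf" else None
secondary = lambda f: "report" if "report" in f else None
tertiary = lambda f: None

done, left = cascade_classify(
    ["invoice_2024.pdf", "annual_report.docx", "photo.png"],
    [primary, secondary, tertiary],
)
```

Files that survive all stages unclassified (here `photo.png`) are exactly the "unclassified files" the method hands onward.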
In one possible embodiment, the training process of each model in the sorter group includes:
performing feature extraction on the data source samples aiming at the data source samples of each attribute to obtain a sample feature set; marking the sample feature set to obtain a marked data set; training based on the labeling data set to obtain a trained model.
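The three training steps (feature extraction, labeling, training) can be sketched per attribute as below. The toy two-number feature and the nearest-centroid "model" are illustrative stand-ins only; the patent's actual models (image, text, statistical) are not specified at this level.

```python
from collections import defaultdict

def extract_features(sample):
    # toy feature: (length, digit count); a real system would use image
    # embeddings, text vectors or table statistics depending on attribute
    return (len(sample), sum(ch.isdigit() for ch in sample))

def build_labeled_set(samples, labeler):
    feats = [extract_features(s) for s in samples]            # sample feature set
    return [(f, labeler(s)) for f, s in zip(feats, samples)]  # labeled data set

def train(labeled):
    # "training" here is a per-label centroid, standing in for a real model
    groups = defaultdict(list)
    for feat, label in labeled:
        groups[label].append(feat)
    return {lab: tuple(sum(c) / len(c) for c in zip(*fs))
            for lab, fs in groups.items()}

model = train(build_labeled_set(
    ["abc", "a1b2c3", "xyz"],
    labeler=lambda s: "text" if not any(ch.isdigit() for ch in s) else "table",
))
```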
In a possible implementation manner, in step 103, the calling the classification correction model to perform secondary confirmation on the classified file to obtain a classified file after secondary confirmation includes:
calling a large language model to screen files which do not belong to the corresponding classification from the classified files; and calling the large language model to classify the files which do not belong to the corresponding classification, so as to obtain classified files after secondary confirmation.
In one possible implementation, the invoking the large language model to screen files from the classified files that do not belong to the respective classifications includes:
for any class of the classified files, invoking the large language model to calculate the distance between the text vector corresponding to each file and the center of that class; acquiring a plurality of files in order from nearest to farthest; calling the large language model to derive a type definition from the acquired files; traversing each of the plurality of files to find the files belonging to the type definition, and determining the remaining files as files not belonging to the corresponding classification.
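A sketch of this screening loop for one class follows. The `fits_definition` callback stands in for the two LLM calls (deriving a type definition from the nearest files and testing each file against it); the vectors and file names are made-up examples.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def screen_class(vectors, k, fits_definition):
    """vectors: {file_name: text_vector} for one classified class.
    fits_definition stands in for the LLM deriving a type definition
    from the k nearest files and testing each file against it."""
    n = len(next(iter(vectors.values())))
    center = tuple(sum(v[i] for v in vectors.values()) / len(vectors)
                   for i in range(n))
    # order files from nearest to farthest from the class center
    ordered = sorted(vectors, key=lambda f: euclid(vectors[f], center))
    definition_files = ordered[:k]          # basis for the type definition
    misfits = [f for f in vectors if not fits_definition(f, definition_files)]
    return definition_files, misfits

vecs = {"a.txt": (0.0, 0.0), "b.txt": (0.2, 0.1), "outlier.txt": (5.0, 5.0)}
core, wrong = screen_class(vecs, k=2,
                           fits_definition=lambda f, basis: f != "outlier.txt")
```

The files in `wrong` are then re-classified by the large language model, completing the secondary confirmation.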
The file classification method provided in the embodiment of the present application is explained in detail below. The scheme specifically comprises the following modules:
1. The pre-processing module: and performing multi-mode processing and format alignment processing based on the attributes of the files to be classified.
The complexity of the sample data to be classified is reflected not only in dimensions such as size, number and source, but also in multi-modal data types such as images, text and voice, and in mixtures of different formats (jpg, png, MP, word and excel) and multiple languages (Chinese and English). This places extremely high demands on the relevant modules of the classification system. To reduce the classification difficulty, a series of pre-processing steps is needed, including multi-modal processing and complex format alignment, so that the system better fits the processing requirements of diversified data.
Multi-modal processing: the multi-modal fusion technology in deep learning is introduced, so that the system can process data of various modes such as images, characters, voice and the like more finely. This will significantly enhance the comprehensive expressive power of the data, allowing the system to more fully understand and utilize the multimodal information.
Complex format alignment: an adaptive alignment algorithm is introduced, and through an intelligent alignment process, distortion and loss of information among data in different formats are avoided, so that the integrity and usability of the data are ensured. This step will achieve a higher level of data quality and consistency during the data preprocessing stage.
Multi-modal processing and complex format alignment together improve the system's ability to process data of various modalities such as images, text and voice, and enhance the expressive power of the data. Complex format alignment ensures the intelligent alignment of data in different formats through the adaptive alignment algorithm, improving data integrity and usability.
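A minimal sketch of this pre-processing step is shown below: route a file to its modality and a canonical target format so that downstream sorters see aligned inputs. The extension table and target formats are hypothetical examples; real multi-modal fusion and adaptive alignment would go well beyond extension matching.

```python
from pathlib import Path

# hypothetical modality routing table; a real system would also sniff content
MODALITY = {".jpg": "image", ".png": "image", ".mp3": "audio",
            ".doc": "text", ".docx": "text", ".txt": "text",
            ".xls": "table", ".xlsx": "table", ".json": "table"}

def preprocess(path):
    """Assign a modality and a canonical aligned format to one file."""
    modality = MODALITY.get(Path(path).suffix.lower(), "unknown")
    target = {"image": "png", "text": "utf8-text",
              "table": "csv", "audio": "wav"}.get(modality)
    return {"file": path, "modality": modality, "aligned_format": target}

rec = preprocess("contract.docx")
```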
2. Training sample classification label adaptive correction:
training samples have an important impact on the training results of the model, and there may be a large number of label errors in the training data. A key optimization measure, namely a training sample classification label adaptive correction algorithm, is proposed, and aims to deal with possible label errors in training data. Ensuring that the model's understanding of the tag remains accurate and robust during the training process. This innovative approach helps to improve the quality of the training data, thereby enhancing the robustness and generalization performance of the model.
Error category automatic identification algorithm: an automatic error category identification algorithm is introduced, which can identify unreasonable classification categories.
Error category automatic correction algorithm: an automatic error-category correction algorithm is introduced, which uses the misclassified samples and related information identified by the previous step to derive a more accurate and scientifically sound correct classification category.
Specifically, define a training sample set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.
Each training sample contains features x and a label y. The labels are usually annotated manually and may be incorrect due to human negligence or ambiguity in the classification criteria. The following algorithm is designed to correct them:
Randomly sample from the training sample set D: D' = Sample(D, sample_size), where sample_size is the size of the random sample, yielding a training sample subset D'. Samples are then drawn from D' in turn, and similarity is calculated against all the other samples in D.
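The sampling and matrix construction can be sketched as follows. The character-overlap similarity is a toy stand-in for a real similarity such as cosine similarity over word vectors; the fixed seed and the example set D are assumptions for reproducibility.

```python
import random

def similarity_matrix(D, sample_size, sim, seed=0):
    """D' = Sample(D, sample_size); SIM{i, j} = S(x_i, x_j) for each
    x_i drawn from the subset D' and each x_j in the full set D."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    D_prime = rng.sample(D, sample_size)
    SIM = [[sim(xi, xj) for xj in D] for xi in D_prime]
    return D_prime, SIM

# toy similarity: Jaccard overlap of character sets (a stand-in for
# cosine similarity over word vectors)
overlap = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))

D = ["cat", "car", "dog", "cart"]
D_prime, SIM = similarity_matrix(D, sample_size=2, sim=overlap)
```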
Because of differences between data modalities, the similarity calculation is described here with emphasis on natural-language samples:
S(x_i, x_j), where x_i is a sample drawn from D' and x_j is a sample drawn from D. Traversing all of D yields a similarity matrix SIM, where SIM{i, j} = S(x_i, x_j).
The similarity function S here is not a fixed algorithm; different algorithms may be adopted in different situations. A common method for computing word-vector similarity in natural language is cosine similarity (Cosine Similarity), which evaluates the similarity of two word vectors by calculating the cosine of the angle between them, based on the spatial geometry of the word-vector representation.
Let two words A and B be represented by word vectors v_A and v_B respectively. Their cosine similarity S can be calculated by the following formula:

S(A, B) = (v_A · v_B) / (||v_A|| ||v_B||)

where · represents the dot product of the vectors and ||v|| represents the norm of the vector v. The cosine similarity takes values in the range [−1, 1]: the closer the value is to 1, the higher the similarity of the two words; the closer to −1, the lower the similarity; and a value of 0 indicates that the two word vectors are orthogonal.
This method is widely used in natural language processing because it is simple and efficient and captures semantic relationships between words. The relative position of the word vector in the vector space can be quantized by calculating the cosine similarity, so that the semantic similarity among words is measured, and basic support is provided for tasks such as text similarity and information retrieval.
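The cosine formula above, written as a directly runnable function (pure Python, no trained word-vector model assumed; the example vectors are made up):

```python
import math

def cosine_similarity(u, v):
    """S = (u . v) / (||u|| * ||v||); result lies in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

same = cosine_similarity((1.0, 2.0), (2.0, 4.0))        # parallel vectors
orthogonal = cosine_similarity((1.0, 0.0), (0.0, 1.0))  # orthogonal vectors
```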
Word vector (Word vector) techniques are typically trained using neural network models. Word2Vec and GloVe are two common Word vector models. The following are their mathematical expressions:
a. Word2Vec:
In the Skip-gram model, given the center word c and a context word o, the training goal is to maximize their conditional probability:

P(o | c) = exp(v_o^T v_c) / Σ_{w ∈ V} exp(v_w^T v_c)

where v_o and v_c are the word vectors of the context word o and the center word c respectively, and V is the whole vocabulary.
In the Continuous Bag of Words (CBOW) model, given the context words and the center word c, the training goal is to maximize the conditional probability of the center word c:

P(c | context) = exp(v_c^T v_context) / Σ_{w ∈ V} exp(v_w^T v_context)

where context represents all the words in the context window and v_context is the average of their word vectors.
b. GloVe (Global Vectors for Word Representation):
The goal of GloVe is to learn global vocabulary co-occurrence statistics, with the loss function:

J = Σ_{i,j} f(P_ij) (w_i^T w_j + b_i + b_j − log P_ij)^2

where P_ij represents the co-occurrence statistic of word i and word j, w_i and w_j are the word vectors of word i and word j respectively, b_i and b_j are bias terms, and f is a weighting function.
These formulas demonstrate the core concept in the Word2Vec and GloVe models, i.e., learning Word vectors by maximizing or minimizing some probability or loss function. These vectors capture semantic relationships between words such that words of similar meaning are more closely related in vector space.
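For concreteness, the Skip-gram conditional probability can be evaluated numerically over a toy vocabulary. The two-dimensional vectors below are made up, not trained; the point is only that the softmax normalizes the dot-product scores into a probability distribution.

```python
import math

def skipgram_prob(v_o, v_c, vocab_vectors):
    """P(o | c) = exp(v_o . v_c) / sum over w in V of exp(v_w . v_c)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [math.exp(dot(v_w, v_c)) for v_w in vocab_vectors]
    return math.exp(dot(v_o, v_c)) / sum(scores)

vocab = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # toy output vectors
center = (1.0, 0.0)
probs = [skipgram_prob(v, center, vocab) for v in vocab]
```

Words whose vectors align with the center word (here the first and third) receive equal, higher probability than the orthogonal one.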
In addition, similarity needs to be calculated for the label of each sample. Since the class label is a discrete type of data, the similarity of labels is calculated using a tree model.
In a tree model, the distance between nodes is typically represented by the path length (Path Length), i.e. the number of edges on the shortest path connecting the two target nodes in the tree. For a tree containing N nodes, the path length between nodes i and j can be defined as d_ij.
In this case, the distance between two label nodes i and j can be expressed using the following formula:

Distance(i, j) = d_ij
This distance value represents the shortest path length between nodes i and j. In tree models such as decision trees, the path length between nodes is typically determined by the partitioning paths in feature space; in particular, it can be calculated by counting the number of edges on the tree between the two nodes.
In a tree model, common distance metrics include path length and depth of the tree. These metrics may be used to calculate the relative positions between nodes for hierarchical classification.
Comparing the two matrices SIM and Distance makes it possible to calculate, for each pair of samples, the ratio of the feature distance to the label distance. If this ratio is very small, two very close samples have been classified into different categories, which indicates that a sample has likely been mislabeled.
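The label distance and the mislabel test can be sketched as follows. The parent-map label hierarchy, the similarity threshold, and the exact mislabel criterion (high feature similarity paired with large label distance) are illustrative assumptions, not values from the patent.

```python
def path_length(tree, i, j):
    """Number of edges between label nodes i and j.
    tree maps each node to its parent (the root maps to None)."""
    def to_root(n):
        path = [n]
        while tree[n] is not None:
            n = tree[n]
            path.append(n)
        return path
    pi, pj = to_root(i), to_root(j)
    common = set(pi) & set(pj)
    # distance = steps from i plus steps from j to their nearest common ancestor
    return min(pi.index(c) + pj.index(c) for c in common)

# hypothetical label hierarchy: root -> {docs, media}; docs -> {invoice, report}
tree = {"root": None, "docs": "root", "media": "root",
        "invoice": "docs", "report": "docs"}

def suspicious(feature_sim, label_dist, threshold=0.5):
    """High feature similarity combined with a large label distance means
    two near-identical samples carry different labels: a likely mislabel."""
    return feature_sim > threshold and label_dist >= 2

d = path_length(tree, "invoice", "media")
flag = suspicious(feature_sim=0.93, label_dist=d)
```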
3. Multilayer class classification model matrix:
the lack of single model processing power is one of the bottlenecks faced by current machine learning models. This is manifested in that they generally can only handle classification tasks of a single modality or specific scenario, while different models present challenges in terms of both classification efficiency and accuracy. Some simple samples may only need to be classified using regular expressions, while for more complex samples, LLM is relied upon for semantic analysis. Therefore, a multi-level classification model matrix needs to be constructed, and a resource scheduling module is combined to complete the classification work more comprehensively and efficiently, so that the intelligent processing capacity of the system on multiple samples is improved. The model matrix combines the AutoML technology and the flexible computing resource scheduling, improves the classification effect, flexibly distributes computing resources under dynamic load, and maximizes the system throughput.
And (3) resource scheduling: by introducing an elastic computing resource scheduling strategy, the system can flexibly and dynamically allocate computing resources according to the current real-time data load and the change of model complexity so as to achieve optimal system throughput. This enables the system to maintain a high degree of adaptability and efficiency in the face of different workloads and demands.
Model routing: the system introduces an intelligent model routing mechanism, which can automatically carry out a classification decision process. The mechanism can intelligently select the most suitable classification model for processing by considering the characteristics, complexity, real-time system load and other factors of the current sample. The system not only improves the overall classification efficiency, but also has more flexibility, and can better adapt to challenges brought by different scenes. The intelligent model routing not only provides efficient decision making capability for the system, but also enhances the overall robustness of the system. The mechanism realizes an intelligent classification decision flow and dynamically selects the most suitable classification model. The online learning capability allows the system to continuously optimize the accuracy of model selection.
The module encompasses multiple machine learning models, including but not limited to OCR models for extracting text from images, image classification models that classify images directly, and models such as LDA and LSA that classify natural language. In addition, large language models are introduced as a key component. This rich combination of models gives the system diverse learning and decision-making capabilities, allowing it to understand and process different data types and tasks more fully and deeply. The comprehensive model combination makes the system more adaptive and effective across multiple domains and modalities, improving overall performance and the range of applications.
The multi-level classification matrix comprises two main phases.
Fig. 2 illustrates a training phase of a multi-level classification matrix provided by an embodiment of the present application.
According to the different data sources, a labeled data set is obtained through the data labeling module, and N sorting models are trained on it. At this stage, the system trains on the classification data, generating a series of sorting models including image classification, text classification, statistical models, and so on. The goal of this stage is to enable the models to accurately classify different types of information by learning the characteristics and patterns of the data. The data sources include PDF/PNG, which is divided into non-semantic data (e.g., palmprint, fingerprint) and semantic data, as well as Word/TXT and JSON/Excel. Information is extracted from the data sources, the data is labeled based on manual annotation results, and additional labeled sample data is generated by a synthetic data model. The labeled sample data is used to train the sorting models, which include an image sorting model, a text sorting model, and a statistical model.
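The training stage above can be sketched with a toy text sorter: labeled samples are used to build per-class token statistics, and a new document is scored by token overlap. The scoring rule, class names, and samples are illustrative stand-ins for the actual sorting models, not from the patent.

```python
from collections import Counter

def train_text_sorter(texts, labels):
    """Train one toy 'sorting model': per-class token counts + overlap scoring."""
    class_tokens = {}
    for text, label in zip(texts, labels):
        class_tokens.setdefault(label, Counter()).update(text.lower().split())

    def predict(text):
        tokens = text.lower().split()
        # Pick the class whose training tokens overlap the sample most.
        return max(class_tokens,
                   key=lambda c: sum(class_tokens[c][t] for t in tokens))
    return predict

texts = ["annual revenue report", "fingerprint scan attached",
         "quarterly revenue summary", "palmprint image enclosed"]
labels = ["semantic", "non_semantic", "semantic", "non_semantic"]
sorter = train_text_sorter(texts, labels)
print(sorter("revenue figures for q3"))  # semantic
```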
Fig. 3 illustrates the sorting stage, also called the hierarchical classification stage, of a multi-level classification matrix provided by an embodiment of the present application. The primary sorter is deployed on the same machine for a first screening, the secondary sorter performs a second screening, and the tertiary sorter performs the final screening; the primary sorter is the most efficient, while the tertiary sorter is the most accurate. At this stage, the system uses the models generated during the training stage to hierarchically classify various types of data. This process is the key link in which the system comprehensively analyzes and distinguishes the input data, thereby realizing its hierarchical classification. The accuracy and effect of the hierarchical classification are directly affected by the quality of the models from the training stage, so the two stages are interdependent and together form the complete workflow of the multi-layer hierarchical model. This design makes the system more flexible and efficient in coping with different data types and tasks.
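A minimal sketch of the cascading screening just described: samples pass through a list of sorters ordered fast-to-accurate, each stage keeping what it can classify and passing its failures down. Returning `None` to signal "cannot classify" and the suffix-based toy sorters are illustrative conventions, not from the patent.

```python
def cascade_sort(files, sorters):
    """Run files through a cascade of sorters; failures fall through to the next."""
    classified, pending = {}, list(files)
    for sorter in sorters:          # primary, secondary, tertiary, ...
        failed = []
        for f in pending:
            label = sorter(f)
            if label is not None:
                classified[f] = label
            else:
                failed.append(f)
        pending = failed            # next (slower, more accurate) stage
    return classified, pending      # pending = unclassified files

# Toy stages: primary handles .png, secondary handles .txt, tertiary gives up.
primary   = lambda f: "image" if f.endswith(".png") else None
secondary = lambda f: "text"  if f.endswith(".txt") else None
tertiary  = lambda f: None
done, todo = cascade_sort(["a.png", "b.txt", "c.bin"], [primary, secondary, tertiary])
print(done)  # {'a.png': 'image', 'b.txt': 'text'}
print(todo)  # ['c.bin']
```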
The multi-level classification model matrix encompasses multiple classification methods, each of which performs a different function during the training and sorting stages. The following is a brief introduction to several common classification models:
Image classification model: such models focus on processing image data and are trained using structures such as convolutional neural networks (CNNs) and ResNet. During the training phase, the model learns features extracted from images, such as texture and shape. In the sorting stage, the model classifies new images, identifying the objects or scenes they contain.
Text classification model: text classification models are used primarily to process text data and are trained using structures such as Transformers, LSA, and LDA. During the training phase, the model learns the semantic and structural features of the text. During the sorting stage, the model classifies text, for example labeling a document's topic or performing sentiment analysis.
Statistical model: the statistical model is based on the statistical distribution and probability theory of the data and is suited to specific types of data distributions. In the training stage, the model learns the statistical regularities of the data. In the sorting stage, the model uses these regularities to estimate class probabilities for new data and sort it accordingly, which is particularly advantageous when processing continuous data.
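As a toy instance of such a statistical model, the following sketch fits a per-class Gaussian to one continuous feature and classifies new values by log-likelihood. One feature, two classes, and the data values are illustrative simplifications.

```python
import math

def fit_gaussians(values, labels):
    """Learn (mean, variance) of the feature for each class."""
    params = {}
    for c in set(labels):
        xs = [v for v, l in zip(values, labels) if l == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs) or 1e-9  # avoid zero variance
        params[c] = (mu, var)
    return params

def classify(params, x):
    """Assign x to the class under which it is most likely."""
    def log_pdf(x, mu, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_pdf(x, *params[c]))

params = fit_gaussians([1.0, 1.2, 0.9, 9.8, 10.1, 10.3],
                       ["low", "low", "low", "high", "high", "high"])
print(classify(params, 1.1))   # low
print(classify(params, 9.9))   # high
```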
These classification models learn from the labeled data during the training phase, extracting the features of different data types, and are then used to classify new data during the sorting phase. Such a multi-model combination allows various data types to be understood and processed more comprehensively and thoroughly, improving the adaptability and accuracy of the system.
4. The classification basis management module:
In cases where the classification basis is ambiguous, some classification categories may appear relatively abstract and lack an explicit definition, which can confuse intelligent sorters: when it is unclear which class a sample should fall into, classification errors may result. It is therefore important to construct a classification basis management module that defines the different classifications, provides the system with more accurate and explicit guidance, and ensures the rationality and stability of the classification system.
Well-defined classification basis: by combining domain expert knowledge, natural language processing and knowledge graph technology, the system establishes a clear and dynamic classification basis, and provides more interpretability and adjustability for classification decision. This will provide a deeper and comprehensive understanding capability for the system.
Intelligent learning: with reinforcement learning algorithms, the system will gradually have the ability to learn and adapt autonomously. By continuously optimizing the classification basis, the system can realize self-evolution, adapt to new scenes and changes, and improve the intelligence of decision making.
Classification basis management module and intelligent learning: the well-defined classification basis combines domain expert knowledge with natural language processing techniques, making the classification more interpretable and adaptable, while the reinforcement learning algorithm gives the system autonomous learning and adaptability.
5. System integration and optimization:
module cooperation: the system will be designed as a modular architecture, utilizing message queuing and micro-service techniques, to ensure efficient co-operation of the various modules. This design will make the system easier to expand and maintain, improving overall performance.
Performance monitoring and adjustment: with an integrated monitoring dashboard, the system automatically adjusts and optimizes itself based on real-time performance indicators. This helps ensure that the system remains efficient under high load and in complex scenarios, achieving long-term reliability.
The collaborative module design and the performance monitoring and adjustment techniques enable the modules of the system to work together efficiently and ensure the overall performance of the system. Adjusting performance in real time through the monitoring dashboard ensures that the system operates efficiently in different scenarios.
6. Data security classification
On top of data classification, a grading concept is introduced to classify the security level of the data. This may include separating the data into different tiers by sensitivity, ensuring that data at a high security level is protected more strictly. This is important for intelligent sorting, especially where sensitive information is involved.
a. Defining a security hierarchy: and formulating an explicit data security grading standard, and dividing according to factors such as sensitivity, privacy and the like of the data. The data is classified into different levels such as public information, internal information, confidential information, and the like.
b. Establishing a classification-to-hierarchy mapping: a mapping relationship with the data security level is defined for each data class. It is determined which data classifications correspond to which security levels.
c. Making a mapping management strategy: and designing a mapping management strategy to ensure the accuracy and consistency of the mapping relation. Rules or criteria are considered to be formulated to ensure that all data classifications have a corresponding level of security.
d. Realizing the automatic management of the mapping relation: consider the use of an automated tool or system to manage the mapping of classifications to hierarchies. Automatic updating and adjustment of the mapping relation are realized to cope with the change of data classification and security standard.
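Steps a-d above can be sketched as an explicit classification-to-level mapping with a validation rule that enforces coverage and consistency. All category and level names here are illustrative assumptions, not from the patent.

```python
# Step a: explicit security levels, from least to most sensitive.
SECURITY_LEVELS = ["public", "internal", "confidential"]

# Step b: mapping from data classifications to security levels.
CLASS_TO_LEVEL = {
    "marketing_brochure": "public",
    "meeting_minutes":    "internal",
    "fingerprint_scan":   "confidential",
    "contract":           "confidential",
}

def validate_mapping(classes, mapping, levels):
    """Steps c/d: every known classification must map to a known level."""
    missing = [c for c in classes if c not in mapping]
    invalid = [c for c, lvl in mapping.items() if lvl not in levels]
    return missing, invalid

missing, invalid = validate_mapping(
    ["marketing_brochure", "contract", "payroll_record"],
    CLASS_TO_LEVEL, SECURITY_LEVELS)
print(missing)  # ['payroll_record'] -- new classification without a level yet
print(invalid)  # []
```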
In summary, the embodiments of the present application provide a file classification method: a corresponding sorter group is determined based on the attributes of the files to be classified, where the attributes include pictures, documents, and tables; the sorter group is invoked to classify the files to be classified, obtaining classified files and unclassified files; and a classification correction model is invoked to perform a secondary confirmation on the classified files, obtaining secondarily confirmed classified files as the file classification result. File data of various attributes is thus classified accurately, improving the range and efficiency of file classification.
Based on the same technical concept, the embodiment of the application further provides a file classification system, as shown in fig. 4, where the system includes:
a sorter determination module 401, configured to determine a corresponding sorter group based on attributes of the files to be sorted, where the attributes include a picture, a document, and a table;
a primary classification module 402, configured to invoke the sorter group to classify the file to be classified, to obtain a classified file and an unclassified file;
and the secondary classification module 403 is configured to call a classification correction model to perform secondary confirmation on the classified file, and obtain a classified file after secondary confirmation, which is used as a file classification result.
The embodiment of the application also provides electronic equipment corresponding to the method provided by the embodiment. Referring to fig. 5, a schematic diagram of an electronic device according to some embodiments of the present application is shown. The electronic device 20 may include: a processor 200, a memory 201, a bus 202 and a communication interface 203, the processor 200, the communication interface 203 and the memory 201 being connected by the bus 202; the memory 201 stores a computer program executable on the processor 200, and the processor 200 executes the method provided in any of the foregoing embodiments of the present application when the computer program is executed.
The memory 201 may include a high-speed random access memory (RAM: Random Access Memory), and may further include a non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is implemented through at least one communication interface 203 (which may be wired or wireless); the Internet, a wide area network, a local area network, a metropolitan area network, etc. may be used.
Bus 202 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be classified as address buses, data buses, control buses, etc. The memory 201 is configured to store a program, and the processor 200 executes the program after receiving an execution instruction, and the method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 200 or implemented by the processor 200.
The processor 200 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 200 or by instructions in the form of software. The processor 200 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed thereby. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in hardware, in a decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 201, and the processor 200 reads the information in the memory 201 and, in combination with its hardware, performs the steps of the above method.
The electronic device provided by the embodiment of the application and the method provided by the embodiment of the application are the same in the invention conception, and have the same beneficial effects as the method adopted, operated or realized by the electronic device.
The present application further provides a computer readable storage medium corresponding to the method provided in the foregoing embodiments, referring to fig. 6, the computer readable storage medium is shown as an optical disc 30, on which a computer program (i.e. a program product) is stored, where the computer program, when executed by a processor, performs the method provided in any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
The computer readable storage medium provided by the above-described embodiments of the present application has the same advantageous effects as the method adopted, operated or implemented by the application program stored therein, for the same inventive concept as the method provided by the embodiments of the present application.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may also be used with the teachings herein. The required structure for the construction of such devices is apparent from the description above. In addition, the present application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and the above description of specific languages is provided for disclosure of preferred embodiments of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the present application and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in the creation means of a virtual machine according to embodiments of the present application may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present application may also be embodied as an apparatus or device program (e.g., computer program and computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the specification and drawings of the present invention or direct/indirect application in other related technical fields are included in the scope of the present invention.

Claims (10)

1. A method of classifying documents, the method comprising:
determining a corresponding sorter group based on attributes of files to be sorted, wherein the attributes comprise pictures, documents and tables;
invoking the sorter group to sort the files to be sorted to obtain sorted files and unclassified files;
and calling a classification correction model to carry out secondary confirmation on the classified files to obtain classified files after secondary confirmation, and taking the classified files as file classification results.
2. The method of claim 1, wherein the sorter group includes a primary sorter, a secondary sorter, and a tertiary sorter;
the step of calling the sorter group to sort the files to be sorted to obtain sorted files and unclassified files, which comprises the following steps:
calling the first-level sorter to sort the files to be sorted, and classifying the files which are sorted successfully into sorted files;
invoking the secondary sorter to sort the files which are failed to be sorted in the primary sorter, and classifying the files which are successfully sorted into sorted files;
and calling the three-stage sorter to sort the files which are failed to be sorted in the two-stage sorter, classifying the files which are successfully sorted into sorted files, and classifying the files which are failed to be sorted into unclassified files.
3. The method of claim 2, wherein determining the corresponding sorter group based on the attributes of the documents to be sorted comprises:
according to the attribute of the file to be classified, the corresponding sorter group is an image classification model, and the image classification model is used for identifying objects or scenes in the image and classifying the objects or scenes;
according to the attribute of the file to be classified, the corresponding sorter group is a text classification model, and the text classification model is used for carrying out semantic analysis on the theme of the file and classifying the theme of the file;
And according to the attribute of the file to be classified, taking the corresponding sorter group as a statistical model, wherein the statistical model is used for carrying out probability estimation on the table data and classifying the table data.
4. A method according to claim 3, wherein the training process of each model in the sorter group comprises:
performing feature extraction on the data source samples aiming at the data source samples of each attribute to obtain a sample feature set;
marking the sample feature set to obtain a marked data set;
training based on the labeling data set to obtain a trained model.
5. The method of claim 1, wherein invoking the classification correction model to secondarily validate the classified file to obtain a secondarily validated classified file comprises:
calling a large language model to screen files which do not belong to the corresponding classification from the classified files;
and calling the large language model to classify the files which do not belong to the corresponding classification, so as to obtain classified files after secondary confirmation.
6. The method of claim 5, wherein said invoking the large language model to screen files from the classified files that do not belong to the respective classification comprises:
For any one of the classified files, invoking the large language model to calculate the distance between the text vector corresponding to the file and the center of the classified file;
acquiring a plurality of files in order of distance from nearest to farthest;
calling the large language model to obtain type definition according to the files;
traversing each file in the plurality of files to obtain files belonging to the type definition, and determining the rest files as files not belonging to the corresponding classification.
7. The method of claim 1, further comprising, prior to said determining the corresponding sorter group based on the attributes of the documents to be sorted:
and performing multi-mode processing and format alignment processing based on the attributes of the files to be classified.
8. A document classification system, the system comprising:
the sorter determination module is used for determining a corresponding sorter group based on the attribute of the file to be sorted, wherein the attribute comprises a picture, a document and a table;
the primary classification module is used for calling the sorter group to classify the files to be classified to obtain classified files and unclassified files;
and the secondary classification module is used for calling a classification correction model to carry out secondary confirmation on the classified files to obtain classified files after secondary confirmation, and the classified files are used as file classification results.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when running the computer program, implements the method according to any of claims 1-7.
10. A computer readable storage medium having stored thereon computer readable instructions executable by a processor to implement the method of any of claims 1-7.
CN202410138691.5A 2024-02-01 2024-02-01 File classification method, system, equipment and readable storage medium Pending CN117668333A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410138691.5A CN117668333A (en) 2024-02-01 2024-02-01 File classification method, system, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN117668333A true CN117668333A (en) 2024-03-08

Family

ID=90066420


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060056703A1 (en) * 2004-09-13 2006-03-16 Scimed Life Systems, Inc. Systems and methods for producing a dynamic classified image
CN101923561A (en) * 2010-05-24 2010-12-22 中国科学技术信息研究所 Automatic document classifying method
CN116416632A (en) * 2023-04-06 2023-07-11 深圳平安综合金融服务有限公司 Automatic file archiving method based on artificial intelligence and related equipment
CN116701303A (en) * 2023-07-06 2023-09-05 浙江档科信息技术有限公司 Electronic file classification method, system and readable storage medium based on deep learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination