CN113254634A

CN113254634A - File classification method and system based on phase space

Info

Publication number: CN113254634A
Application number: CN202110153675.XA
Authority: CN
Inventors: 苏卫卫; 黄瑞; 衣秀; 张�成; 黄军阳
Original assignee: Tianjin Delta Technology Co ltd
Current assignee: Tianjin Delta Technology Co ltd
Priority date: 2021-02-04
Filing date: 2021-02-04
Publication date: 2021-08-13

Abstract

The invention provides an archive (file) classification method and system based on a phase space. The method comprises the following steps: reading the archive content using text parsing and OCR; automatically extracting archive keywords using a keyword extraction technique; extracting features from the archive text with word2vec to construct a text vector that accounts for both the global vector weight of the text and the weight of its keywords; compressing the archive data using clustering; building an archive classification model from the archive content using support vector machine text classification, evaluating the model on test data, and optimizing it according to the test results; and applying the model to classify archive data of unknown category. The invention addresses the technical problem that traditional archive management technology cannot comprehensively analyze the unstructured and semi-structured data of various archive texts, and it greatly reduces manual effort.

Description

File classification method and system based on phase space

Technical Field

The invention belongs to the technical field of file classification management, and particularly relates to a file classification method and system based on a phase space.

Background

Archival work is an indispensable part of many social undertakings, and informatization has had a great influence on it. Managing archive documents intelligently with text analysis technology makes it possible to build an intelligent, networked service platform and a complete intelligent archive application system that provides the archive information services society needs quickly and conveniently. Such a platform covers intelligent collection, management, service, protection and supervision of archives, and realizes integration based on electronic documents and warehouse-style management of business data.

As production and operations continue to scale up, large scientific research institutions and intellectual property institutions in China hold knowledge in the form of treatises, survey reports, historical documents, academic monographs and the like. This knowledge has the characteristics of big data: first, it is large in scale, ranging from the TB level to the PB level; second, it is complex in form, including plain text, XML files, Office documents, images, audio and video. In particular, older archives often exist only on paper with no electronic version and are not fully intact after long-term storage, so OCR results after scanning are unsatisfactory, which directly affects the processing of such archives.

Archives are enormous in both variety and volume, so classifying them is very important: accurate classification makes archives easier to manage and use. Manual classification is time-consuming, and different people may interpret the classification standards differently, so results can vary, which directly affects classification accuracy. Text classification technology uses machine learning to learn classification rules from data of known category: it extracts key features that reflect the characteristics of a text, captures the mapping between those features and the categories, and applies that mapping to data of unknown category.

The core idea of archive text classification is to segment the archive text into words, vectorize it, and then build a model with a data mining method. Retaining more information requires retaining more words, which inevitably increases the number of fields. The support vector machine method introduces the idea of structural risk minimization: it searches for the support vectors on the classification boundary and builds the model using only those support vectors. This design means that a support vector machine can obtain a better prediction model than other methods even with fewer data samples, and that the model generalizes better.

Therefore, an archive classification method and system based on a phase space is urgently needed, in which text parsing is used to read archive content from electronic documents, word segmentation is used to segment the archive text, keywords are extracted automatically, word2vec is used to vectorize the archive text with both the overall text weights and the keyword weights taken into account, clustering is used to compress the archive text data, and a support vector machine classification method is then used to build an archive classification model and classify the archives.

Disclosure of Invention

In order to solve the above technical problems, the present invention provides an archive classification method and system based on a phase space, wherein the archive classification method comprises the following steps:

step S1: reading the archive content using text parsing and OCR;

step S2: automatically extracting the archive keywords using a keyword extraction technique;

step S3: extracting features from the archive text with word2vec and constructing a text vector that accounts for both the global vector weight of the text and the keyword weight of the text;

step S4: compressing the archive data using clustering;

step S5: building an archive classification model from the archive content using support vector machine text classification, evaluating the model with test data, and optimizing the model according to the test results; and applying the archive classification model to classify archive data of unknown category.

Preferably, the step S1 includes the steps of:

step S11: for a common electronic document, directly reading the file content by adopting a text analysis technology;

step S12: for scanned copies and image files of paper archives, recognizing the content of the image using OCR.
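The detailed description names CRNN + CTC for text recognition in the OCR branch of step S12. As a minimal sketch of how CTC output can be turned into text at inference time, the following shows best-path (greedy) decoding: take the top class for each frame, merge consecutive repeats, and drop the blank symbol. The blank index, the three-symbol alphabet and the per-frame scores are illustrative assumptions rather than values from the invention, and the recognizer network itself is not shown.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # 1. pick the highest-scoring class index for every time step
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2. merge runs of identical consecutive indices into one
    collapsed = [index for index, _ in groupby(best_path)]
    # 3. remove blanks and map the remaining indices to characters
    return "".join(alphabet[index] for index in collapsed if index != blank)


# Hypothetical scores over the classes [blank, "a", "b"] for five frames.
alphabet = ["", "a", "b"]
frames = [
    [0.10, 0.80, 0.10],  # "a"
    [0.10, 0.70, 0.20],  # "a" again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank, separates the two "a" characters
    [0.20, 0.70, 0.10],  # "a"
    [0.10, 0.10, 0.80],  # "b"
]
decoded = ctc_greedy_decode(frames, alphabet)
print(decoded)
```

The blank frame is what lets a repeated character survive the merge step, which is how CTC avoids needing a frame-by-frame alignment between the image and its transcript.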

Preferably, the step S2 includes the steps of:

step S21: performing word segmentation on the file by adopting a word segmentation technology;

step S22: automatically extracting the archive keywords using a keyword extraction technique, for use in constructing the text vector.
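Steps S21 and S22 can be illustrated with a small sketch. The invention does not name a particular keyword extraction technique, so TF-IDF is used here purely as an assumed stand-in; the whitespace-based `segment` function, the stop-word list and the sample documents are likewise hypothetical (Chinese archive text would need a real word segmenter).

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "for"}  # hypothetical stop-word list


def segment(text):
    """Toy word segmentation: lowercase, split on whitespace, drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def extract_keywords(docs, top_k=2):
    """Return the top_k words of each document ranked by a smoothed TF-IDF score."""
    tokenized = [segment(doc) for doc in docs]
    n_docs = len(tokenized)
    # number of documents in which each word occurs
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    keywords = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            word: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in term_freq.items()
        }
        # highest score first; ties are broken alphabetically for determinism
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        keywords.append(ranked[:top_k])
    return keywords


docs = [
    "personnel record of the employee contract and employee salary",
    "construction drawing for the bridge and bridge survey report",
    "contract for the survey of the construction site",
]
keywords = extract_keywords(docs)
print(keywords)
```

Words that are frequent in one document but rare across the collection score highest, which is one common way to pick the terms that later receive extra weight in the text vector.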

Preferably, the step S3 includes the following steps:

step S31: for archive data of known category, segmenting the text into words and performing 0-1 vectorization;

step S32: vectorizing the words using word2vec, taking both the global information of the text and the keyword weight information into account.
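The two sub-steps can be sketched as follows. The exact rule for combining the global part and the keyword part is not spelled out, so the linear blend controlled by `keyword_weight` is an assumption, and the two-dimensional `embeddings` table is a hypothetical stand-in for vectors that would come from a trained word2vec model.

```python
def binary_vector(tokens, vocabulary):
    """S31: 0-1 vectorization, 1 if the vocabulary word occurs in the text."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def mean_vector(words, embeddings):
    """Average the embeddings of the given words; unknown words are skipped."""
    vectors = [embeddings[word] for word in words if word in embeddings]
    if not vectors:
        dim = len(next(iter(embeddings.values())))
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def text_vector(tokens, keywords, embeddings, keyword_weight=0.5):
    """S32: blend the all-words average with the keyword average (assumed rule)."""
    global_part = mean_vector(tokens, embeddings)
    keyword_part = mean_vector(keywords, embeddings)
    return [(1 - keyword_weight) * g + keyword_weight * k
            for g, k in zip(global_part, keyword_part)]


# Hypothetical 2-dimensional word vectors standing in for word2vec output.
embeddings = {
    "contract": [1.0, 0.0],
    "salary": [0.0, 1.0],
    "employee": [1.0, 1.0],
}
tokens = ["employee", "contract", "employee", "salary"]
keywords = ["employee"]

bits = binary_vector(tokens, ["contract", "bridge", "employee", "salary"])
vector = text_vector(tokens, keywords, embeddings)
print(bits, vector)
```

Setting `keyword_weight` to 0 reduces the result to the plain average over all words, while larger values pull the text vector toward the keywords.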

Preferably, the step S4 includes the steps of:

step S41: constructing a clustering feature tree according to the similarity;

step S42: extracting modeling data from the clustering feature tree.

Preferably, the step S5 includes the steps of:

step S51: dividing a data set into a training set and a testing set;

step S52: establishing a file classification model by using a training set and a support vector machine method based on data compression;

step S53: testing the classification model by using the test set, and optimizing the model according to the test result;

step S54: applying the archive classification model to classify archive data of unknown category.
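Steps S51 to S54 can be sketched end to end on toy data. This is a simplified illustration under several assumptions: a binary linear support vector machine trained with Pegasos-style subgradient descent replaces whatever solver an actual implementation would use, there is no bias term (so the classes must be separable by a hyperplane through the origin), and the two-dimensional samples and the labels +1 and -1 are made up.

```python
import random


def split_dataset(samples, labels, test_ratio=0.25, seed=0):
    """S51: shuffle the indices and hold out a share of the data for testing."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    n_test = max(1, int(len(order) * test_ratio))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx])


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """S52: Pegasos-style subgradient descent on the regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - 1.0 / t) * wj for wj in w]  # shrink by 1 - eta * lam
            if margin < 1:  # the sample violates the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def predict(w, x):
    """S54: assign the class on whose side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


def accuracy(w, X, y):
    """S53: share of samples classified correctly."""
    return sum(predict(w, x) == label for x, label in zip(X, y)) / len(X)


# Hypothetical text vectors for two archive categories (+1 and -1).
X = [[2.0, 1.5], [1.8, 2.2], [2.5, 2.0], [1.5, 1.9],
     [-2.0, -1.5], [-1.7, -2.3], [-2.4, -1.9], [-1.6, -2.1]]
y = [1, 1, 1, 1, -1, -1, -1, -1]

X_train, y_train, X_test, y_test = split_dataset(X, y)
w = train_linear_svm(X_train, y_train)
test_accuracy = accuracy(w, X_test, y_test)
print(test_accuracy)
```

In this sketch, "optimizing the model according to the test result" would amount to retraining with a different `lam` or `epochs` when the held-out accuracy is unsatisfactory; more than two archive categories would need a multi-class scheme such as one-vs-rest.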

Preferably, the archive classification system comprises an archive data acquisition module, an archive data extraction module, an archive classification modeling module, an archive classification model evaluation module and an archive classification model using module. The archive data acquisition module collects archive data and reads archive content from electronic documents. The archive data extraction module segments the archive data into words and extracts keywords. The archive classification modeling module classifies the archive data: words are vectorized using word2vec, the weights of all words of a single archive document and the weights of its keywords are both considered, the archive data are compressed using clustering, and a classification model is built using a support vector machine. The archive classification model evaluation module evaluates the classification model with test data and optimizes the model according to the evaluation result. The archive classification model using module uses the established model to determine the category of unclassified data and stores the classification result.

Compared with the prior art, the invention has the following beneficial effects. It can read ordinary electronic documents and can also read image data by means of OCR. It considers the weights of all words of an archive while also emphasizing the weights of the keywords, so the information captured is more comprehensive. It compresses the data by clustering, which takes the generality of the data into account while retaining the characteristics of the main data, so the model generalizes better. The invention thereby addresses the technical problem that traditional archive management technology cannot comprehensively analyze the unstructured and semi-structured data of various archive texts, and it greatly reduces manual effort.

Drawings

FIG. 1 is a schematic diagram of the system of the present invention;

FIG. 2 is an overall flow diagram of the present invention;

FIG. 3 is a data processing flow diagram of the data compression link of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

Example:

An archive classification method and system based on a phase space. As shown in FIG. 1, the archive classification system comprises an archive data acquisition module, an archive data extraction module, an archive classification modeling module, an archive classification model evaluation module and an archive classification model using module. The archive data acquisition module collects archive data and reads archive content from electronic documents. The archive data extraction module segments the archive data into words and extracts keywords. The archive classification modeling module classifies the archive data: words are vectorized using word2vec, the weights of all words of a single archive document and the weights of its keywords are both considered, the archive data are compressed using clustering, and a classification model is built using a support vector machine. The archive classification model evaluation module evaluates the classification model with test data and optimizes the model according to the evaluation result. The archive classification model using module uses the established model to determine the category of unclassified data and stores the classification result.

As shown in FIG. 2, the archive classification method includes the following steps:

step S1: collecting archive data, and reading the archive content using text parsing;

step S12: for scanned copies and image files of paper archives, recognizing the content of the image using OCR;

the OCR process comprises image preprocessing, text detection and text recognition; image preprocessing uses a CNN-based neural network for feature extraction; text detection marks all text positions in the image with bounding boxes; text recognition uses the CRNN + CTC algorithm, in which a CNN first extracts convolutional features from the image, an LSTM then extracts sequence features from those convolutional features, and CTC is finally introduced to solve the problem that characters cannot be aligned during training;

step S2: preprocessing archive data, comprising the following steps:

step S21: segmenting the archive text that has been read into words, and removing stop words;

step S22: extracting archive keywords from the archive text using a keyword extraction technique;

step S3: performing feature extraction on the archive text, wherein the word2vec method is used to construct a text vector, comprising the following steps:

step S31: for archive data of known category, performing 0-1 vectorization on all words of a text;

step S32: taking a weighted average using the word2vec weights of all words, extracting the word2vec weights of the keywords of the text, and combining the two parts, so that the weights of all words of a single text are considered and the integrity of the text information is preserved, while the weight information of the keywords is also highlighted;

step S4: as shown in FIG. 3, compressing the archive data using clustering, comprising the following steps:

step S41: starting the traversal from the root node of the clustering feature tree;

step S42: if the current node is a leaf node, going to step S43, otherwise going to step S46;

step S43: finding the child node of the current node that is closest to the data record being inserted, and calculating the cluster diameter that would result from merging the record with the data of that child node; if the cluster diameter is smaller than a threshold, going to step S44, otherwise going to step S45;

step S44: merging the data record into the nearest child node;

step S45: adding the data record as a new child node of the current node; if the number of child nodes of the current node then exceeds a given threshold, splitting the current node into two nodes, for which the two child nodes farthest apart may be selected as the initial nodes and each remaining child node is assigned to the nearer of the two;

step S46: finding the child node of the current node that is closest to the data record, taking that child node as the new current node, and going to step S42.

Newly added data can be inserted into the existing clustering feature tree, so the tree does not need to be rebuilt from all the data.

Modeling data is then extracted from the clustering feature tree. Because the support vector machine is a modeling method based on the principle of structural risk minimization, it builds its model by finding the support vectors that form the classification hyperplane. On this basis, the boundary of each cluster of data under the leaf nodes of the clustering feature tree can be calculated, and the boundary points most likely to become support vectors are taken as the modeling data for the support vector machine, thereby achieving data compression.

In the present embodiment, the specific boundary calculation method is described by the following example:

assume that a cluster of data contains the records (-5, -4, -2), (-4, -6, -7), (-3, -2, 0), (-2, -1, 1), (-1, 0, 2), (0, 1, 3), (1, 2, 4), (2, 3, 5), (3, 4, 6), (4, 5, 7), (5, 9, 8), (6, 7, 9); the two largest and the two smallest points (Top2) are then taken in each dimension:

in the 1st dimension the maximum points are (6, 7, 9), (5, 9, 8) and the minimum points are (-5, -4, -2), (-4, -6, -7); in the 2nd dimension the maximum points are (5, 9, 8), (6, 7, 9) and the minimum points are (-4, -6, -7), (-5, -4, -2); in the 3rd dimension the maximum points are (6, 7, 9), (5, 9, 8) and the minimum points are (-4, -6, -7), (-5, -4, -2).

Finally, the selected extreme points are the union of the distinct extreme points above, namely (6, 7, 9), (5, 9, 8), (-5, -4, -2) and (-4, -6, -7), so that 4 records are obtained in total;
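The boundary calculation in this example can be sketched as a short function that, for each dimension, keeps the records with the two smallest and the two largest values and returns the union. The function name `boundary_points` and the `top` parameter are hypothetical; the `cluster` list reproduces the twelve records as they are listed at the start of the example.

```python
def boundary_points(records, top=2):
    """Union of the `top` smallest and `top` largest records in every dimension."""
    selected = set()
    for dim in range(len(records[0])):
        ordered = sorted(records, key=lambda record: record[dim])
        selected.update(ordered[:top])   # smallest values in this dimension
        selected.update(ordered[-top:])  # largest values in this dimension
    return selected


cluster = [
    (-5, -4, -2), (-4, -6, -7), (-3, -2, 0), (-2, -1, 1),
    (-1, 0, 2), (0, 1, 3), (1, 2, 4), (2, 3, 5),
    (3, 4, 6), (4, 5, 7), (5, 9, 8), (6, 7, 9),
]
boundary = boundary_points(cluster)
print(sorted(boundary))
```

Only the returned records would be passed on as modeling data for the support vector machine; the interior records of the cluster are dropped, which is where the compression comes from.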

step S5: building an archive classification model using the support vector machine method; evaluating the model with test data and optimizing it according to the test results; and applying the archive classification model to classify archive data of unknown category.

The invention can read ordinary electronic documents and can also read image data by means of OCR. It considers the weights of all words of an archive while also emphasizing the weights of the keywords, so the information captured is more comprehensive. It compresses the data by clustering, which takes the generality of the data into account while retaining the characteristics of the main data, so the model generalizes better. The invention thereby addresses the technical problem that traditional archive management technology cannot comprehensively analyze the unstructured and semi-structured data of various archive texts, and it greatly reduces manual effort.

The technical solutions of the present invention or similar technical solutions designed by those skilled in the art based on the teachings of the technical solutions of the present invention are all within the scope of the present invention.

Claims

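1. An archive classification method and system based on a phase space, characterized in that the archive classification method comprises the following steps:

step S1: reading the archive content using text parsing and OCR;

step S2: automatically extracting the archive keywords using a keyword extraction technique;

step S3: extracting features from the archive text with word2vec and constructing a text vector that accounts for both the global vector weight of the text and the keyword weight of the text;

step S4: compressing the archive data using clustering;

step S5: building an archive classification model from the archive content using support vector machine text classification, evaluating the model with test data, and optimizing the model according to the test results; and applying the archive classification model to classify archive data of unknown category.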

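2. The method and system for classifying files according to claim 1, wherein said step S1 comprises the steps of:

step S11: for an ordinary electronic document, reading the archive content directly using text parsing;

step S12: for scanned copies and image files of paper archives, recognizing the content of the image using OCR.

3. The method and system for classifying files according to claim 1, wherein said step S2 comprises the steps of:

step S21: segmenting the archive text into words using a word segmentation technique;

step S22: automatically extracting the archive keywords using a keyword extraction technique, for use in constructing the text vector.

4. The method and system for classifying files according to claim 1, wherein said step S3 comprises the steps of:

step S31: for archive data of known category, segmenting the text into words and performing 0-1 vectorization;

step S32: vectorizing the words using word2vec, taking both the global information of the text and the keyword weight information into account.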

5. The method and system for classifying files according to claim 1, wherein said step S4 comprises the steps of:

step S41: constructing a clustering feature tree according to the similarity;

step S42: extracting modeling data from the clustering feature tree.

6. The method and system for classifying files according to claim 1, wherein said step S5 comprises the steps of:

step S51: dividing a data set into a training set and a testing set;

step S52: building an archive classification model using the training set and a support vector machine method based on data compression;

step S53: testing the classification model using the testing set, and optimizing the model according to the test result;

step S54: applying the archive classification model to classify archive data of unknown category.

7. The method and system for classifying files based on phase space according to claim 1, wherein the archive classification system comprises an archive data acquisition module, an archive data extraction module, an archive classification modeling module, an archive classification model evaluation module and an archive classification model using module; the archive data acquisition module collects archive data and reads archive content from electronic documents; the archive data extraction module segments the archive data into words and extracts keywords; the archive classification modeling module classifies the archive data, wherein words are vectorized using word2vec, the weights of all words of a single archive document and the weights of its keywords are both considered, the archive data are compressed using clustering, and a classification model is built using a support vector machine; the archive classification model evaluation module evaluates the classification model with test data and optimizes the model according to the evaluation result; and the archive classification model using module uses the established model to determine the category of unclassified data and stores the classification result.