CN113190502A - Archive management method based on deep learning

Info

Publication number: CN113190502A
Application number: CN202110489471.3A
Authority: CN
Inventors: 刘伊玲; 杨坤丽; 柯燕; 周琼凤; 吴冬梅
Original assignee: Information Center of Yunnan Power Grid Co Ltd
Current assignee: Information Center of Yunnan Power Grid Co Ltd
Priority date: 2021-01-26
Filing date: 2021-05-06
Publication date: 2021-07-30

Abstract

The invention belongs to the technical field of machine learning, and particularly relates to an archive management method based on deep learning.

Description

Archive management method based on deep learning

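Technical Field

The invention belongs to the technical field of machine learning, and particularly relates to an archive management method based on deep learning.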

Background

At present, archive management is not supported by a digital platform, and archive compliance inspection is still carried out manually. Because the number of archives is large, a great deal of manpower and time must be invested, and standardization and integrity cannot be well guaranteed. Unstructured data accounts for about 90% of archive data, and critical information and knowledge are usually hidden in this unstructured data.

Deep Learning (DL) is a new research direction in the field of machine learning. By learning the intrinsic rules and representation hierarchy of sample data, including unstructured data, it gives a terminal device the ability to analyze and learn, so that the device can recognize data such as characters, images and sounds. Deep learning is a complex machine learning algorithm, and its results in image recognition far exceed those of earlier related techniques. In the prior art, deep learning is widely applied in fields such as search technology, data mining, machine learning, machine translation and natural language processing.

In the aspect of image recognition, archive data in the power grid industry is generally processed by OCR, which scans the characters in an archive line by line to obtain unstructured data. Defects common in the scanned images and digital photos of archive data, such as distorted text lines, skew and noise, may reduce recognition quality, so that the recognition error rate is high.

Disclosure of Invention

In view of the problem that existing archive management is not supported by a digital platform, the invention aims to provide an archive management method based on deep learning.

The invention discloses a deep learning-based archive management method, which is characterized by comprising the following steps:

1) Collecting archives: first, a configuration file is established in a terminal; during filing, the configuration file is read and it is judged whether the maximum volume number has been reached; the next archive directory that has not reached the maximum volume number is searched for in sequence, or an archive directory is created automatically, and the filing operation continues to completion; paper archives are stored in a picture format in cooperation with external equipment including a scanner; the system creates a directory in the file server according to the archive-related information and uploads automatically; if the scanning page is not closed and scanning continues, the system automatically judges that a multi-page document is being uploaded; after uploading, the system automatically records the storage path of each file and the corresponding electronic archive number, and associates the electronic archive with the pictures, realizing acquisition and uploading of electronic archives.

2) Archive assembly: an electronic archive library is established so that each electronic archive corresponds to its physical archive; the system uses a deep learning model to verify the electronic archive against the integrity requirements in the compliance rules, and reads the archive information by a recognition method; characters in the archive are scanned by an OCR (optical character recognition) method to obtain unstructured data; operations such as noise reduction, binarization and data cleaning are performed on the data; the valuable data is input into a decision tree to construct a model; after the authenticity of the results is compared, the parameters are modified and the weights adjusted to finally obtain a check model; the check model confirms the archive information, which is then automatically ordered and arranged according to a set assembly template, realizing intelligent and automatic assembly of the archive and one-key assembly into volumes according to user requirements;

3) Searching and classifying archives: the intelligent one-key volume grouping function groups the archives that have passed inspection into volumes in an orderly and regular manner, realizing structured management; through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification and content association, the association between the searched content and the electronic archives is summarized, providing data support for subsequent intelligent retrieval and analysis; meanwhile, a big data intelligent analysis report is generated in real time.

4) Archive value appraisal: value appraisal is carried out on the electronic archive based on the technical advancement, functionality, typicality and time limit of the production project archive to obtain a total archive value score, which provides a reference for management decisions and daily work; a destruction standard score for archive value is set, and a destruction prompt and suggestion are provided for archives whose scores are below the destruction standard score; big data is used to analyze and mine value information as a whole, to display the association relations among projects in a targeted manner, and to count and analyze the economic and social benefits brought by the projects.
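The scoring in step 4) can be illustrated with a minimal sketch that combines the four appraisal criteria into a total score and flags archives that fall below a destruction standard score. The weights, the 0 to 100 scale, the threshold of 40 and the names `CRITERIA_WEIGHTS`, `DESTRUCTION_THRESHOLD` and `appraise` are assumptions made for illustration; the source does not specify how the total score is computed.

```python
from dataclasses import dataclass

# Hypothetical weights for the four criteria named in step 4); they sum to 1.
CRITERIA_WEIGHTS = {
    "technical_advancement": 0.3,
    "functionality": 0.3,
    "typicality": 0.2,
    "time_limit": 0.2,
}

# Hypothetical destruction standard score on a 0-100 scale.
DESTRUCTION_THRESHOLD = 40.0


@dataclass
class Appraisal:
    archive_no: str
    total_score: float
    suggest_destruction: bool


def appraise(archive_no, scores, threshold=DESTRUCTION_THRESHOLD):
    """Combine per-criterion scores (0-100) into a weighted total score."""
    missing = set(CRITERIA_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total = sum(CRITERIA_WEIGHTS[name] * scores[name] for name in CRITERIA_WEIGHTS)
    # An archive scoring below the threshold only triggers a suggestion;
    # the decision to destroy remains with the archive manager.
    return Appraisal(archive_no, total, total < threshold)
```

A weighted sum is only one possible scoring rule; it is used here because it makes the relation between the criteria and the destruction prompt easy to follow.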

The recognition method in step 2) further comprises a picture recognition method based on deep learning; for picture files in the electronic archive, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archive by means of search-by-image, and feature extraction is carried out on the content of the picture files.

Further, the search-by-image is realized by the following steps:

1) Extracting image features of the archive photos: in order to accurately extract the image features of each archive photo, the size of each photo must first be preprocessed uniformly; Keras loads a pre-trained VGG16 model to extract the feature information of each image at a fully connected layer; the extracted information of each image is represented as a vector; the image features of the plurality of pictures corresponding to each photo are aggregated into a vector set, and the vector set is merged into the archive photo image feature index database.

2) Extracting image features of the picture to be retrieved: to find the photos in the photo archive that match the content of the picture to be retrieved, the key is to sample accurately the feature information represented by that picture; the picture is scaled and cropped with the ImageDataGenerator tool, and after it has been processed into a plurality of pictures of size 224 × 224, an image feature information vector set is extracted.

3) Comparing the image features and returning a retrieval result: the image feature information vector set of the picture to be retrieved is compared with the information in the archive photo image feature index database by using a locality sensitive hashing algorithm; the greater the similarity, the more similar the content of the two pictures. Apache Tomcat is deployed to host the search-by-image Web retrieval application; the front end uploads the picture to be retrieved and returns the retrieval result, and the back-end system uses the Keras deep learning framework to extract features from the images in the photo archive one by one to obtain the image feature index library. In order to retrieve a set of pictures similar to the query picture, the locality sensitive hashing algorithm is used to construct an index over the feature vector set extracted from all archive photos, and comparing this index with the features of the picture to be retrieved speeds up the search.
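The comparison in step 3) can be sketched with a small locality sensitive hashing index. The source does not state which LSH family is used; the sketch below assumes random-hyperplane hashing for cosine similarity, uses a single hash table, and works on short toy vectors rather than real VGG16 feature vectors. The class name `HyperplaneLSH` and its parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HyperplaneLSH:
    """Random-hyperplane LSH: vectors with the same sign pattern share a bucket."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_planes)]
        self.buckets = defaultdict(list)

    def _key(self, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in self.planes
        )

    def add(self, photo_id, vec):
        self.buckets[self._key(vec)].append((photo_id, vec))

    def query(self, vec, top_k=5):
        # Only the vectors in the matching bucket are compared exactly,
        # which is what makes the search faster than a full scan.
        candidates = self.buckets.get(self._key(vec), [])
        scored = [(pid, cosine_similarity(vec, v)) for pid, v in candidates]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]
```

With a single table, a similar photo that happens to fall on the other side of one hyperplane is missed; practical indexes usually query several independently seeded tables and merge the candidates.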

The invention carries out intelligent archive management based on deep learning technology, improves the utilization of archive resources and the level of intelligent archive management, reduces management cost, and ensures the integrity, authenticity, reliability and usability of scientific and technical archives.

Detailed Description

Example 1: the archive management method is implemented as follows.

Collecting archives: first, a configuration file is established in a terminal. During filing, the configuration file is read and it is judged whether the maximum volume number has been reached; the next archive directory that has not reached the maximum volume number is searched for in sequence, or an archive directory is created automatically, and the filing operation continues to completion. Paper archives are stored in a picture format in cooperation with external equipment including a scanner. OCR character extraction is first performed on the project information in the project text data; the resulting unstructured or semi-structured data undergoes regular preprocessing and noise reduction preprocessing and is then input into a decision tree, where data analysis is carried out according to the specific business logic to obtain an analysis result; after manual intervention and comparison of the results, the parameters and weights are adjusted. A large amount of data is then input into an artificial neural network for training on the data set, with part of the data set held out as a test data set; once testing shows that the deviation of the results is within the error range, the model is formed. The system creates a directory in the file server according to the archive-related information and uploads automatically; if the scanning page is not closed and scanning continues, the system automatically judges that a multi-page document is being uploaded. After uploading, the system automatically records the storage path of each file and the corresponding electronic archive number, and associates the electronic archive with the pictures, realizing acquisition and uploading of electronic archives.
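The directory selection at the start of this step can be sketched as follows. The sketch assumes that the configuration file records, for each archive directory, how many volumes it already holds, and that the maximum volume number is a per-directory limit; the source does not give the configuration format, so `ArchiveConfig`, `next_directory`, `file_volume` and the directory naming scheme are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveConfig:
    max_volumes: int  # assumed meaning of the "maximum volume number"
    volume_counts: dict = field(default_factory=dict)  # directory -> volumes filed


def next_directory(config, prefix="archive"):
    """Return the first directory below the limit, creating one if all are full."""
    for name in sorted(config.volume_counts):
        if config.volume_counts[name] < config.max_volumes:
            return name
    # Every existing directory is full (or none exists): create the next one.
    new_name = f"{prefix}_{len(config.volume_counts) + 1:04d}"
    config.volume_counts[new_name] = 0
    return new_name


def file_volume(config):
    """File one volume and return the directory that received it."""
    directory = next_directory(config)
    config.volume_counts[directory] += 1
    return directory
```

For example, with `ArchiveConfig(max_volumes=2)`, three successive calls to `file_volume` place two volumes in `archive_0001` and the third in a newly created `archive_0002`.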

Archive assembly: an electronic archive library is established so that each electronic archive corresponds to its physical archive; a deep learning model is used to verify the integrity of the electronic archive according to the compliance rules, and the system reads the archive information by a recognition method, which includes but is not limited to OCR character recognition; for picture files in the electronic archive, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archive by means of search-by-image, and feature extraction is carried out on the content of the picture files. The recognized archive information is automatically ordered and arranged according to the set assembly template, realizing intelligent and automatic assembly of archives and improving the efficiency of assembly work; volumes are exported and bound as required, reducing assembly and binding costs.
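The automatic ordering by assembly template, together with a simple completeness check, can be sketched as below. The template contents, the document type labels and the function name `assemble` are hypothetical; the sketch assumes that the recognition step has already assigned a document type and page number to each scanned picture.

```python
# Hypothetical assembly template: document types in binding order.
ASSEMBLY_TEMPLATE = ["cover", "catalog", "approval", "contract", "acceptance_report"]


def assemble(documents, template=ASSEMBLY_TEMPLATE):
    """Order recognized documents by template and report missing types.

    `documents` is a list of (doc_type, page_no, path) tuples; doc_type is
    assumed to come from the recognition step.
    """
    rank = {doc_type: i for i, doc_type in enumerate(template)}
    # Documents whose type is not in the template are left out of the volume.
    known = [d for d in documents if d[0] in rank]
    ordered = sorted(known, key=lambda d: (rank[d[0]], d[1]))
    present = {d[0] for d in known}
    missing = [t for t in template if t not in present]
    return ordered, missing
```

The list of missing types is one simple way to surface an integrity problem before a volume is exported and bound.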

Searching and classifying archives: through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification and content association, the association between the searched content and the electronic archives is summarized, providing data support for subsequent intelligent retrieval and analysis; meanwhile, a big data intelligent analysis report is generated in real time.
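Keyword retrieval over the recognized archive text can be illustrated with a minimal inverted index. This is a sketch under stated assumptions: the tokenizer handles only ASCII letters and digits (Chinese text would need a word segmenter), a query matches only archives containing every keyword, and the names `tokenize`, `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(archives):
    """Map each keyword to the set of archive numbers whose text contains it."""
    index = defaultdict(set)
    for archive_no, text in archives.items():
        for token in tokenize(text):
            index[token].add(archive_no)
    return index


def search(index, query):
    """Return the archive numbers containing every keyword in the query, sorted."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)
```

The keyword-to-archive mapping built here is the kind of association between searched content and electronic archives that later retrieval and analysis can reuse.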

Archive value appraisal: value appraisal is carried out on the electronic archive based on the technical advancement, functionality, typicality and time limit of the production project archive to obtain a total archive value score, which provides a reference for management decisions and daily work; a destruction standard score for archive value is set, and a destruction reminder and suggestion are provided for archives whose scores are below the destruction standard score.

Claims

1. An archive management method based on deep learning, characterized by being realized through the following steps:

(1) collecting archives: first, a configuration file is established in a terminal; during filing, the configuration file is read and it is judged whether the maximum volume number has been reached; the next archive directory that has not reached the maximum volume number is searched for in sequence, or an archive directory is created automatically, and the filing operation continues to completion; paper archives are stored in a picture format in cooperation with external equipment including a scanner; the system creates a directory in the file server according to the archive-related information and uploads automatically; if the scanning page is not closed and scanning continues, the system automatically judges that a multi-page document is being uploaded; after uploading, the system automatically records the storage path of each file and the corresponding electronic archive number, and associates the electronic archive with the pictures, realizing acquisition and uploading of electronic archives;

(2) archive assembly: an electronic archive library is established so that each electronic archive corresponds to its physical archive; the system uses a deep learning model to verify the electronic archive against the integrity requirements in the compliance rules, and reads the archive information by a recognition method; characters in the archive are scanned by an OCR (optical character recognition) method to obtain unstructured data; operations such as noise reduction, binarization and data cleaning are performed on the data; the valuable data is input into a decision tree to construct a model; after the authenticity of the results is compared, the parameters are modified and the weights adjusted to finally obtain a check model; the check model confirms the archive information, which is then automatically ordered and arranged according to a set assembly template, realizing intelligent and automatic assembly of the archive and one-key assembly into volumes according to user requirements;

(3) searching and classifying archives: the intelligent one-key volume grouping function groups the archives that have passed inspection into volumes in an orderly and regular manner, realizing structured management; through keyword retrieval, intelligent analysis of content abstracts, attribute extraction, tag identification and content association, the association between the searched content and the electronic archives is summarized, providing data support for subsequent intelligent retrieval and analysis; meanwhile, a big data intelligent analysis report is generated in real time;

(4) archive value appraisal: value appraisal is carried out on the electronic archive based on the technical advancement, functionality, typicality and time limit of the production project archive to obtain a total archive value score, which provides a reference for management decisions and daily work; a destruction standard score for archive value is set, and a destruction prompt and suggestion are provided for archives whose scores are below the destruction standard score; big data is used to analyze and mine value information as a whole, to display the association relations among projects in a targeted manner, and to count and analyze the economic and social benefits brought by the projects.

2. The archive management method based on deep learning of claim 1, wherein the recognition method of step (2) further comprises a picture recognition method based on deep learning; for picture files in the electronic archive, the Keras deep learning framework and the VGG16 network model are used to recognize the content of the picture archive by means of search-by-image, and feature extraction is carried out on the content of the picture files, wherein the search-by-image is realized through the following steps:

(1) extracting image features of the archive photos: in order to accurately extract the image features of each archive photo, the size of each photo must first be preprocessed uniformly; Keras loads a pre-trained VGG16 model to extract the feature information of each image at a fully connected layer; the extracted information of each image is represented as a vector; the image features of the plurality of pictures corresponding to each photo are aggregated into a vector set, and the vector set is merged into the archive photo image feature index database;

(2) extracting image features of the picture to be retrieved: to find the photos in the photo archive that match the content of the picture to be retrieved, the key is to sample accurately the feature information represented by that picture; the picture is scaled and cropped with the ImageDataGenerator tool, and after it has been processed into a plurality of pictures of size 224 × 224, an image feature information vector set is extracted;

(3) comparing the image features and returning a retrieval result: the image feature information vector set of the picture to be retrieved is compared with the information in the archive photo image feature index database by using a locality sensitive hashing algorithm; the greater the similarity, the more similar the content of the two pictures.