CN117556112B - Intelligent management system for electronic archive information

Info

Publication number: CN117556112B
Application number: CN202410039161.5A
Authority: CN
Inventors: 许潇文; 冯蕾; 杨锋; 杨正军; 宋林霖; 满鑫; 赵阳阳; 李莹; 刘霞; 李亚
Original assignee: China National Institute of Standardization
Current assignee: China National Institute of Standardization
Priority date: 2024-01-11
Filing date: 2024-01-11
Publication date: 2024-04-16
Anticipated expiration: 2044-01-11
Also published as: CN117556112A

Abstract

The invention discloses an electronic archive information intelligent management system, which particularly relates to the technical field of archive information management, and comprises an archive classification module, an archive identification module, a manual identification module, an archive reorganization module, an archive encryption module, a retrieval abstract generation module, a circulation file library, an electronic archive library and a bulk file database.

Description

Intelligent management system for electronic archive information

Technical Field

The invention relates to the technical field of archive information management, in particular to an intelligent electronic archive information management system.

Background

An intelligent electronic archive information management system is a digital archive management system. An electronic archive is an electronic file that has evidentiary, reference and preservation value; it is the digital form of the various information records that state organs, social organizations or individuals create, handle, transmit and store on electronic equipment such as computers while fulfilling legal duties or handling affairs. An electronic file consists of content, structure and background.

The prior art has the following defects: in current intelligent electronic archive information management systems, an experienced archive worker splits the documents that define the standard archiving scope into a number of keywords to form a filtering rule, and files are matched against these keywords. A match is considered to pass if the archive title contains the corresponding keywords, and the match-passed flow is triggered only when the title contains all of the keywords. Unstructured documents, however, have no clear title format; such a document may be nothing more than a plain text file. In this case, the system may wrongly exclude the document from the scope to be archived even though its content meets the archiving criteria. As time goes on and the number of processed files increases, such omissions keep accumulating in the system, which leads to dissatisfaction among its users.

In order to solve the above-mentioned defect, a technical scheme is provided.

Disclosure of Invention

In order to overcome the above-mentioned drawbacks of the prior art, an embodiment of the present invention provides an electronic archive information intelligent management system to solve the above-mentioned problems set forth in the background art.

In order to achieve the above purpose, the present invention provides the following technical solutions:

the electronic archive information intelligent management system comprises an archive classification module, an archive identification module, a manual identification module, an archive reorganization module, an archive encryption module, a retrieval abstract generation module, a circulation file library, an electronic archive library and a bulk file database, wherein the modules are in signal connection with one another;

the circulation file library is used for receiving the electronic files transmitted by each department;

the archive classification module is used for dividing the electronic files transmitted by the circulation file library into structural documents and unstructured documents according to a preset division standard, and transmitting the classified electronic files to the archive identification module;

the archive identification module is used for judging, with a different identification mode for each class, whether a classified electronic file needs to be archived, transmitting the electronic files that need to be archived to the archive reorganization module, and transmitting the electronic files that do not need to be archived to the bulk file database;

the manual identification module is used for performing secondary identification on unstructured documents that failed the first archiving identification: a staff member looks up keywords, establishes a topic, and judges whether archiving is needed;

the archive reorganization module is used for reorganizing the electronic files that have been identified as needing to be archived, and transmitting the reorganized electronic files to the archive encryption module;

the archive encryption module is used for encrypting the reorganized electronic files, ensuring that the text of an electronic file cannot be obtained even if the file is intercepted on the network while being transmitted for borrowing, and transmitting the encrypted electronic files to the retrieval abstract generation module;

the retrieval abstract generation module is used for generating, based on keywords, an abstract of each encrypted electronic file so that archive workers can quickly grasp the main content of the file, and transmitting the electronic files with generated abstracts to the electronic archive library;

the electronic archive library is used for storing the electronic files that need to be archived and for lending them out for circulation;

the bulk file database is used for storing the electronic files that do not need to be archived and for cleaning them up regularly.

In a preferred embodiment, the archive identification module judges whether a classified electronic file needs to be archived by using a different identification mode for each class, which comprises the following steps:

structural document:

manually splitting keywords: for a structural document, firstly, a standard archive scope related file is split into a plurality of archive scope keywords by an experienced archive worker;

adding an identification rule: matching is carried out against the archive scope keywords; a match is considered to pass if the archive title contains the corresponding archive scope keywords, and the match-passed flow is triggered only when the title contains all of those keywords;

archiving identification and matching: the title of the electronic file is matched against the identification rule using the archive scope keywords; if the match succeeds, the electronic file is transmitted to the archive reorganization module; if the match fails, it is transmitted to the bulk file database;

unstructured document:

text extraction: firstly, scanning texts in a document by a text recognition (OCR) technology to realize extraction of the texts;

keyword extraction: extracting document keywords from the text according to a keyword classification model based on a classification algorithm;

topic modeling analysis: modeling and analyzing the document according to the document keywords by a topic modeling technology to realize extraction of the document topics;

adding an identification rule: matching is carried out against the archive scope keywords; a match is considered to pass if the extracted topics contain the corresponding archive scope keywords, and the match-passed flow is triggered only when the extracted topics contain all of those keywords;

archiving identification and matching: the topics extracted from the electronic file are matched against the identification rule using the archive scope keywords; if the match succeeds, the electronic file is transmitted to the archive reorganization module; if the match fails, it is transmitted to the manual identification module.
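The two identification paths above can be sketched as a small routing function. This is only an illustrative sketch: the `ElectronicFile` class, the `route` function, the destination names and the representation of a rule as a tuple of keywords are all hypothetical, and the topics of an unstructured document are assumed to have been extracted upstream (OCR, keyword extraction, topic modeling).

```python
from dataclasses import dataclass, field


@dataclass
class ElectronicFile:
    structured: bool                             # result of the classification module
    title: str = ""                              # used for structural documents
    topics: list = field(default_factory=list)   # used for unstructured documents


def matches(text, rules):
    """A rule passes only if the text contains every keyword of that rule."""
    lowered = text.lower()
    return any(rule and all(k.lower() in lowered for k in rule) for rule in rules)


def route(doc, rules):
    """Return the (hypothetical) name of the module that receives the file."""
    if doc.structured:
        # Structural documents: match the title; failures go to the bulk file database.
        hit = matches(doc.title, rules)
        return "archive_reorganization" if hit else "bulk_file_database"
    # Unstructured documents: match the extracted topics; failures go to manual identification.
    hit = matches(" ".join(doc.topics), rules)
    return "archive_reorganization" if hit else "manual_identification"
```

Note the asymmetry the text describes: a failed structural document is discarded to the bulk file database, while a failed unstructured document gets a second, manual check.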

In a preferred embodiment, keyword extraction: extracting document keywords from text according to a keyword classification model based on a classification algorithm further comprises the steps of:

data acquisition, data cleaning, keyword marking, word vector construction, and construction of a word-vector-based binary classification model.

In a preferred embodiment, after word vector construction is completed, word embedding distance information and word vector dimension information are obtained for every pair of documents in the word vector set, wherein the word embedding distance information comprises the word embedding distance and the word vector dimension information comprises the maximum word vector dimension difference, and a document screening index is obtained from the word embedding distance and the maximum word vector dimension difference by weighted summation;

the document screening index is compared with a document screening index threshold;

if the document screening index is greater than or equal to the document screening index threshold, the one of the two documents whose word vector has the lower maximum dimension is deleted;

the screened word vectors are used as training data for the word-vector-based binary classification model.

In a preferred embodiment, the training data are preprocessed and input into each of several classification algorithms for training and prediction, and the performance of each algorithm is comprehensively evaluated by acquiring its training consumption information and training result feedback information;

the training consumption information comprises an abnormal time complexity coefficient, and the training result feedback information comprises a training result F1 value;

calculating an algorithm performance evaluation index from the abnormal time complexity coefficient and the training result F1 value by weighted summation; sorting the candidate algorithms in descending order of their algorithm performance evaluation indexes to obtain the algorithm use sequence; taking the first algorithm in the algorithm use sequence as the classification algorithm put into use first, and keeping the subsequent algorithms as standby algorithms.

In a preferred embodiment, during actual operation of the system, the running state of the classification algorithm currently in use is tracked and evaluated at a preset time interval, and the algorithm performance evaluation index obtained from each tracking evaluation is compared with an algorithm performance evaluation index reference threshold;

if the algorithm performance evaluation index is smaller than the reference threshold, the algorithm in use is replaced with the next algorithm in the algorithm use sequence, and the replaced algorithm is retrained and improved.
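The ranking and replacement logic of the two preceding embodiments might look roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the weights are left to the caller because the text only says "weighted summation", and the negative weight on the abnormal time complexity coefficient in the usage example is an assumption (so that a slower algorithm scores lower), not something the text specifies.

```python
def performance_index(time_coeff, f1, w_time, w_f1):
    """Weighted sum of the abnormal time complexity coefficient and the F1 value."""
    return w_time * time_coeff + w_f1 * f1


def usage_sequence(candidates, w_time, w_f1):
    """candidates maps an algorithm name to (time_coeff, f1).

    Returns the names sorted by performance index, largest first; the first
    entry is the algorithm put into use, the rest are standby algorithms.
    """
    return sorted(
        candidates,
        key=lambda name: performance_index(*candidates[name], w_time, w_f1),
        reverse=True,
    )


def algorithm_after_tracking(sequence, in_use, tracked_index, reference_threshold):
    """Decide which algorithm should run after one tracking evaluation."""
    if tracked_index >= reference_threshold:
        return in_use                      # still healthy, keep it
    position = sequence.index(in_use)
    if position + 1 < len(sequence):
        return sequence[position + 1]      # switch to the next standby algorithm
    return in_use                          # no standby left (assumed fallback)
```

For example, with candidates `{"A": (0.2, 0.90), "B": (0.5, 0.92), "C": (0.1, 0.80)}`, `w_time=-0.5` and `w_f1=1.0`, the indexes are 0.80, 0.67 and 0.75, so the use sequence is A, C, B; if A later scores below the reference threshold, C takes over while A is retrained.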

The invention has the technical effects and advantages that:

1. According to the invention, the electronic files transferred by each department are classified into structural documents and unstructured documents, and a different processing mode is used to identify each class. This reduces the errors that arise when unstructured documents cannot be handled during archiving identification: structural documents can be archived quickly according to preset rules, while unstructured documents are processed in a more complex way without affecting the processing speed of structural documents. The reorganized electronic files are encrypted, so that the text of an electronic file cannot be obtained even if the file is intercepted on the network while being transmitted for borrowing, and an abstract of each electronic file is generated based on keywords so that archive workers can quickly grasp the main content of the file.

2. According to the invention, document keywords are extracted from the text by a keyword classification model based on a binary classification algorithm, through the steps of data acquisition, data cleaning, keyword marking, word vector construction and so on. After word vector construction is completed, the documents in the word vector set are evaluated using word embedding distance information and word vector dimension information, which reduces the redundancy of the training set and prevents the model output from being overly concentrated and thus falling short of expectations in actual use.

3. According to the invention, after the word vectors are determined, the binary classification model is constructed by combining several classification algorithms. The performance of each algorithm is comprehensively evaluated from its training consumption information and training result feedback information, the algorithms are prioritized according to the evaluation results, and the first-ranked algorithm is put into use. The running state of the algorithm in use is evaluated continuously; when it shows a downward trend, the algorithm in use is replaced with the next algorithm in the algorithm use sequence and the replaced algorithm is retrained and improved, which prevents an algorithm in a poor running state from continuing to run and causing errors in subsequent output.

Drawings

For the convenience of those skilled in the art, the present invention will be further described with reference to the accompanying drawings;

FIG. 1 is a schematic structural diagram of embodiment 1 of the present invention;

FIG. 2 is a schematic structural diagram of embodiment 2 of the present invention;

FIG. 3 is a schematic structural diagram of embodiment 2 of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

The invention provides an electronic archive information intelligent management system shown in figure 1, which comprises an archive classification module, an archive identification module, a manual identification module, an archive reorganization module, an archive encryption module, a search abstract generation module, a circulation file library, an electronic archive library and a bulk file database, wherein the modules are connected through signals;

the file classification module is used for dividing the electronic files transmitted by the circulation file library into structural files and unstructured files according to preset division standards, and transmitting the classified electronic files to the archive identification module, and comprises the following steps:

determining a division standard: a division standard is first defined manually according to the document type, as follows: if the document has an explicit title, or its content is written to an explicit template and standard format (for example, the name, contact information, educational experience and work experience each carry a clear label), the document is classed as a structural document; if the document has no explicit title or standard format and its content is merely freely organized text, it is classed as an unstructured document;

dividing the electronic file: dividing the electronic files transmitted by the circulation file library into structural files and unstructured files according to the dividing standard, and transmitting the classified electronic files to an archiving identification module;
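A minimal sketch of how such a division standard could be applied is given below. The detection of "clearly labeled" content by looking for `Label: value` lines, the `classify` function and the `template_labels` parameter are all assumptions made for illustration; the text leaves the concrete standard to be defined manually.

```python
import re

# Assumed convention: a line such as "Name: Li Hua" counts as a labeled field.
LABELED_LINE = re.compile(r"^[ \t]*([A-Za-z][A-Za-z ]*?)[ \t]*:[ \t]*\S", re.MULTILINE)


def classify(title, body, template_labels):
    """Return "structural" or "unstructured" for one electronic file.

    title           -- explicit document title, or None/"" if there is none
    body            -- document text
    template_labels -- labels that a templated document is expected to carry
    """
    has_title = bool(title and title.strip())
    found = {m.group(1).lower() for m in LABELED_LINE.finditer(body)}
    follows_template = bool(template_labels) and all(
        label.lower() in found for label in template_labels
    )
    return "structural" if has_title or follows_template else "unstructured"
```

A resume-like text with `Name:`, `Contact:`, `Education:` and `Work experience:` lines is classed as structural even without a title, whereas a free-form note with neither title nor labels is classed as unstructured.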

the archive identification module is used for judging whether the classified electronic archives need to be archived according to different identification modes, transmitting the electronic archives needing to be archived to the archive reorganization module, and transmitting the electronic archives not needing to be archived to the bulk file database, and comprises the following steps:

A. structural document:

the documents that define the standard archiving scope include the regulations on the archiving scope of an organ's document materials and on the retention periods of document archives;

B. unstructured document:

it should be noted that, because an unstructured document has no definite title or label format, its topic needs to be extracted, and archiving identification is performed according to the extracted topic;

archiving identification and matching: the topics extracted from the electronic file are matched against the identification rule using the archive scope keywords; if the match succeeds, the electronic file is transmitted to the archive reorganization module; if the match fails, it is transmitted to the manual identification module;

the manual identification module is used for performing secondary identification on unstructured documents that failed the first archiving identification, in which a staff member looks up keywords, establishes a topic and judges whether archiving is needed, and comprises the following steps:

manual identification: a staff member performs secondary identification on the unstructured document that failed the first archiving identification, looks up the keywords of the document, establishes the document topic, and performs archiving identification based on working experience;

if the document needs to be archived, it is transmitted to the archive reorganization module; if not, it is transmitted to the bulk file database;

the electronic files that have passed secondary manual identification are transmitted to the archive identification module as training data for the model used for unstructured documents, and the model is retrained and improved regularly;

the file reorganization module is used for reorganizing the electronic files which are required to be archived after identification, and transmitting the reorganized electronic files to the file encryption module, and comprises the following steps:

it should be noted that the reorganization operation corresponds to the boxing of physical archives, that is, a virtual box number is assigned to each electronic file;

definition of reorganization rules: reorganization rules are defined, which should specify which metadata fields are used for reorganization (for example archive type, year and organization) and how box numbers are defined; a naming rule for virtual box numbers is determined, for example generating a unique box number from metadata such as the year and archive type;

it should be noted that the reorganization rules may be defined with reference to each region's table of document archiving scope and retention periods, for example the table of document archiving scope and retention periods for Beijing municipal departments;

metadata extraction: the relevant metadata, such as archive type, year and organization, is extracted from each electronic file to be reorganized, either automatically or by manual entry;

generating a box number: generating a virtual box number based on the reorganization rule and metadata extraction so as to identify the reorganization box of the electronic file;

electronic archive association: associating the electronic file with the generated box number, and transmitting the associated electronic file to a file encryption module;
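The reorganization steps above can be illustrated with a short sketch. The field names (`year`, `archive_type`, `organization`), the hyphen-separated naming rule and both function names are hypothetical; the text only requires that the box number be derived from the chosen metadata fields. In this sketch, files with identical metadata share one virtual box, by analogy with physical boxing.

```python
DEFAULT_FIELDS = ("year", "archive_type", "organization")


def make_box_number(metadata, fields=DEFAULT_FIELDS):
    """Build a virtual box number from the metadata fields named by the rule."""
    missing = [
        f for f in fields
        if metadata.get(f) is None or not str(metadata[f]).strip()
    ]
    if missing:
        # Metadata must be completed (automatically or manually) before boxing.
        raise ValueError(f"missing metadata fields: {missing}")
    parts = [str(metadata[f]).strip().upper().replace(" ", "_") for f in fields]
    return "-".join(parts)


def associate(files, fields=DEFAULT_FIELDS):
    """files maps a file id to its metadata; returns file id -> box number."""
    return {file_id: make_box_number(meta, fields) for file_id, meta in files.items()}
```

For instance, metadata `{"year": 2024, "archive_type": "contract", "organization": "finance dept"}` yields the box number `2024-CONTRACT-FINANCE_DEPT` under this assumed naming rule.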

the archive encryption module is used for encrypting the reorganized electronic files, ensuring that the text of an electronic file cannot be obtained even if the file is intercepted on the network while being transmitted for borrowing, and comprises the following steps:

generating a message digest of the borrowed electronic file by using a hash algorithm;

generating a digital signature I for the digest by using the private key of the issuing organization's system;

encrypting the original text of the electronic file and the digital signature I by using a symmetric encryption algorithm to form a ciphertext K;

encrypting the ciphertext K and the key of the symmetric encryption algorithm by using the public key of the receiver to form a ciphertext L;

the receiver decrypts the ciphertext L to obtain the ciphertext K and the symmetric encryption key;

the receiver decrypts the ciphertext K by using the symmetric encryption key to obtain the original text of the electronic file and the digital signature I;

the receiver decrypts the digital signature I by using the public key of the issuing organization to obtain the message digest computed by the sender;

the receiver computes a message digest from the received archive text;

the two message digests are compared: if they are the same, the verification passes; if not, the verification fails.
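The last two steps of this procedure, recomputing the digest and comparing it with the one recovered from the signature, can be sketched with the Python standard library alone. The signing and encryption steps are deliberately left out of this sketch. SHA-256 is an assumption, since the text does not name a hash algorithm, and the function names are hypothetical.

```python
import hashlib
import hmac


def message_digest(archive_text: bytes) -> bytes:
    """Digest of the archive text (SHA-256 is assumed here)."""
    return hashlib.sha256(archive_text).digest()


def verify(received_text: bytes, digest_from_signature: bytes) -> bool:
    """Compare the recomputed digest with the digest recovered from the signature.

    hmac.compare_digest performs the comparison in constant time.
    """
    return hmac.compare_digest(message_digest(received_text), digest_from_signature)
```

If even one byte of the received text differs from what the sender hashed, the two digests differ and verification fails.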

The search summary generating module is used for generating an electronic file summary of the encrypted electronic file based on the keywords so that a file worker can quickly acquire main contents of the file, and transmitting the electronic file with the generated summary to the electronic archive, and comprises the following steps:

keyword extraction: extracting keywords from the text according to a keyword classification model based on a classification algorithm;

it should be noted that, for unstructured documents, this step may be omitted because the document keywords have been obtained in the archival authentication module;

extracting key sentences: each sentence of the document content in the electronic file is traversed; if a sentence contains keywords, it is added to a candidate key sentence set and the number of keywords it contains is recorded; after the traversal, the sentences in the key sentence set are sorted in descending order of the number of keywords they contain;

retrieval abstract generation: the first key sentence in the sorted key sentence set is taken as the selected key sentence, and the selected key sentence together with the M sentences before it and the M sentences after it is taken as the retrieval abstract of the electronic file;

it should be noted that M, the number of sentences taken before and after, is a variable that can be set according to actual conditions;
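The key sentence extraction and abstract generation steps can be sketched as follows. The sentence splitter is a simplifying assumption (it splits on `.`, `!` or `?` followed by whitespace, which suits English sample text but not Chinese), ties between equally scored sentences are broken by document order, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def search_summary(text, keywords, m=1):
    """Return the top key sentence plus the m sentences before and after it."""
    sentences = split_sentences(text)
    candidates = []  # (keyword count, sentence index) for sentences containing keywords
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        count = sum(1 for k in keywords if k.lower() in lowered)
        if count > 0:
            candidates.append((count, index))
    if not candidates:
        return ""
    # Sort by keyword count, largest first; earlier sentences win ties.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    best = candidates[0][1]
    start = max(0, best - m)
    end = min(len(sentences), best + m + 1)
    return " ".join(sentences[start:end])
```

With `m=0` the abstract is just the single sentence containing the most keywords; larger values of `m` add surrounding context, clipped at the start and end of the document.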

the circulation file library is used for receiving the electronic files transmitted by each department, namely an electronic file circulation library;

the bulk file database is used for storing the electronic files that do not need to be archived and for cleaning them up regularly;

According to the invention, the electronic files transferred by each department are classified into structural documents and unstructured documents, and a different processing mode is used to identify each class. This reduces the errors that arise when unstructured documents cannot be handled during archiving identification: structural documents can be archived quickly according to preset rules, while unstructured documents are processed in a more complex way without affecting the processing speed of structural documents. The reorganized electronic files are encrypted, so that the text of an electronic file cannot be obtained even if the file is intercepted on the network while being transmitted for borrowing, and an abstract of each electronic file is generated based on keywords so that archive workers can quickly grasp the main content of the file.

Example 2

In the archive identification module of the above embodiment, unstructured documents are identified in a different way from structural documents: the text in the document is first scanned by text recognition (OCR) to extract the text; document keywords are then extracted from the text by a keyword classification model based on a classification algorithm; finally, the document is modeled and analyzed according to the document keywords by a topic modeling technique to extract the document topic. Extracting document keywords from the text by the keyword classification model based on a classification algorithm specifically comprises the following steps:

data acquisition: collecting the text data sets to be processed, covering various document types such as articles, reports and news;

it should be noted that the text data sets may be obtained in a variety of ways; for example, well-known data set providers include government agencies, universities, research institutions and online data set repositories, and text data may also be captured from the internet with web crawler tools;

data cleaning: the text data set is first cleaned, including denoising, word segmentation, stop-word removal and word frequency statistics;

it should be noted that word segmentation means splitting a sentence into words, which is necessary because Chinese has no explicit separators between words; stop words are words that do not represent the subject or content of the text, typically common conjunctions, prepositions, articles, pronouns and other widely used function words.
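The cleaning steps can be sketched as below. This is an illustration on English text only: real Chinese text would need a dedicated word segmenter, which is replaced here by a simple regular expression, and the treatment of markup tags as "noise", the tiny stop-word list and the function names are all assumptions.

```python
import re
from collections import Counter

# Deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"}


def clean(text, stop_words=STOP_WORDS):
    """Denoise, segment and remove stop words; returns the remaining tokens."""
    text = re.sub(r"<[^>]+>", " ", text)               # denoising: drop markup tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())    # segmentation (stand-in for a real segmenter)
    return [t for t in tokens if t not in stop_words]  # stop-word removal


def word_frequencies(tokens):
    """Word frequency statistics over the cleaned tokens."""
    return Counter(tokens)
```

Applied to `"<p>The archive of the archive is in the box.</p>"`, `clean` keeps the tokens `archive`, `archive`, `box`, and `word_frequencies` counts `archive` twice and `box` once.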

keyword marking: each word or phrase of each text document in the text data set is labeled as a keyword or not, for example 1 for a keyword and 0 for a non-keyword;

word vector construction: a word vector is constructed for each text document in the text data set to obtain a word vector set; to combine word frequency with the semantic relations among keywords, the outputs of four algorithms, namely tf-idf, TextRank, LSA and word2vec, are used as the word vectors;

tf-idf: tf is the term frequency, that is, the number of occurrences of a word divided by the total number of words in the article; with $|R|$ the number of occurrences of the word R in the article T and $|T|$ the total number of words in the article T, tf is expressed as: $\mathrm{tf} = |R| / |T|$;

idf is the inverse document frequency, expressed as: $\mathrm{idf} = \log(|S_R| / |S| + 1)$, where $|S_R|$ is the number of documents in the whole document set that contain the word R, and $|S|$ is the number of documents in the whole document set;

the tf-idf value is expressed as: $\text{tf-idf} = \mathrm{tf} \times \mathrm{idf}$;

The TextRank algorithm is a graph-based ranking algorithm used for keyword extraction. Its basic principle is to build a graph in which the nodes represent the words or phrases in the text and the edges represent the co-occurrence relationships between them. The weight of the edge between two nodes may represent the correlation between the two words or phrases. In general, a window size is used to determine co-occurrence: if two words appear within the same window in the text, there is an edge between them. After the graph is constructed, an iterative algorithm (similar to PageRank) is used to calculate a score for each node (word or phrase). The score of a node represents its importance, with higher-scoring nodes considered more important. The words or phrases are ranked by node score to determine the most important ones, which may serve as the result of keyword extraction.
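A compact sketch of this procedure is shown below. It assumes the text has already been cleaned and tokenized; the function name, the default window of 2, the damping factor of 0.85 and the fixed iteration count are illustrative choices, not values taken from the text.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on their co-occurrence graph.

    Two different words are linked if they occur within `window` positions
    of each other; repeated co-occurrence increases the edge weight.
    Returns (word, score) pairs, highest score first.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    nodes = list(weights)
    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            rank = 0.0
            for neighbour, weight in weights[node].items():
                # Each neighbour shares its score in proportion to edge weight.
                rank += weight / sum(weights[neighbour].values()) * scores[neighbour]
            updated[node] = (1.0 - damping) + damping * rank
        scores = updated
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In a token sequence where one word co-occurs with all the others, that word ends up with the highest score, which matches the intuition that a well-connected word is a likely keyword.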

The LSA (latent semantic analysis) algorithm is a natural language processing technique based on matrix decomposition, used for capturing the latent semantic structure among document keywords;

word2vec is a natural language processing technique for converting words into vector representations, whose purpose is to capture semantic relationships between words so that words can be represented in vector space and similar words are close to each other in vector space.

Word vector construction with the above algorithms yields four word vector representations: tf-idf and TextRank are representations based on word frequency, while LSA and word2vec are topic-related representations. The word vectors used for training are formed by combining the word-frequency and topic word vectors. Because the documents in the text data set are collected with some randomness, the generated word vectors may be highly similar to one another, which indicates that two different documents are highly similar in content. If both were used as word vectors for model training, the output of the model would be overly concentrated and the model could not cope with more complex scenarios in actual operation. Therefore, the documents in the word vector set are evaluated using word embedding distance information and word vector dimension information; the structure is shown in FIG. 2, and the specific steps are as follows:

the data acquisition unit is used for acquiring word embedding distance information and word vector dimension information of every two documents in the word vector set, wherein the word embedding distance information comprises a word embedding distance, and the word vector dimension information comprises a word vector maximum dimension difference value;

the word embedding distance and the maximum word vector dimension difference are denoted $d$ and $\Delta$ respectively;

The word embedding distance is a distance measure between different word vectors and is used to measure the similarity between two articles; the higher the word embedding distance, the higher the similarity between the two articles in the text data set, and the higher the probability that one of the documents is deleted from the word vector set;

in an optional example, the word embedding distance may be obtained by the Euclidean distance formula, expressed as: $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $d$ is the word embedding distance, $x_i$ and $y_i$ are the components of the two document word vectors in dimension $i$, and $n$ is the smaller of the maximum dimensions of the two word vectors; for example, if the maximum dimension of the word vector of document C is 5 and that of document D is 10, then $n$ is 5;

the maximum word vector dimension difference is the difference between the maximum dimensions of the two word vectors and measures how much the word vectors of the two documents differ in dimension; the higher the difference, the larger the difference in scope covered by the content of the two documents, and the higher the probability that one of them is deleted from the word vector set;

the model generating unit is used for taking the word embedding distance and the maximum word vector dimension difference acquired for every two documents in the word vector set, generating a document screening index KH from these two indexes through a constraint model, comparing the document screening index with a document screening index threshold, and deleting repeated, highly similar documents so that they are not used in the training set;

it should be noted that the document screening index is generated through the constraint model from the word embedding distance and the maximum dimension difference value of the word vector, as a weighted sum expressed as follows: $KH = \alpha_1 d + \alpha_2 \Delta$, where $d$ is the word embedding distance, $\Delta$ is the maximum dimension difference value of the word vector, and $\alpha_1$, $\alpha_2$ are their respective weight factors, whose specific values can be set according to actual conditions; the larger the document screening index, the greater the probability of deleting the evaluated document;

the comparison unit is used for comparing the document screening index with a document screening index threshold value and deleting the documents which are repeated and have higher similarity;

if the document screening index is greater than or equal to the document screening index threshold, the document whose word vector has the lower maximum dimension is deleted (if the maximum dimensions are equal, one of the two is deleted at random), so as to reduce the redundancy of the training set and prevent the model's output from becoming overly concentrated, which would make actual use fall short of expectations;

if the document screening index is smaller than the document screening index threshold, deleting is not performed, which indicates that the two documents have larger difference in content and can be used as training data;
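The screening rule described above can be sketched as follows. This is an illustrative reading only: it assumes the document screening index is the weighted sum of the word embedding distance and the maximum dimension difference, applies the deletion rule exactly as the text states it, and uses hypothetical names (`screening_index`, `screen_documents`), weights and a seeded random generator for the tie case.

```python
import math
import random
from typing import Dict, List, Optional, Sequence


def screening_index(x: Sequence[float], y: Sequence[float],
                    w_dist: float, w_dim: float) -> float:
    """Weighted sum of Euclidean distance and maximum dimension difference."""
    n = min(len(x), len(y))
    distance = math.sqrt(sum((x[i] - y[i]) ** 2 for i in range(n)))
    dim_diff = abs(len(x) - len(y))
    return w_dist * distance + w_dim * dim_diff


def screen_documents(vectors: Dict[str, Sequence[float]], threshold: float,
                     w_dist: float = 0.5, w_dim: float = 0.5,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Return the names of the documents kept as training data."""
    rng = rng or random.Random(0)  # assumption: seeded only for reproducibility
    kept = dict(vectors)
    names = sorted(vectors)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a not in kept or b not in kept:
                continue  # one of the pair was already deleted
            if screening_index(kept[a], kept[b], w_dist, w_dim) < threshold:
                continue  # below the threshold: keep both documents
            if len(kept[a]) == len(kept[b]):
                drop = rng.choice([a, b])  # equal dimensions: delete one at random
            else:
                drop = a if len(kept[a]) < len(kept[b]) else b
            del kept[drop]
    return sorted(kept)


docs = {"A": [0.0, 0.0, 0.0], "B": [3.0, 4.0, 0.0, 0.0, 0.0], "C": [0.0, 0.0, 0.0, 1.0]}
kept_names = screen_documents(docs, threshold=6.5, w_dist=1.0, w_dim=1.0)
```

With these sample values the pair A and B scores 7.0 and document A (the lower dimension) is deleted, while the pair B and C stays below the threshold and both are kept.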

taking the screened word vector as training data of a model;

after determining the word vectors, a classification model is required to be constructed, and the screened word vectors are used as training data to extract keywords;

it should be noted that, the above classification algorithms are existing mature algorithms, and are not described herein;

before model training, the data need to be preprocessed, including the handling of outliers, missing values and data imbalance;
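The text names the three preprocessing concerns but not the techniques used for them. The following sketch therefore rests entirely on assumptions: median imputation for missing values, interquartile-range clipping for outliers, and random oversampling of the minority class for imbalance. All function names are hypothetical.

```python
import random
import statistics
from typing import List, Optional, Tuple


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def clip_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Clip values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]


Sample = Tuple[List[float], int]  # (feature vector, label: 1 = keyword, 0 = not)


def oversample_minority(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate random minority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        return list(samples)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(samples) + extra


filled = fill_missing([1.0, None, 3.0])
clipped = clip_outliers([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])
balanced = oversample_minority([([0.1], 1), ([0.2], 0), ([0.3], 0), ([0.4], 0)])
```

Other choices (deleting incomplete records, undersampling the majority class, and so on) would fit the description equally well.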

inputting the training data into each of the four classification algorithms for training and prediction, comprehensively evaluating the performance of each algorithm by acquiring its training consumption information and training result feedback information, and ranking the four algorithms by priority according to the evaluation results to obtain the algorithm use sequence to be put into use;

the data acquisition unit acquires training consumption information and training result feedback information of the four algorithms;

the abnormal time complexity coefficient and the training result F1 value are denoted $C_T$ and $F_1$, respectively;

the abnormal time complexity coefficient measures how badly the algorithm performs on certain input data, for which its execution time is markedly longer; the higher the abnormal time complexity coefficient, the worse the performance of the algorithm and the later its position in the algorithm use sequence;

the acquisition logic of the anomaly time complexity coefficient is as follows:

acquiring the algorithm execution time $T_j$ corresponding to each group of training data by recording the algorithm execution time, and calculating the average execution time $\bar{T}$, expressed as follows: $\bar{T} = \frac{1}{v} \sum_{j=1}^{v} T_j$, where $j$ is the sequence number of each group of training data, $j = 1, 2, \ldots, v$, and $v$ is a positive integer; the algorithm execution time $T_j$ corresponding to each group of training data is compared with the average execution time $\bar{T}$: when $T_j \geq \bar{T}$, the algorithm execution time is at or above the average level and is in an abnormal state, and it is marked as $T_s$; when $T_j < \bar{T}$, the algorithm execution time is below the average level and is in a normal state; the abnormal time complexity coefficient is calculated from the abnormal-state algorithm execution times $T_s$ and the average execution time $\bar{T}$ [expression not recoverable from the source text], where $s$ is the sequence number of each group of training data whose algorithm execution time is in the abnormal state, $s = 1, 2, \ldots, r$, and $r$ is a positive integer;
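The recoverable part of this acquisition logic, namely averaging the execution times and marking those at or above the average as abnormal, can be sketched as follows. The final coefficient is deliberately not computed, because its expression is not given in this text; the function name `split_execution_times` and the sample timings are hypothetical.

```python
from typing import List, Tuple


def split_execution_times(times: List[float]) -> Tuple[float, List[float]]:
    """Return the average execution time and the times in the abnormal state."""
    if not times:
        raise ValueError("at least one execution time is required")
    mean_time = sum(times) / len(times)
    # A time at or above the average counts as abnormal, as stated in the text.
    abnormal = [t for t in times if t >= mean_time]
    return mean_time, abnormal


# Hypothetical execution times (seconds) for v = 4 groups of training data.
mean_time, abnormal_times = split_execution_times([1.0, 2.0, 3.0, 6.0])
```

Here the average is 3.0, so the groups that took 3.0 and 6.0 seconds are the abnormal ones ($r = 2$).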

the training result F1 value is the harmonic mean of precision and recall and is used for comprehensively evaluating the training result of the algorithm; the larger the training result F1 value, the better the performance of the algorithm and the earlier its position in the algorithm use sequence;

by analysing the prediction results of the algorithm, obtaining the number of training data items correctly predicted by the algorithm as keywords, the number of training data items incorrectly predicted by the algorithm as keywords, and the number of keyword training data items not predicted by the algorithm as keywords, denoted $TP$, $FP$ and $FN$, respectively;

the training result F1 value can be calculated by the following formula: $F_1 = \frac{2PR}{P + R}$, where $P$ is the precision, expressed as $P = \frac{TP}{TP + FP}$, and $R$ is the recall, expressed as $R = \frac{TP}{TP + FN}$, with $TP$, $FP$ and $FN$ being the numbers of training data items correctly predicted as keywords, incorrectly predicted as keywords, and not predicted as keywords, respectively;
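As a small worked sketch of the F1 calculation, assuming the three counts are already available as integers (the function name `f1_value` and the zero-division handling are illustrative choices, not taken from the source):

```python
def f1_value(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from prediction counts.

    tp: items correctly predicted as keywords
    fp: items incorrectly predicted as keywords
    fn: keyword items the algorithm did not predict as keywords
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0  # assumption: define F1 as 0 when nothing was predicted correctly
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct, 2 wrongly flagged, 2 missed keywords.
score = f1_value(tp=8, fp=2, fn=2)
```

With these counts both precision and recall are 0.8, so the F1 value is also approximately 0.8.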

the model generating unit is used for establishing an algorithm performance evaluation model according to the abnormal time complexity coefficient and the training result F1 value so as to generate an algorithm performance evaluation index;

The sorting unit is used for sorting the priorities of the four algorithms according to the algorithm performance evaluation indexes to obtain an algorithm use sequence which is put into use;

it should be noted that an algorithm performance evaluation model is established according to the abnormal time complexity coefficient and the training result F1 value to generate the algorithm performance evaluation index; in the formula of the model [expression not recoverable from the source text], the abnormal time complexity coefficient and the training result F1 value each carry a weight factor, whose specific value can be set according to actual conditions; the larger the algorithm performance evaluation index, the earlier the algorithm is placed in the algorithm use sequence;

sorting the four selected algorithms from large to small according to the algorithm performance evaluation index to obtain the algorithm use sequence to be put into use, taking the algorithm in the first position of the algorithm use sequence as the classification algorithm put into initial use, and taking the subsequent algorithms as standby algorithms;
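The ordering step can be sketched as follows, assuming the algorithm performance evaluation index of each algorithm has already been computed elsewhere. The algorithm names and index values are placeholders, and `build_use_sequence` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def build_use_sequence(indices: Dict[str, float]) -> Tuple[str, List[str]]:
    """Sort algorithms by evaluation index, largest first.

    Returns the algorithm put into initial use and the ordered standby algorithms.
    """
    if not indices:
        raise ValueError("at least one algorithm is required")
    ranked = sorted(indices, key=lambda name: indices[name], reverse=True)
    return ranked[0], ranked[1:]


# Placeholder evaluation indexes for four unnamed classification algorithms.
primary, standby = build_use_sequence(
    {"algo_a": 0.62, "algo_b": 0.91, "algo_c": 0.47, "algo_d": 0.78}
)
```

The standby list preserves the ranking, so the next replacement candidate is always its first element.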

during actual operation of the system, the running state of the classification algorithm in initial use is tracked and evaluated at a preset time interval;

it should be noted that the follow-up tracking evaluation uses the same method as the algorithm performance evaluation, the difference being that the algorithm now processes the electronic archives that actually need to be processed while the system is running;

the algorithm performance evaluation unit compares the algorithm performance evaluation index obtained by the follow-up tracking evaluation with an algorithm performance evaluation index reference threshold to judge whether the running state of the algorithm in initial use shows a downward trend;

if the algorithm performance evaluation index is greater than or equal to the algorithm performance evaluation index reference threshold, the running state of the current algorithm is good, other operations are not needed, and the running state of the algorithm is continuously tracked and evaluated according to a preset time interval;

if the algorithm performance evaluation index is smaller than the algorithm performance evaluation index reference threshold, the running state of the current algorithm is poor and may cause errors in subsequent output; the algorithm in use is replaced by the next algorithm in the algorithm use sequence, and the replaced algorithm is retrained and refined;

it should be noted that the training data used to retrain and refine the replaced algorithm come from the electronic archives transmitted by the manual identification module;
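The tracking and replacement behaviour described above can be sketched as a small state holder. Two points are assumptions rather than statements of the source: the sequence wraps around to the first algorithm after the last one, and retraining is represented only by a queue of algorithm names. The class and attribute names are hypothetical.

```python
from typing import List


class AlgorithmUseSequence:
    """Tracks which algorithm of a ranked sequence is currently in use."""

    def __init__(self, ranked: List[str]) -> None:
        if not ranked:
            raise ValueError("ranked sequence must not be empty")
        self.ranked = list(ranked)
        self.position = 0
        self.retrain_queue: List[str] = []  # replaced algorithms awaiting retraining

    @property
    def in_use(self) -> str:
        return self.ranked[self.position]

    def review(self, index: float, reference_threshold: float) -> str:
        """Apply one periodic evaluation and return the algorithm now in use."""
        if index >= reference_threshold:
            return self.in_use  # running state is good: no change
        self.retrain_queue.append(self.in_use)
        # Assumption: wrap around once the end of the sequence is reached.
        self.position = (self.position + 1) % len(self.ranked)
        return self.in_use


sequence = AlgorithmUseSequence(["algo_a", "algo_b", "algo_c"])
first = sequence.review(index=0.9, reference_threshold=0.7)
second = sequence.review(index=0.5, reference_threshold=0.7)
```

In a running system, `review` would be called once per preset time interval with the index computed on the electronic archives actually processed.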

According to the invention, document keywords are extracted from the text by a keyword classification model based on a binary classification algorithm, through the steps of data acquisition, data cleaning, keyword marking, word vector construction and so on; after word vector construction is completed, the documents in the word vector set are evaluated pairwise by acquiring word embedding distance information and word vector dimension information, which reduces the redundancy of the training set and prevents the model's output from becoming overly concentrated and falling short of expectations in actual use.

According to the invention, after the word vectors are determined, a binary classification model is constructed using a mix of several classification algorithms; the performance of each algorithm is comprehensively evaluated by acquiring its training consumption information and training result feedback information, the algorithms are ranked by priority according to the evaluation results, and the first-ranked algorithm is put into use; the running state of the algorithm in use is evaluated continuously, and when it shows a downward trend, the algorithm in use is replaced by the next algorithm in the algorithm use sequence and the replaced algorithm is retrained and refined, so that an algorithm in a poor running state does not keep running and cause errors in subsequent output.

The above formulas are all dimensionless and operate on numerical values only; each formula was obtained by software simulation on a large amount of collected data so as to approximate the real situation as closely as possible, and the preset parameters in the formulas are set by those skilled in the art according to actual conditions.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more sets of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.

It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The foregoing describes merely specific embodiments of the present application, but the scope of protection of the present application is not limited thereto; any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present application is intended to be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims

1. An intelligent management system for electronic archive information, characterized in that: the system comprises an archive classification module, an archive identification module, a manual identification module, an archive reorganization module, an archive encryption module, a retrieval abstract generation module, a circulation file library, an electronic archive library and a bulk file database, the modules being in signal connection with one another;

the archive identification module is used for judging whether a classified electronic archive needs to be archived, adopting a different identification mode according to the category of the electronic archive, as follows:

structured document:

archive identification matching: matching the electronic archive title against the archiving-range keywords of the identification rule; if the matching succeeds, transmitting the electronic archive to the archive reorganization module; if the matching fails, transmitting it to the bulk file database;

unstructured document:

text extraction: first, scanning the text in the document by optical character recognition (OCR) to extract the text;

archive identification matching: matching the theme extracted from the electronic archive against the archiving-range keywords of the identification rule; if the matching succeeds, transmitting the electronic archive to the archive reorganization module; if the matching fails, transmitting it to the manual identification module.

2. An electronic archive information intelligent management system according to claim 1, wherein: keyword extraction: extracting document keywords from text according to a keyword classification model based on a classification algorithm further comprises the steps of:

3. An electronic archive information intelligent management system according to claim 2, wherein: after the word vector is constructed, word embedding distance information and word vector dimension information of every two documents in a word vector set are obtained, wherein the word embedding distance information comprises word embedding distances, the word vector dimension information comprises word vector maximum dimension difference values, and the word embedding distances and the word vector maximum dimension difference values are calculated through weighted summation to obtain document screening indexes;

comparing the document screening index to a document screening index threshold;

4. An electronic archive information intelligent management system according to claim 3, wherein: preprocessing training data, respectively inputting the preprocessed training data into a classification algorithm for training prediction, and respectively comprehensively evaluating the performance of the algorithm by acquiring training consumption information and training result feedback information of the algorithm;

5. The intelligent electronic archive information management system of claim 4, wherein: in the actual running process of the system, tracking and evaluating the running state of the classification algorithm used for the first time according to a preset time interval, and comparing an algorithm performance evaluation index obtained by subsequent tracking and evaluating with an algorithm performance evaluation index reference threshold;