CN117216217B - Intelligent classification and retrieval method for files - Google Patents

Intelligent classification and retrieval method for files

Info

Publication number
CN117216217B
CN117216217B · Application CN202311204538.XA
Authority
CN
China
Prior art keywords
word
electronic
candidate
archive
candidate words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311204538.XA
Other languages
Chinese (zh)
Other versions
CN117216217A (en)
Inventor
Guo Xuejiao (郭雪娇)
Current Assignee
Shandong Huishangmai Network Technology Co ltd
Original Assignee
Shandong Huishangmai Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shandong Huishangmai Network Technology Co., Ltd.
Priority to CN202311204538.XA
Publication of CN117216217A
Application granted
Publication of CN117216217B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of classified retrieval and discloses an intelligent archive classification and retrieval method comprising the following steps: construct the electronic archive as a word-graph model and, from that model, compute the TFIDF feature and the position feature of each candidate word in the electronic archive; compute a probability transition matrix among the candidate words in the electronic archive; iterate the probability transition matrix to obtain an initial score for each candidate word; extract the K-core graph from the word-graph model and, based on it, compute the hierarchy feature and the average-information-entropy feature of each candidate word; on the basis of the initial scores, determine the keywords of the electronic archive by fusing the hierarchy features and the average-information-entropy features; and classify the archives according to their keywords while supporting fast keyword-based retrieval. The invention determines the keywords of an electronic archive from the positional importance, word-frequency importance, degree of association with other candidate words, and information content of the candidate words, thereby realizing classification and retrieval of electronic archives.

Description

Intelligent classification and retrieval method for files
Technical Field
The invention relates to the technical field of classified retrieval, and in particular to an intelligent archive classification and retrieval method.
Background
With the rapid development of information technology, the degree of informatization in colleges and universities has greatly improved and the number of their electronic archives has expanded rapidly, so electronic archive management in colleges and universities faces both great challenges and opportunities and requires more efficient and intelligent management modes to meet increasingly complex demands. Traditional university archive management is mainly manual: it is difficult to guarantee the security and integrity of the information, errors occur easily, retrieval is difficult, and the approach cannot satisfy the modern requirements of university informatization construction, so its shortcomings are increasingly prominent and an intelligent, automated electronic archive management method is needed. Aiming at this problem, the invention provides an intelligent archive classification and retrieval method that realizes intelligent understanding and adaptive classified retrieval of electronic archives by means of artificial-intelligence techniques, improving the intelligent management level of electronic archives.
Disclosure of Invention
In view of this, the present invention provides an intelligent archive classification and retrieval method with the following aims: 1) calculate the position distance between candidate words from the sentence numbers and the numbers of separating words of the candidate words in an electronic archive; establish edge relations between candidate words by comparing the position distance with a window threshold, thereby constructing a word-graph model; generate candidate-word position features characterizing positional importance and word-frequency importance, together with TFIDF features; calculate the jump probabilities between candidate words from the generated features to initialize a probability transition matrix; and determine the importance weight of each candidate word from the iterative calculation of the probability transition matrix, which serves as the candidate word's initial score, so that candidate-word importance is quantified; 2) determine the level of each candidate word in the K-core subgraph according to its node degree in the word-graph model, where a higher node degree means a stronger association of the candidate word with other candidate words and therefore a higher level; use the average-information-entropy feature to filter out candidate words that occur frequently but carry little information, thereby determining the keywords of the electronic archive; and classify the archives according to their keywords while supporting fast keyword-based retrieval.
The invention provides an intelligent archive classification and retrieval method, which comprises the following steps:
S1: construct the electronic archive as a word-graph model, and calculate the TFIDF feature and the candidate-word position feature of each candidate word in the electronic archive according to the word-graph model;
S2: calculate the probability transition matrix among the candidate words in the electronic archive according to the TFIDF features and the candidate-word position features;
S3: iterate the probability transition matrix to obtain the initial score of each candidate word;
S4: extract the K-core graph from the constructed word-graph model;
S5: calculate the hierarchy feature and the average-information-entropy feature of the candidate words in the electronic archive based on the extracted K-core graph;
S6: on the basis of the initial scores of the candidate words, determine the electronic-archive keywords by fusing the hierarchy features and the average-information-entropy features;
S7: classify the archives according to the electronic-archive keywords and support fast keyword-based retrieval of the archives.
As a further improvement of the present invention:
Optionally, in step S1, constructing the electronic archive as a word-graph model includes:
constructing each electronic archive as a word-graph model, where the construction flow for the i-th electronic archive D_i is as follows:
S11: perform sentence segmentation on electronic archive D_i, with result D_i = {s_{i,1}, s_{i,2}, ..., s_{i,J_i}},
where:
s_{i,j} represents the word-segmentation result of the j-th sentence in D_i, and J_i represents the number of sentences in D_i;
s_{i,j} = {w_{j,1}, w_{j,2}, ..., w_{j,n_j}}, where w_{j,u} represents the u-th word in s_{i,j} and n_j represents the total number of words in s_{i,j};
In the embodiment of the invention, punctuation marks are used to split the electronic archive into sentences, the Jieba word-segmentation tool is used to segment each sentence into words, and stop words (modal particles, prepositions, and the like) are removed from the segmentation result to obtain the sentence-and-word segmentation result;
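As an illustrative sketch of this segmentation step: the embodiment splits sentences on punctuation and tokenizes with Jieba for Chinese; here a hypothetical stop-word list and a regex tokenizer stand in so the example is self-contained.

```python
import re

# Hypothetical stop-word list standing in for the modal particles and
# prepositions removed in the embodiment; real use would load a full list
# and tokenize with jieba.lcut for Chinese text.
STOP_WORDS = {"the", "a", "of", "in", "is", "and", "to"}

def segment_archive(text):
    """Split an archive into sentences on punctuation, tokenize each
    sentence, and drop stop words."""
    sentences = [s for s in re.split(r"[.!?;。！？；]+", text) if s.strip()]
    return [[w for w in re.findall(r"\w+", s.lower()) if w not in STOP_WORDS]
            for s in sentences]

doc = "The archive system stores records. Records describe the archive."
print(segment_archive(doc))
# → [['archive', 'system', 'stores', 'records'], ['records', 'describe', 'archive']]
```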
S12: extract the nouns from the sentence-and-word segmentation result of D_i as the candidate words of D_i; the de-duplicated candidate word set of D_i is word_i = {word_{i,1}, word_{i,2}, ..., word_{i,m_i}},
where:
word_{i,g} represents the g-th candidate word of D_i, and m_i represents the number of candidate words of D_i;
S13: take the candidate words as the nodes of the word-graph model and calculate the position distance between different nodes; if the position distance between two nodes is smaller than a preset window threshold, an edge exists between the two nodes, otherwise no edge exists. The position distance between candidate words word_{i,g} and word_{i,h} is dist(word_{i,g}, word_{i,h}), computed as an exponential function with the natural constant e as its base of the number of words separating the two candidate words, adjusted by the numbers of sentences in which each candidate word appears,
where:
dist(word_{i,g}, word_{i,h}) represents the position distance between word_{i,g} and word_{i,h}, g, h ∈ {1, 2, ..., m_i};
d(word_{i,g}, word_{i,h}) represents the number of words between word_{i,g} and word_{i,h} in electronic archive D_i;
num(word_{i,g}) represents the number of times word_{i,g} occurs in the sentence-and-word segmentation result of D_i;
sen(word_{i,g}) represents the number of sentences of D_i in which word_{i,g} appears, sen(word_{i,g}) ≤ num(word_{i,g}).
In the embodiment of the invention, because the same candidate word may occur repeatedly at different positions in the electronic archive, the pair of occurrence positions with the minimum position distance is selected and the position distance of the two candidate words is calculated from it;
S14: form the word-graph model from the nodes and the edge information between nodes; the word-graph model of electronic archive D_i is G_i = (V_i, E_i),
where:
G_i represents the word-graph model of D_i, containing the node information V_i of the m_i candidate words and the edge information E_i between candidate-word nodes;
E_i represents the edge information between different candidate words in the word-graph model;
if dist(word_{i,g}, word_{i,h}) is smaller than the preset window threshold, an edge between candidate words word_{i,g} and word_{i,h} exists in the word-graph model; otherwise no edge exists between them.
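Under the simplifying assumption that the position distance is the raw token gap between the nearest occurrences of two candidates (the patent additionally applies an exponential transform before comparing against the window threshold), the word-graph construction of S13–S14 can be sketched as:

```python
from itertools import combinations

def build_word_graph(sentences, nouns, window=3):
    """Build an undirected word graph: nodes are candidate words (nouns);
    an edge joins two candidates whose nearest occurrences are separated
    by fewer than `window` tokens. The raw gap is used here as a
    simplifying assumption in place of the patent's exponential distance."""
    tokens = [w for s in sentences for w in s]
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in nouns}
    edges = set()
    for a, b in combinations(sorted(nouns), 2):
        if not positions[a] or not positions[b]:
            continue
        # minimum number of words strictly between the two occurrences
        gap = min(abs(i - j) for i in positions[a] for j in positions[b]) - 1
        if gap < window:
            edges.add((a, b))
    return {"nodes": set(nouns), "edges": edges}

g = build_word_graph([["archive", "system", "stores", "records"]],
                     {"archive", "system", "records"})
print(sorted(g["edges"]))
```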
Optionally, in the step S1, calculating TFIDF features and candidate word position features of each candidate word in the electronic archive according to the word graph model includes:
calculating TFIDF characteristics and candidate word position characteristics of each candidate word in the electronic archive according to the constructed word graph model, wherein the electronic archiveMiddle->Candidate words->The TFIDF feature and candidate word position feature calculation formula is:
wherein:
representing candidate words->In electronic files->The number of occurrences in the sentence-word segmentation processing result;
representing electronic archive->The total number of words in the sentence segmentation processing result; in an embodiment of the present invention, in the present invention,
n represents the total number of electronic files;
indicating the presence of candidate words->Electronic archive part of (2);
representing candidate words->TFIDF characteristics of (a);
representing candidate words->Is a candidate word location feature of (a).
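A minimal sketch of the TFIDF feature as defined above (occurrence count over total words, times the log of total archives over archives containing the word); the candidate-word position feature is not reproduced here because its exact formula is specific to the patent:

```python
import math

def tfidf(word, doc_tokens, corpus):
    """TF-IDF feature: occurrences of `word` in the archive over its total
    word count, times log(total archives / archives containing the word)."""
    tf = doc_tokens.count(word) / len(doc_tokens)
    df = sum(1 for d in corpus if word in d)  # archives containing the word
    return tf * math.log(len(corpus) / df)

corpus = [["archive", "system", "archive"], ["system", "design"]]
print(round(tfidf("archive", corpus[0], corpus), 4))
# → 0.4621
```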
Optionally, the step S2 of calculating a probability transition matrix between candidate words in the electronic archive includes:
calculating a probability transition matrix between candidate words in the electronic archive according to the calculated TFIDF characteristics of the electronic archive and the candidate word position characteristics, wherein the electronic archiveThe probability transition matrix calculation flow of (1) is as follows:
calculating to obtain electronic fileCandidate words->Jump to candidate word +.>Jump probability>
Wherein:
representing candidate words->Jump weights of (2);
generating an electronic archive according to the jump probabilityIs a probability transition matrix of (a):
wherein:
representing electronic archive->Candidate words->Probability transition matrices of (a);
representing electronic archive->Is a probability transition matrix of (a).
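A sketch of building the row-stochastic probability transition matrix; the jump weight combining the TFIDF and position features is passed in as a precomputed dict, since its exact composition is not specified in the recoverable text:

```python
def transition_matrix(graph_adj, weight):
    """Row-stochastic transition matrix: the probability of jumping from
    word g to neighbour h is h's jump weight normalised over g's
    neighbours; non-neighbours get probability 0."""
    words = sorted(graph_adj)
    P = []
    for g in words:
        total = sum(weight[h] for h in graph_adj[g]) or 1.0  # avoid /0 for isolated nodes
        P.append([weight[h] / total if h in graph_adj[g] else 0.0
                  for h in words])
    return words, P

adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
w = {"a": 2.0, "b": 1.0, "c": 1.0}
words, P = transition_matrix(adj, w)
print(P)
# → [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
```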
Optionally, in the step S3, iterative computation is performed on the probability transition matrix to obtain an initial score of the candidate word, including:
iterative computation is carried out on the probability transition matrix to obtain initial scores of candidate words, wherein the initial scores are electronic filesThe initial score calculation flow of the candidate words is as follows:
s31: according to the electronic fileProbability transition matrix->Generating an electronic archive->Weights of each candidate word in (a) wherein the candidate word is +.>The weight of (2) is:
wherein:
representing electronic archive->Candidate words->Weights of (2);
represents the damping coefficient, will->Set to 0.8;
s32: composing an electronic fileMiddle->Weight matrix of individual candidate words +.>
S33: setting the current iteration times of the weight matrix as t, and setting the initial value of t as 0, wherein the t-th iteration result of the weight matrix is that
S34: iterating the weight matrix, wherein the iteration formula of the weight matrix is as follows:
if it isIf the iteration number is smaller than the preset iteration threshold value, the iteration is stopped, and +.>As the final weight matrix, the element value of the mth column of the final weight matrix is the electronic file +.>Candidate words->Initial score of (2)
No order of noThe process returns to step S34.
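The iteration of S31–S34 is a damped, PageRank-style power iteration; a sketch with damping d = 0.8 and a convergence threshold follows (the uniform initial weights are an assumption):

```python
def initial_scores(words, P, d=0.8, tol=1e-8, max_iter=200):
    """Damped power iteration: W ← (1-d)/m + d·W·P, stopped once the
    total weight change falls below `tol`; the converged weights are
    the candidates' initial scores."""
    m = len(words)
    W = [1.0 / m] * m  # assumed uniform initialization
    for _ in range(max_iter):
        Wn = [(1 - d) / m + d * sum(W[g] * P[g][h] for g in range(m))
              for h in range(m)]
        if sum(abs(a - b) for a, b in zip(W, Wn)) < tol:
            W = Wn
            break
        W = Wn
    return dict(zip(words, W))

words = ["a", "b", "c"]
P = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
scores = initial_scores(words, P)
print(scores)
```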
Optionally, in the step S4, a K-nucleolus graph is extracted from the word graph model, including:
extracting a K nuclear graph from the constructed word graph model, wherein the word graph modelThe extraction flow of the K nuclear map in the (2) is as follows:
s41: initializing k=0, initializing a setWord graph model->Store to the collection->In, and initialize the collectionSet->Is empty;
s42: pair aggregationTraversing all the word graph model nodes in the model, screening to obtain word graph model nodes with node degree smaller than k, and storing the screened word graph model nodes into a set ∈>As the kth level node in the K kernel graph;
selecting nodes of the word graph model and edges connected with the nodes from a setPerforming medium recursion deletion;
s43: if it isLess than K, let->The process returns to step S42. In the embodiment of the invention, the node degree is the number of edges connected with the node.
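The K-core extraction of S41–S43 can be sketched as a recursive-deletion decomposition; nodes removed while the threshold is k are recorded as the k-th level, matching the flow above:

```python
def k_core_levels(adj):
    """K-core extraction: starting from k = 0, recursively delete nodes
    whose degree is smaller than k (together with their edges) and record
    them as the k-th level of the K-core graph, then increase k."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    levels, k = {}, 0
    while adj:
        removed = True
        while removed:  # recursive deletion: degrees drop as nodes leave
            removed = False
            for u in [u for u, vs in adj.items() if len(vs) < k]:
                levels[u] = k
                for v in adj[u]:
                    adj[v].discard(u)
                del adj[u]
                removed = True
        k += 1
    return levels

adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
levels = k_core_levels(adj)
print(levels)
# → {'d': 2, 'a': 3, 'b': 3, 'c': 3}
```

The pendant node "d" is shed one round earlier than the triangle a-b-c, so a higher level corresponds to a more densely associated candidate word, as the disclosure states.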
Optionally, in the step S5, calculating, based on the K-nucleograms, hierarchical features and average information entropy features of candidate words in the electronic contract, including:
calculating to obtain hierarchical characteristics and average information entropy characteristics of candidate words in the electronic contract based on the extracted K nuclear graph, wherein the electronic archiveCandidate words->The calculation flow of the hierarchical features and the average information entropy features is as follows:
acquiring electronic filesCandidate words->Hierarchical features of the corresponding node in the K kernel sub-graph +.>
Wherein:
representation of word graph model->Chinese and candidate words->The corresponding node has a node set of edges,e represents node set +.>Any node in (a);
representing the number of layers of the computing node in the K kernel subgraph;
acquiring electronic filesCandidate words->Average information entropy feature->
Wherein:
representing candidate words->In electronic files->The number of occurrences in the sentence-word segmentation processing result;
n represents the total number of electronic files;
representing candidate words->The number of occurrences in all electronic archive clause word segmentation processing results.
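One plausible reading of the average-information-entropy feature (an assumption: the Shannon entropy of the word's occurrence distribution over the N archives, divided by N; a word that repeats heavily inside few archives scores near zero and is filtered out):

```python
import math

def avg_entropy(word, corpus):
    """Assumed form of the average-information-entropy feature:
    Shannon entropy of the word's occurrence distribution across all
    archives, averaged by the number of archives N."""
    counts = [doc.count(word) for doc in corpus]
    total = sum(counts)
    if total == 0:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c)
    return h / len(corpus)

corpus = [["archive", "archive", "system"], ["system", "design"]]
print(avg_entropy("system", corpus), avg_entropy("archive", corpus))
```

"archive" occurs often but only in one archive, so its feature is 0, while "system", spread across archives, scores higher.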
Optionally, in the step S6, determining the electronic archive keyword by fusing the hierarchical feature and the average information entropy feature based on the initial score of the candidate word includes:
determining an electronic archive keyword based on the candidate word initial score by fusing hierarchical features and average information entropy features, wherein the electronic archiveThe determining procedure of the medium keywords is as follows:
computing electronic filesKeyword score of candidate words:
wherein:
representing electronic archive->Candidate words->Keyword scores of (2);
representing electronic archive->Candidate words->Is a score of the initial score of (a);
representing electronic archive->Candidate words->Is a fusion hierarchy feature of (1);
representing electronic archive->Candidate words->Is characterized by the average information entropy;
selecting candidate words with highest keyword score as an electronic file5 candidate words with the highest keyword score are selected as the electronic archive +.>Is a key word of (a).
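Assuming a multiplicative fusion of the three quantities (the patent's exact fusion formula is not recoverable from this text), keyword selection can be sketched as:

```python
def top_keywords(initial, level, entropy, n=5):
    """Fuse initial score, hierarchy feature and average-information-
    entropy feature into a keyword score (multiplicative fusion is an
    assumption) and keep the n highest-scoring candidates."""
    score = {w: initial[w] * level[w] * entropy[w] for w in initial}
    return sorted(score, key=score.get, reverse=True)[:n]

initial = {"archive": 0.48, "system": 0.26, "design": 0.26}
level = {"archive": 3, "system": 3, "design": 2}
entropy = {"archive": 0.35, "system": 0.35, "design": 0.1}
print(top_keywords(initial, level, entropy, n=2))
# → ['archive', 'system']
```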
Optionally, in the step S7, file classification is performed according to the electronic file keywords, and file quick retrieval based on the keywords is performed, including:
electronic archive based on candidate word pairs with highest keyword scoresClassifying and adding electronic files->Keywords as electronic archive->And (5) searching the electronic file rapidly by inquiring the search term.
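A minimal inverted-index sketch of the keyword-based classification and fast retrieval described in S7:

```python
from collections import defaultdict

def build_index(archive_keywords):
    """Group (classify) archives by their extracted keywords and use the
    same keywords as query terms in an inverted index for fast lookup."""
    index = defaultdict(set)
    for archive_id, keywords in archive_keywords.items():
        for kw in keywords:
            index[kw].add(archive_id)
    return index

index = build_index({"D1": ["archive", "system"], "D2": ["system", "design"]})
print(sorted(index["system"]))
# → ['D1', 'D2']
```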
In order to solve the above-described problems, the present invention provides an electronic apparatus including:
a memory storing at least one instruction;
a communication interface, configured to realize communication of the electronic device; and
a processor that executes the instructions stored in the memory to implement the above archive intelligent classification and retrieval method.
In order to solve the above-mentioned problems, the present invention further provides a computer readable storage medium, where at least one instruction is stored, where the at least one instruction is executed by a processor in an electronic device to implement the above-mentioned archive intelligent classification and retrieval method.
Compared with the prior art, the invention provides an intelligent file classifying and searching method, which has the following advantages:
firstly, the scheme provides a word importance quantization mode, and a probability transition matrix among candidate words in an electronic file is calculated according to the TFIDF characteristics of the electronic file and the position characteristics of the candidate words, wherein the electronic fileThe probability transition matrix calculation flow of (1) is as follows:
calculating to obtain electronic fileCandidate words->Jump to candidate word +.>Jump probability>
Wherein:Representing candidate words->Jump weights of (2); generating an electronic archive according to the probability of jumping>Is a probability transition matrix of (a):
wherein:Representing electronic archive->Candidate words->Probability transition matrices of (a);Representing electronic archive->Is a probability transition matrix of (a). Iterative calculation is carried out on the probability transition matrix to obtain an initial score of the candidate word, wherein the initial score is electronic archive +.>The initial score calculation flow of the candidate words is as follows: according to electronic files->Probability transition matrix->Generating an electronic archive->Weights of each candidate word in (a) wherein the candidate word is +.>The weight of (2) is:
wherein:Representing electronic archive->Candidate words->Weights of (2);Represents the damping coefficient, will->Set to 0.8; form an electronic file->Middle->Weight matrix of individual candidate words +.>
Setting the current iteration times of the weight matrix as t, and setting the initial value of t as 0, wherein the t-th iteration result of the weight matrix is +.>The method comprises the steps of carrying out a first treatment on the surface of the Iterating the weight matrix, wherein the iteration formula of the weight matrix is as follows:
if->If the iteration number is smaller than the preset iteration threshold value, the iteration is stopped, and +.>As the final weight matrix, the element value of the mth column of the final weight matrix is the electronic file +.>Candidate words->Is +.>. According to the scheme, the position distance between candidate words is obtained through calculation according to the sentence numbers and the interval word numbers of different candidate words in an electronic file, the edge relation between the candidate words is established based on comparison of the position distance and a window threshold value, a word graph model is obtained through construction, candidate word position features representing the position importance and the word frequency importance of the candidate words are generated, TFIDF features are calculated according to the generated features, the jump probability between the candidate words is obtained through calculation, a probability transition matrix is initialized, the importance weight of the candidate words is determined according to the iterative calculation result of the probability transition matrix, and the importance quantization of the candidate words is calculated as the initial score of the candidate words.
Meanwhile, the scheme provides a method for determining the keywords of the electronic file, which extracts the K nuclear map from the constructed word map model, wherein the word map modelThe extraction flow of the K nuclear map in the (2) is as follows: initializing k=0, initializing set +.>Word graph model->Store to the collection->And initialize the set +.>Set->Is empty; for the collection->Traversing all the word graph model nodes in the model, screening to obtain word graph model nodes with node degree smaller than k, and storing the screened word graph model nodes into a set ∈>As the kth level node in the K kernel graph; selecting nodes of the word graph model and edges connected with the nodes from the set +.>Performing medium recursion deletion; if->Less than K, let->. Calculating to obtain the hierarchical characteristics and average information entropy characteristics of candidate words in the electronic contract based on the extracted K nuclear graph, wherein the electronic archive +.>Candidate words->The calculation flow of the hierarchical features and the average information entropy features is as follows:
acquiring electronic filesCandidate words->Hierarchical features of the corresponding node in the K kernel sub-graph +.>
Wherein:Representation of word graph model->Chinese and candidate wordsNode set with edges corresponding to the nodes, +.>E represents node set +.>Any node in (a);representing the number of layers of the computing node in the K kernel subgraph; obtain electronic archive->Candidate words->Average information entropy feature->
Wherein:Representing candidate words->In electronic files->The number of occurrences in the sentence-word segmentation processing result; n represents the total number of electronic files;representing candidate words->The number of occurrences in all electronic archive clause word segmentation processing results. According to the scheme, the belonging level of the candidate word in the K kernel subgraph is determined according to the node degree of the candidate word in the word graph model, wherein the higher the node degree is, the higher the association degree of the candidate word and other candidate words is, the higher the belonging level is, the candidate words with more occurrence times and less information quantity are filtered out by utilizing the average information entropy feature, the determination of the keywords in the electronic archive is realized, the archive classification is carried out according to the electronic archive keywords, and the quick archive retrieval based on the keywords is supported.
Drawings
FIG. 1 is a flowchart of an intelligent classification and retrieval method for files according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an electronic device for implementing the intelligent classification and retrieval method for files according to an embodiment of the invention.
In the figure: 1 an electronic device, 10 a processor, 11 a memory, 12 a program, 13 a communication interface.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides an intelligent file classification and retrieval method. The execution subject of the archive intelligent classification and retrieval method includes, but is not limited to, at least one of a server, a terminal, and the like, which can be configured to execute the method provided by the embodiment of the application. In other words, the archive intelligent classification and retrieval method may be performed by software or hardware installed in a terminal device or a server device, where the software may be a blockchain platform. The service end includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Example 1:
S1: construct the electronic archive as a word-graph model, and calculate the TFIDF feature and the candidate-word position feature of each candidate word in the electronic archive according to the word-graph model.
In step S1, the electronic archive is constructed into a word-graph model through the following steps:
constructing each electronic archive as a word-graph model, where the construction flow for the i-th electronic archive D_i is as follows:
S11: perform sentence segmentation on electronic archive D_i, with result D_i = {s_{i,1}, s_{i,2}, ..., s_{i,J_i}},
where:
s_{i,j} represents the word-segmentation result of the j-th sentence in D_i, and J_i represents the number of sentences in D_i;
s_{i,j} = {w_{j,1}, w_{j,2}, ..., w_{j,n_j}}, where w_{j,u} represents the u-th word in s_{i,j} and n_j represents the total number of words in s_{i,j};
S12: extract the nouns from the sentence-and-word segmentation result of D_i as the candidate words of D_i; the de-duplicated candidate word set of D_i is word_i = {word_{i,1}, word_{i,2}, ..., word_{i,m_i}},
where:
word_{i,g} represents the g-th candidate word of D_i, and m_i represents the number of candidate words of D_i;
S13: take the candidate words as the nodes of the word-graph model and calculate the position distance between different nodes; if the position distance between two nodes is smaller than a preset window threshold, an edge exists between the two nodes, otherwise no edge exists. The position distance between candidate words word_{i,g} and word_{i,h} is dist(word_{i,g}, word_{i,h}), computed as an exponential function with the natural constant e as its base of the number of words separating the two candidate words, adjusted by the numbers of sentences in which each candidate word appears,
where:
dist(word_{i,g}, word_{i,h}) represents the position distance between word_{i,g} and word_{i,h}, g, h ∈ {1, 2, ..., m_i};
d(word_{i,g}, word_{i,h}) represents the number of words between word_{i,g} and word_{i,h} in electronic archive D_i;
num(word_{i,g}) represents the number of times word_{i,g} occurs in the sentence-and-word segmentation result of D_i;
sen(word_{i,g}) represents the number of sentences of D_i in which word_{i,g} appears, sen(word_{i,g}) ≤ num(word_{i,g}).
S14: form the word-graph model from the nodes and the edge information between nodes; the word-graph model of electronic archive D_i is G_i = (V_i, E_i),
where:
G_i represents the word-graph model of D_i, containing the node information V_i of the m_i candidate words and the edge information E_i between candidate-word nodes;
E_i represents the edge information between different candidate words in the word-graph model;
if dist(word_{i,g}, word_{i,h}) is smaller than the preset window threshold, an edge between candidate words word_{i,g} and word_{i,h} exists in the word-graph model; otherwise no edge exists between them.
In step S1, calculating the TFIDF feature and the candidate-word position feature of each candidate word in the electronic archive according to the word-graph model comprises the following steps:
calculating the TFIDF feature and the candidate-word position feature of each candidate word from the constructed word-graph model, where for the g-th candidate word word_{i,g} of electronic archive D_i the TFIDF feature is

TFIDF(word_{i,g}) = (num(word_{i,g}) / num_i) × log(N / N(word_{i,g}))

where:
num(word_{i,g}) represents the number of times word_{i,g} occurs in the sentence-and-word segmentation result of D_i;
num_i represents the total number of words in the sentence-and-word segmentation result of D_i;
N represents the total number of electronic archives;
N(word_{i,g}) represents the number of electronic archives in which word_{i,g} appears;
TFIDF(word_{i,g}) represents the TFIDF feature of word_{i,g};
pos(word_{i,g}) represents the candidate-word position feature of word_{i,g}, which characterizes the positional importance and word-frequency importance of word_{i,g} in D_i.
S2: calculate the probability transition matrix among the candidate words in the electronic archive according to the TFIDF features and the candidate-word position features.
In step S2, calculating the probability transition matrix among the candidate words in the electronic archive comprises:
calculating the probability transition matrix among the candidate words in the electronic archive from the computed TFIDF features and candidate-word position features, where the calculation flow for electronic archive D_i is as follows:
compute the probability p_{g,h} that candidate word word_{i,g} jumps to candidate word word_{i,h}:

p_{g,h} = q(word_{i,h}) / Σ_{word_{i,v} ∈ adj(word_{i,g})} q(word_{i,v}) if an edge exists between word_{i,g} and word_{i,h}, and p_{g,h} = 0 otherwise,

where:
q(word_{i,h}) represents the jump weight of candidate word word_{i,h}, obtained from its TFIDF feature and candidate-word position feature, and adj(word_{i,g}) represents the set of candidate words connected to word_{i,g} by an edge in the word-graph model;
generate the probability transition matrix of electronic archive D_i from the jump probabilities:

P_i = [p_{g,h}], g, h ∈ {1, 2, ..., m_i},

where:
P_{i,g} = (p_{g,1}, p_{g,2}, ..., p_{g,m_i}) represents the transition probability vector of candidate word word_{i,g};
P_i represents the probability transition matrix of electronic archive D_i.
S3: iterate the probability transition matrix to obtain the initial score of each candidate word.
And in the step S3, iterative computation is carried out on the probability transition matrix to obtain an initial score of the candidate word, and the method comprises the following steps:
iterative computation is carried out on the probability transition matrix to obtain initial scores of candidate words, wherein the initial scores are electronic filesThe initial score calculation flow of the candidate words is as follows:
s31: according to the electronic fileProbability transition matrix->Generating an electronic archive->Weights of each candidate word in (a) wherein the candidate word is +.>The weight of (2) is:
wherein:
representing electronic archive->Candidate words->Weights of (2);
represents the damping coefficient, will->Set to 0.8;
s32: composing an electronic fileMiddle->Weight matrix of individual candidate words +.>
S33: setting the current iteration times of the weight matrix as t, and setting the initial value of t as 0, wherein the t-th iteration result of the weight matrix is that
S34: iterating the weight matrix, wherein the iteration formula of the weight matrix is as follows:
if->If the iteration number is smaller than the preset iteration threshold value, the iteration is stopped, and +.>As the final weight matrix, the element value of the mth column of the final weight matrix is the electronic file +.>Candidate words->Is +.>
No order of noThe process returns to step S34.
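Steps S31 to S34 amount to a damped PageRank/TextRank-style iteration. A compact, illustrative sketch follows, with the damping fixed at 0.8 as in the text and a simple maximum-difference stopping test standing in for the unspecified threshold check:

```python
def initial_scores(order, P, d=0.8, tol=1e-6, max_iter=100):
    # Iterate s <- (1-d)/M + d * P^T s until the largest per-word
    # change falls below `tol`, then report a score per word.
    m = len(order)
    s = [1.0 / m] * m
    for _ in range(max_iter):
        nxt = [(1 - d) / m + d * sum(P[j][i] * s[j] for j in range(m))
               for i in range(m)]
        done = max(abs(a - b) for a, b in zip(nxt, s)) < tol
        s = nxt
        if done:
            break
    return dict(zip(order, s))
```

Because the transition matrix is row-stochastic, the damped update keeps the total score mass stable, so the iteration converges for any starting vector.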
S4: extracting a K-core graph from the constructed word graph model.
In step S4, extracting the K-core graph from the word graph model comprises the following steps:
extracting the K-core graph from the constructed word graph model, wherein the extraction flow of the K-core graph from word graph model G_i is as follows:
S41: initializing k = 0, initializing a set C by storing word graph model G_i into set C, and initializing a set H_k to be empty;
S42: traversing all word graph model nodes in set C, screening out the word graph model nodes whose node degree is smaller than k, and storing the screened word graph model nodes into set H_k as the k-th level nodes in the K-core graph;
recursively deleting the screened word graph model nodes and the edges connected to them from set C;
S43: if k is less than K, letting k = k + 1 and returning to step S42.
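The peeling procedure of S41 to S43 is the classic k-core decomposition. A stdlib-only sketch is given below; naming is illustrative, and note that a node removed while testing degree < k is recorded at level k (its conventional coreness would be k - 1):

```python
def core_levels(nodes, edges, k_max):
    # For k = 0, 1, ..., k_max: repeatedly delete nodes of degree < k
    # (and their edges); record the k at which each node fell out.
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    level = {}
    for k in range(k_max + 1):
        while True:
            doomed = [v for v in adj if len(adj[v]) < k]
            if not doomed:
                break
            for v in doomed:
                level[v] = k
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
        if not adj:
            break
    return level
```

Words buried deep in the peeling (high level) sit in densely connected regions of the word graph, which is why the level is usable as a feature in step S5.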
S5: calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph.
In step S5, calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the K-core graph comprises the following steps:
calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph, wherein the calculation flow of the hierarchical feature and average information entropy feature of candidate word w_m in electronic archive d_i is as follows:
acquiring the hierarchical feature h_m of the node corresponding to candidate word w_m of electronic archive d_i in the K-core subgraph;
wherein:
E_m represents the set of nodes that share an edge with the node corresponding to candidate word w_m in word graph model G_i, and e represents any node in E_m;
layer(·) represents the function computing the layer number of a node in the K-core subgraph;
acquiring the average information entropy feature ent_m of candidate word w_m in electronic archive d_i;
wherein:
c_m represents the number of occurrences of candidate word w_m in the sentence-segmentation result of electronic archive d_i;
n represents the total number of electronic archives;
C_m represents the number of occurrences of candidate word w_m in the sentence-segmentation results of all electronic archives.
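The average information entropy feature measures how a candidate word's occurrences spread across the n archives. The exact formula in the text survives only as an image, so the sketch below uses a standard Shannon entropy over the per-archive occurrence distribution, normalised by the number of archives — an assumption for illustration, not the patent's literal expression:

```python
import math

def avg_entropy_feature(counts_per_archive, word):
    # counts_per_archive: one {word: count} dict per electronic archive.
    # Entropy of the word's occurrence distribution across archives,
    # divided by the number of archives.
    counts = [c.get(word, 0) for c in counts_per_archive]
    total = sum(counts)
    if total == 0:
        return 0.0
    ent = -sum((c / total) * math.log(c / total) for c in counts if c)
    return ent / len(counts_per_archive)
```

Under this reading, a word spread evenly over many archives scores a higher entropy than one concentrated in a single archive.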
S6: determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the initial scores of the candidate words.
In step S6, determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the candidate word initial scores comprises the following steps:
determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the candidate word initial scores, wherein the determination flow of the keywords in electronic archive d_i is as follows:
calculating the keyword score of each candidate word of electronic archive d_i;
wherein:
ks_m represents the keyword score of candidate word w_m in electronic archive d_i;
score_m represents the initial score of candidate word w_m in electronic archive d_i;
h_m represents the fused hierarchical feature of candidate word w_m in electronic archive d_i;
ent_m represents the average information entropy feature of candidate word w_m in electronic archive d_i;
selecting the candidate words with the highest keyword scores as the keywords of electronic archive d_i, namely selecting the 5 candidate words with the highest keyword scores as the keywords of electronic archive d_i.
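Fusing the three signals and keeping the five best-scoring words can be sketched as below. The unweighted sum is an illustrative assumption, since the fusion formula itself appears only as an image in the source:

```python
def top_keywords(initial, hier, ent, k=5):
    # Keyword score = initial score + hierarchical feature
    #                + average information entropy feature.
    score = {w: initial[w] + hier.get(w, 0.0) + ent.get(w, 0.0)
             for w in initial}
    ranked = sorted(score, key=score.get, reverse=True)
    return ranked[:k]
```

In practice the three terms would likely carry tunable coefficients so that no single feature dominates the ranking.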
S7: performing archive classification according to the electronic archive keywords and supporting quick keyword-based archive retrieval.
In step S7, performing archive classification according to the electronic archive keywords and performing quick keyword-based archive retrieval comprises:
classifying electronic archive d_i based on the candidate word with the highest keyword score, and adding the keywords of electronic archive d_i as query search terms of electronic archive d_i to realize quick retrieval of the electronic archive.
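A minimal sketch of step S7: file each archive under its top keyword and serve keyword queries from an inverted index. The class and method names are illustrative, not from the patent:

```python
from collections import defaultdict

class ArchiveIndex:
    def __init__(self):
        self.classes = defaultdict(list)   # top keyword -> archive ids
        self.inverted = defaultdict(set)   # any keyword -> archive ids

    def add(self, archive_id, keywords):
        # keywords: ranked list, best first; classify by the top one.
        if keywords:
            self.classes[keywords[0]].append(archive_id)
        for w in keywords:
            self.inverted[w].add(archive_id)

    def search(self, query_terms):
        # Return archives labelled with every queried keyword.
        hits = [self.inverted[t] for t in query_terms if t in self.inverted]
        return set.intersection(*hits) if hits else set()
```

Because lookups go through the inverted index rather than a scan of archive contents, retrieval cost depends only on the number of archives sharing the queried keywords.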
Example 2:
fig. 2 is a schematic structural diagram of an electronic device for implementing an intelligent classification and retrieval method for files according to an embodiment of the invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication interface 13 and a bus, and may further comprise a computer program, such as program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memories (e.g., SD or DX memory), magnetic memories, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the program 12, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a control unit (control unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules (programs 12 for implementing intelligent classification and retrieval of files, etc.) stored in the memory 11, and calling data stored in the memory 11.
The communication interface 13 may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the electronic device 1 and other electronic devices and to enable connection communication between internal components of the electronic device.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 2 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the described embodiments are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
constructing the electronic archive as a word graph model, and calculating TFIDF characteristics and candidate word position characteristics of each candidate word in the electronic archive according to the word graph model;
calculating a probability transition matrix among candidate words in the electronic file according to the TFIDF characteristics of the electronic file and the position characteristics of the candidate words;
performing iterative computation on the probability transition matrix to obtain an initial score of the candidate word;
extracting a K-core graph from the constructed word graph model;
calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph;
on the basis of the initial score of the candidate word, determining an electronic archive keyword by fusing the hierarchical characteristics and the average information entropy characteristics;
and classifying files according to the electronic file keywords and supporting quick file retrieval based on the keywords.
Specifically, the specific implementation method of the above instruction by the processor 10 may refer to descriptions of related steps in the corresponding embodiments of fig. 1 to 2, which are not repeated herein.
It should be noted that, the foregoing reference numerals of the embodiments of the present invention are merely for describing the embodiments, and do not represent the advantages and disadvantages of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (7)

1. An intelligent classification and retrieval method for files, characterized by comprising the following steps:
s1: constructing the electronic archive as a word graph model, and calculating the TFIDF feature and candidate word position feature of each candidate word in the electronic archive according to the word graph model;
s2: calculating a probability transition matrix among the candidate words in the electronic archive according to the TFIDF features and the candidate word position features of the electronic archive;
s3: performing iterative computation on the probability transition matrix to obtain the initial scores of the candidate words;
s4: extracting a K-core graph from the constructed word graph model;
s5: calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph;
s6: determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the initial scores of the candidate words;
s7: performing archive classification according to the electronic archive keywords and supporting quick keyword-based archive retrieval;
in the step S1, constructing the electronic archive as a word graph model comprises the following steps:
constructing the electronic archive as a word graph model, wherein the word graph model construction flow for the i-th electronic archive d_i is as follows:
s11: performing sentence segmentation and word segmentation on electronic archive d_i, wherein the processing result is as follows:
wherein:
S_j represents the word segmentation result of the j-th sentence in electronic archive d_i, and J represents the number of sentences in electronic archive d_i;
w_j^u represents the u-th word in word segmentation result S_j, and U_j represents the total number of words in word segmentation result S_j;
s12: extracting nouns from the sentence/word segmentation results of electronic archive d_i as candidate words of electronic archive d_i, wherein the de-duplicated candidate word set of electronic archive d_i is:
wherein:
w_m represents the m-th candidate word in electronic archive d_i, and M represents the number of candidate words of electronic archive d_i;
s13: taking the candidate words as nodes of the word graph model and calculating the position distance between different nodes; if the position distance between two nodes is smaller than a preset window threshold, an edge exists between the two nodes, otherwise no edge exists, wherein the position distance between candidate words w_m and w_n is:
wherein:
exp(·) represents an exponential function with the natural constant as its base;
dis(w_m, w_n) represents the position distance between candidate words w_m and w_n, with m ≠ n;
num(w_m, w_n) represents the number of words between candidate words w_m and w_n in electronic archive d_i;
c_m represents the number of occurrences of candidate word w_m in the sentence-segmentation result of electronic archive d_i; sen_m represents the number of sentences in the sentence-segmentation result of electronic archive d_i in which candidate word w_m appears;
s14: composing the nodes and the edge information between the nodes into the word graph model, the word graph model of electronic archive d_i being:
wherein:
G_i represents the word graph model of electronic archive d_i, which contains the M candidate word nodes and the edge information between the candidate word nodes;
E represents the edge information between different candidate words in the word graph model;
if dis(w_m, w_n) is smaller than the preset window threshold, an edge exists between candidate words w_m and w_n in the word graph model; otherwise, no edge exists between candidate words w_m and w_n in the word graph model;
in the step S1, calculating the TFIDF feature and candidate word position feature of each candidate word in the electronic archive according to the word graph model comprises the following steps:
calculating the TFIDF feature and candidate word position feature of each candidate word in the electronic archive according to the constructed word graph model, wherein the calculation formulas of the TFIDF feature and candidate word position feature of the m-th candidate word w_m in electronic archive d_i are:
wherein:
c_m represents the number of occurrences of candidate word w_m in the sentence-segmentation result of electronic archive d_i;
L_i represents the total number of words in the sentence-segmentation result of electronic archive d_i;
n represents the total number of electronic archives;
n_m represents the number of electronic archives that contain candidate word w_m;
tfidf_m represents the TFIDF feature of candidate word w_m;
loc_m represents the candidate word position feature of candidate word w_m.
2. The intelligent archive classification and retrieval method according to claim 1, wherein the step S2 of calculating the probability transition matrix among the candidate words in the electronic archive comprises:
calculating the probability transition matrix among the candidate words in the electronic archive according to the calculated TFIDF features and candidate word position features, wherein the calculation flow of the probability transition matrix of electronic archive d_i is as follows:
calculating the jump probability p(w_m, w_n) of jumping from candidate word w_m to candidate word w_n in electronic archive d_i;
wherein:
q_n represents the jump weight of candidate word w_n;
generating the probability transition matrix of electronic archive d_i according to the jump probabilities:
wherein:
P_m represents the transition-probability row vector of candidate word w_m in electronic archive d_i;
P_i represents the probability transition matrix of electronic archive d_i.
3. The intelligent archive classification and retrieval method according to claim 2, wherein in step S3, performing iterative computation on the probability transition matrix to obtain the initial scores of the candidate words comprises:
performing iterative computation on the probability transition matrix to obtain the initial scores of the candidate words, wherein the initial score calculation flow for the candidate words of electronic archive d_i is as follows:
s31: generating a weight for each candidate word in electronic archive d_i according to the probability transition matrix P_i, wherein the weight of candidate word w_m is:
wherein:
s_m represents the weight of candidate word w_m in electronic archive d_i;
d represents the damping coefficient, with d set to 0.8;
s32: composing the weights of the M candidate words in electronic archive d_i into a weight matrix S;
s33: setting the current iteration number of the weight matrix to t, with the initial value of t set to 0, wherein the result of the t-th iteration of the weight matrix is S_t;
s34: iterating the weight matrix, wherein the iteration formula updates the weight matrix through the damped probability transition matrix, i.e. S_{t+1} = (1-d)/M + d * P_i^T * S_t;
if the difference between S_{t+1} and S_t is smaller than a preset iteration threshold, the iteration is stopped and S_{t+1} is taken as the final weight matrix; the element value in the m-th column of the final weight matrix is the initial score score_m of candidate word w_m in electronic archive d_i;
otherwise, let t = t + 1 and return to step S34.
4. The intelligent archive classification and retrieval method according to claim 1, wherein the step S4 of extracting the K-core graph from the word graph model comprises:
extracting the K-core graph from the constructed word graph model, wherein the extraction flow of the K-core graph from word graph model G_i is as follows:
s41: initializing k = 0, initializing a set C by storing word graph model G_i into set C, and initializing a set H_k to be empty;
s42: traversing all word graph model nodes in set C, screening out the word graph model nodes whose node degree is smaller than k, and storing the screened word graph model nodes into set H_k as the k-th level nodes in the K-core graph;
recursively deleting the screened word graph model nodes and the edges connected to them from set C;
s43: if k is less than K, letting k = k + 1 and returning to step S42.
5. The intelligent archive classification and retrieval method according to claim 4, wherein in step S5, calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the K-core graph comprises the following steps:
calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph, wherein the calculation flow of the hierarchical feature and average information entropy feature of candidate word w_m in electronic archive d_i is as follows:
acquiring the hierarchical feature h_m of the node corresponding to candidate word w_m of electronic archive d_i in the K-core subgraph;
wherein:
E_m represents the set of nodes that share an edge with the node corresponding to candidate word w_m in word graph model G_i, and e represents any node in E_m;
layer(·) represents the function computing the layer number of a node in the K-core subgraph;
acquiring the average information entropy feature ent_m of candidate word w_m in electronic archive d_i;
wherein:
c_m represents the number of occurrences of candidate word w_m in the sentence-segmentation result of electronic archive d_i;
n represents the total number of electronic archives;
C_m represents the number of occurrences of candidate word w_m in the sentence-segmentation results of all electronic archives.
6. The intelligent archive classification and retrieval method according to claim 1, wherein in step S6, determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the initial scores of the candidate words comprises:
determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the candidate word initial scores, wherein the determination flow of the keywords in electronic archive d_i is as follows:
calculating the keyword score of each candidate word of electronic archive d_i;
wherein:
ks_m represents the keyword score of candidate word w_m in electronic archive d_i;
score_m represents the initial score of candidate word w_m in electronic archive d_i;
h_m represents the fused hierarchical feature of candidate word w_m in electronic archive d_i;
ent_m represents the average information entropy feature of candidate word w_m in electronic archive d_i;
selecting the candidate words with the highest keyword scores as the keywords of electronic archive d_i, namely selecting the 5 candidate words with the highest keyword scores as the keywords of electronic archive d_i.
7. The intelligent archive classification and retrieval method according to claim 6, wherein in step S7, performing archive classification according to the electronic archive keywords and performing quick keyword-based archive retrieval comprises:
classifying electronic archive d_i based on the candidate word with the highest keyword score, and adding the keywords of electronic archive d_i as query search terms of electronic archive d_i to realize quick retrieval of the electronic archive.
CN202311204538.XA 2023-09-19 2023-09-19 Intelligent classification and retrieval method for files Active CN117216217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311204538.XA CN117216217B (en) 2023-09-19 2023-09-19 Intelligent classification and retrieval method for files

Publications (2)

Publication Number Publication Date
CN117216217A CN117216217A (en) 2023-12-12
CN117216217B true CN117216217B (en) 2024-03-22

Family

ID=89038572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311204538.XA Active CN117216217B (en) 2023-09-19 2023-09-19 Intelligent classification and retrieval method for files

Country Status (1)

Country Link
CN (1) CN117216217B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945228A (en) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation
CN111079145A (en) * 2019-12-04 2020-04-28 中南大学 Malicious program detection method based on graph processing
CN111460796A (en) * 2020-03-30 2020-07-28 北京航空航天大学 Accidental sensitive word discovery method based on word network
CN113656429A (en) * 2021-07-28 2021-11-16 广州荔支网络技术有限公司 Keyword extraction method and device, computer equipment and storage medium
CN113822072A (en) * 2021-09-24 2021-12-21 广州博冠信息科技有限公司 Keyword extraction method and device and electronic equipment
CN116089620A (en) * 2023-04-07 2023-05-09 日照蓝鸥信息科技有限公司 Electronic archive data management method and system
CN116662479A (en) * 2023-04-27 2023-08-29 浙江工业大学 Text matching method for medical insurance catalogs

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10366117B2 (en) * 2011-12-16 2019-07-30 Sas Institute Inc. Computer-implemented systems and methods for taxonomy development
US20210326525A1 (en) * 2020-04-16 2021-10-21 Pusan National University Industry-University Cooperation Foundation Device and method for correcting context sensitive spelling error using masked language model

Non-Patent Citations (2)

Title
Rudolf A. Braun et al. "A Comparison of Methods for OOV-Word Recognition on a New Public Dataset". ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1-4. *
Zhang Jie et al. "A Comparative Analysis of Domestic and Foreign Mobile Learning Research Hotspots Based on Word Frequency Analysis and Visualized Co-word Network Diagrams". Modern Distance Education. 2014, pp. 76-83. *

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
WO2022121171A1 (en) Similar text matching method and apparatus, and electronic device and computer storage medium
CN109446517B (en) Reference resolution method, electronic device and computer readable storage medium
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
WO2021052148A1 (en) Contract sensitive word checking method and apparatus based on artificial intelligence, computer device, and storage medium
CN113095076A (en) Sensitive word recognition method and device, electronic equipment and storage medium
WO2021051864A1 (en) Dictionary expansion method and apparatus, electronic device and storage medium
CN112633000B (en) Method and device for associating entities in text, electronic equipment and storage medium
CN114780746A (en) Knowledge graph-based document retrieval method and related equipment thereof
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN113660541B (en) Method and device for generating abstract of news video
CN112883730A (en) Similar text matching method and device, electronic equipment and storage medium
CN116797195A (en) Work order processing method, apparatus, computer device, and computer readable storage medium
CN113094538A (en) Image retrieval method, device and computer-readable storage medium
CN114461783A (en) Keyword generation method and device, computer equipment, storage medium and product
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN112579781B (en) Text classification method, device, electronic equipment and medium
CN112529743B (en) Contract element extraction method, device, electronic equipment and medium
CN117216217B (en) Intelligent classification and retrieval method for files
CN114969385B (en) Knowledge graph optimization method and device based on document attribute assignment entity weight
CN114416990B (en) Method and device for constructing object relation network and electronic equipment
CN113705201B (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN114741550A (en) Image searching method and device, electronic equipment and computer readable storage medium
CN114996400A (en) Referee document processing method and device, electronic equipment and storage medium
CN114416174A (en) Model reconstruction method and device based on metadata, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant