CN117216217B - Intelligent classification and retrieval method for files - Google Patents

Intelligent classification and retrieval method for files

Info

Publication number
CN117216217B
CN117216217B · Application CN202311204538.XA
Authority
CN
China
Prior art keywords
word
electronic
candidate
archive
candidate words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311204538.XA
Other languages
Chinese (zh)
Other versions
CN117216217A (en)
Inventor
Guo Xuejiao (郭雪娇)
Current Assignee
Shandong Huishangmai Network Technology Co ltd
Original Assignee
Shandong Huishangmai Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shandong Huishangmai Network Technology Co., Ltd.
Priority to CN202311204538.XA
Publication of CN117216217A
Application granted
Publication of CN117216217B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of classified retrieval and discloses an intelligent archive classification and retrieval method comprising the following steps: construct the electronic archive as a word-graph model and, from that model, compute the TFIDF feature and the position feature of each candidate word in the electronic archive; compute a probability transition matrix among the candidate words in the electronic archive; iterate the probability transition matrix to obtain an initial score for each candidate word; extract the K-core graph from the word-graph model and, based on it, compute the hierarchy feature and the average-information-entropy feature of each candidate word; on the basis of the initial scores, determine the keywords of the electronic archive by fusing the hierarchy features and the average-information-entropy features; and classify the archives according to their keywords while supporting fast keyword-based retrieval. The invention determines the keywords of an electronic archive from the positional importance, word-frequency importance, degree of association with other candidate words, and information content of the candidate words, thereby realizing classification and retrieval of electronic archives.

Description

Intelligent classification and retrieval method for files
Technical Field
The invention relates to the technical field of classified retrieval, and in particular to an intelligent archive classification and retrieval method.
Background
With the rapid development of information technology, the degree of informatization in colleges and universities has greatly improved and the number of their electronic archives has expanded rapidly, so electronic archive management in colleges and universities faces both great challenges and opportunities and requires more efficient and intelligent management modes to meet increasingly complex demands. Traditional university archive management is mainly manual: it is difficult to guarantee the security and integrity of the information, errors occur easily, retrieval is difficult, and the approach cannot satisfy the modern requirements of university informatization construction, so its shortcomings are increasingly prominent and an intelligent, automated electronic archive management method is needed. Aiming at this problem, the invention provides an intelligent archive classification and retrieval method that realizes intelligent understanding and adaptive classified retrieval of electronic archives by means of artificial-intelligence techniques, improving the intelligent management level of electronic archives.
Disclosure of Invention
In view of this, the present invention provides an intelligent archive classification and retrieval method with the following aims: 1) calculate the position distance between candidate words from the sentence numbers and the numbers of separating words of the candidate words in an electronic archive; establish edge relations between candidate words by comparing the position distance with a window threshold, thereby constructing a word-graph model; generate candidate-word position features characterizing positional importance and word-frequency importance, together with TFIDF features; calculate the jump probabilities between candidate words from the generated features to initialize a probability transition matrix; and determine the importance weight of each candidate word from the iterative calculation of the probability transition matrix, which serves as the candidate word's initial score, so that candidate-word importance is quantified; 2) determine the level of each candidate word in the K-core subgraph according to its node degree in the word-graph model, where a higher node degree means a stronger association of the candidate word with other candidate words and therefore a higher level; use the average-information-entropy feature to filter out candidate words that occur frequently but carry little information, thereby determining the keywords of the electronic archive; and classify the archives according to their keywords while supporting fast keyword-based retrieval.
The invention provides an intelligent archive classification and retrieval method, which comprises the following steps:
S1: construct the electronic archive as a word-graph model, and calculate the TFIDF feature and the candidate-word position feature of each candidate word in the electronic archive according to the word-graph model;
S2: calculate the probability transition matrix among the candidate words in the electronic archive according to the TFIDF features and the candidate-word position features;
S3: iterate the probability transition matrix to obtain the initial score of each candidate word;
S4: extract the K-core graph from the constructed word-graph model;
S5: calculate the hierarchy feature and the average-information-entropy feature of the candidate words in the electronic archive based on the extracted K-core graph;
S6: on the basis of the initial scores of the candidate words, determine the electronic-archive keywords by fusing the hierarchy features and the average-information-entropy features;
S7: classify the archives according to the electronic-archive keywords and support fast keyword-based retrieval of the archives.
As a further improvement of the present invention:
Optionally, in step S1, constructing the electronic archive as a word-graph model includes:
constructing each electronic archive as a word-graph model, where the construction flow for the i-th electronic archive D_i is as follows:
S11: perform sentence segmentation on electronic archive D_i, with result D_i = {s_{i,1}, s_{i,2}, ..., s_{i,J_i}},
where:
s_{i,j} represents the word-segmentation result of the j-th sentence in D_i, and J_i represents the number of sentences in D_i;
s_{i,j} = {w_{j,1}, w_{j,2}, ..., w_{j,n_j}}, where w_{j,u} represents the u-th word in s_{i,j} and n_j represents the total number of words in s_{i,j};
In the embodiment of the invention, punctuation marks are used to split the electronic archive into sentences, the Jieba word-segmentation tool is used to segment each sentence into words, and stop words (modal particles, prepositions, and the like) are removed from the segmentation result to obtain the sentence-and-word segmentation result;
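As an illustrative sketch of this segmentation step: the embodiment splits sentences on punctuation and tokenizes with Jieba for Chinese; here a hypothetical stop-word list and a regex tokenizer stand in so the example is self-contained.

```python
import re

# Hypothetical stop-word list standing in for the modal particles and
# prepositions removed in the embodiment; real use would load a full list
# and tokenize with jieba.lcut for Chinese text.
STOP_WORDS = {"the", "a", "of", "in", "is", "and", "to"}

def segment_archive(text):
    """Split an archive into sentences on punctuation, tokenize each
    sentence, and drop stop words."""
    sentences = [s for s in re.split(r"[.!?;。！？；]+", text) if s.strip()]
    return [[w for w in re.findall(r"\w+", s.lower()) if w not in STOP_WORDS]
            for s in sentences]

doc = "The archive system stores records. Records describe the archive."
print(segment_archive(doc))
# → [['archive', 'system', 'stores', 'records'], ['records', 'describe', 'archive']]
```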
S12: extract the nouns from the sentence-and-word segmentation result of D_i as the candidate words of D_i; the de-duplicated candidate word set of D_i is word_i = {word_{i,1}, word_{i,2}, ..., word_{i,m_i}},
where:
word_{i,g} represents the g-th candidate word of D_i, and m_i represents the number of candidate words of D_i;
S13: take the candidate words as the nodes of the word-graph model and calculate the position distance between different nodes; if the position distance between two nodes is smaller than a preset window threshold, an edge exists between the two nodes, otherwise no edge exists. The position distance between candidate words word_{i,g} and word_{i,h} is dist(word_{i,g}, word_{i,h}), computed as an exponential function with the natural constant e as its base of the number of words separating the two candidate words, adjusted by the numbers of sentences in which each candidate word appears,
where:
dist(word_{i,g}, word_{i,h}) represents the position distance between word_{i,g} and word_{i,h}, g, h ∈ {1, 2, ..., m_i};
d(word_{i,g}, word_{i,h}) represents the number of words between word_{i,g} and word_{i,h} in electronic archive D_i;
num(word_{i,g}) represents the number of times word_{i,g} occurs in the sentence-and-word segmentation result of D_i;
sen(word_{i,g}) represents the number of sentences of D_i in which word_{i,g} appears, sen(word_{i,g}) ≤ num(word_{i,g}).
In the embodiment of the invention, because the same candidate word may occur repeatedly at different positions in the electronic archive, the pair of occurrence positions with the minimum position distance is selected and the position distance of the two candidate words is calculated from it;
S14: form the word-graph model from the nodes and the edge information between nodes; the word-graph model of electronic archive D_i is G_i = (V_i, E_i),
where:
G_i represents the word-graph model of D_i, containing the node information V_i of the m_i candidate words and the edge information E_i between candidate-word nodes;
E_i represents the edge information between different candidate words in the word-graph model;
if dist(word_{i,g}, word_{i,h}) is smaller than the preset window threshold, an edge between candidate words word_{i,g} and word_{i,h} exists in the word-graph model; otherwise no edge exists between them.
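Under the simplifying assumption that the position distance is the raw token gap between the nearest occurrences of two candidates (the patent additionally applies an exponential transform before comparing against the window threshold), the word-graph construction of S13–S14 can be sketched as:

```python
from itertools import combinations

def build_word_graph(sentences, nouns, window=3):
    """Build an undirected word graph: nodes are candidate words (nouns);
    an edge joins two candidates whose nearest occurrences are separated
    by fewer than `window` tokens. The raw gap is used here as a
    simplifying assumption in place of the patent's exponential distance."""
    tokens = [w for s in sentences for w in s]
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in nouns}
    edges = set()
    for a, b in combinations(sorted(nouns), 2):
        if not positions[a] or not positions[b]:
            continue
        # minimum number of words strictly between the two occurrences
        gap = min(abs(i - j) for i in positions[a] for j in positions[b]) - 1
        if gap < window:
            edges.add((a, b))
    return {"nodes": set(nouns), "edges": edges}

g = build_word_graph([["archive", "system", "stores", "records"]],
                     {"archive", "system", "records"})
print(sorted(g["edges"]))
```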
Optionally, in the step S1, calculating TFIDF features and candidate word position features of each candidate word in the electronic archive according to the word graph model includes:
calculating TFIDF characteristics and candidate word position characteristics of each candidate word in the electronic archive according to the constructed word graph model, wherein the electronic archiveMiddle->Candidate words->The TFIDF feature and candidate word position feature calculation formula is:
wherein:
representing candidate words->In electronic files->The number of occurrences in the sentence-word segmentation processing result;
representing electronic archive->The total number of words in the sentence segmentation processing result; in an embodiment of the present invention, in the present invention,
n represents the total number of electronic files;
indicating the presence of candidate words->Electronic archive part of (2);
representing candidate words->TFIDF characteristics of (a);
representing candidate words->Is a candidate word location feature of (a).
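A minimal sketch of the TFIDF feature as defined above (occurrence count over total words, times the log of total archives over archives containing the word); the candidate-word position feature is not reproduced here because its exact formula is specific to the patent:

```python
import math

def tfidf(word, doc_tokens, corpus):
    """TF-IDF feature: occurrences of `word` in the archive over its total
    word count, times log(total archives / archives containing the word)."""
    tf = doc_tokens.count(word) / len(doc_tokens)
    df = sum(1 for d in corpus if word in d)  # archives containing the word
    return tf * math.log(len(corpus) / df)

corpus = [["archive", "system", "archive"], ["system", "design"]]
print(round(tfidf("archive", corpus[0], corpus), 4))
# → 0.4621
```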
Optionally, the step S2 of calculating a probability transition matrix between candidate words in the electronic archive includes:
calculating a probability transition matrix between candidate words in the electronic archive according to the calculated TFIDF characteristics of the electronic archive and the candidate word position characteristics, wherein the electronic archiveThe probability transition matrix calculation flow of (1) is as follows:
calculating to obtain electronic fileCandidate words->Jump to candidate word +.>Jump probability>
Wherein:
representing candidate words->Jump weights of (2);
generating an electronic archive according to the jump probabilityIs a probability transition matrix of (a):
wherein:
representing electronic archive->Candidate words->Probability transition matrices of (a);
representing electronic archive->Is a probability transition matrix of (a).
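A sketch of building the row-stochastic probability transition matrix; the jump weight combining the TFIDF and position features is passed in as a precomputed dict, since its exact composition is not specified in the recoverable text:

```python
def transition_matrix(graph_adj, weight):
    """Row-stochastic transition matrix: the probability of jumping from
    word g to neighbour h is h's jump weight normalised over g's
    neighbours; non-neighbours get probability 0."""
    words = sorted(graph_adj)
    P = []
    for g in words:
        total = sum(weight[h] for h in graph_adj[g]) or 1.0  # avoid /0 for isolated nodes
        P.append([weight[h] / total if h in graph_adj[g] else 0.0
                  for h in words])
    return words, P

adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
w = {"a": 2.0, "b": 1.0, "c": 1.0}
words, P = transition_matrix(adj, w)
print(P)
# → [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
```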
Optionally, in the step S3, iterative computation is performed on the probability transition matrix to obtain an initial score of the candidate word, including:
iterative computation is carried out on the probability transition matrix to obtain initial scores of candidate words, wherein the initial scores are electronic filesThe initial score calculation flow of the candidate words is as follows:
s31: according to the electronic fileProbability transition matrix->Generating an electronic archive->Weights of each candidate word in (a) wherein the candidate word is +.>The weight of (2) is:
wherein:
representing electronic archive->Candidate words->Weights of (2);
represents the damping coefficient, will->Set to 0.8;
s32: composing an electronic fileMiddle->Weight matrix of individual candidate words +.>
S33: setting the current iteration times of the weight matrix as t, and setting the initial value of t as 0, wherein the t-th iteration result of the weight matrix is that
S34: iterating the weight matrix, wherein the iteration formula of the weight matrix is as follows:
if it isIf the iteration number is smaller than the preset iteration threshold value, the iteration is stopped, and +.>As the final weight matrix, the element value of the mth column of the final weight matrix is the electronic file +.>Candidate words->Initial score of (2)
No order of noThe process returns to step S34.
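The iteration of S31–S34 is a damped, PageRank-style power iteration; a sketch with damping d = 0.8 and a convergence threshold follows (the uniform initial weights are an assumption):

```python
def initial_scores(words, P, d=0.8, tol=1e-8, max_iter=200):
    """Damped power iteration: W ← (1-d)/m + d·W·P, stopped once the
    total weight change falls below `tol`; the converged weights are
    the candidates' initial scores."""
    m = len(words)
    W = [1.0 / m] * m  # assumed uniform initialization
    for _ in range(max_iter):
        Wn = [(1 - d) / m + d * sum(W[g] * P[g][h] for g in range(m))
              for h in range(m)]
        if sum(abs(a - b) for a, b in zip(W, Wn)) < tol:
            W = Wn
            break
        W = Wn
    return dict(zip(words, W))

words = ["a", "b", "c"]
P = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
scores = initial_scores(words, P)
print(scores)
```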
Optionally, in the step S4, a K-nucleolus graph is extracted from the word graph model, including:
extracting a K nuclear graph from the constructed word graph model, wherein the word graph modelThe extraction flow of the K nuclear map in the (2) is as follows:
s41: initializing k=0, initializing a setWord graph model->Store to the collection->In, and initialize the collectionSet->Is empty;
s42: pair aggregationTraversing all the word graph model nodes in the model, screening to obtain word graph model nodes with node degree smaller than k, and storing the screened word graph model nodes into a set ∈>As the kth level node in the K kernel graph;
selecting nodes of the word graph model and edges connected with the nodes from a setPerforming medium recursion deletion;
s43: if it isLess than K, let->The process returns to step S42. In the embodiment of the invention, the node degree is the number of edges connected with the node.
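The K-core extraction of S41–S43 can be sketched as a recursive-deletion decomposition; nodes removed while the threshold is k are recorded as the k-th level, matching the flow above:

```python
def k_core_levels(adj):
    """K-core extraction: starting from k = 0, recursively delete nodes
    whose degree is smaller than k (together with their edges) and record
    them as the k-th level of the K-core graph, then increase k."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    levels, k = {}, 0
    while adj:
        removed = True
        while removed:  # recursive deletion: degrees drop as nodes leave
            removed = False
            for u in [u for u, vs in adj.items() if len(vs) < k]:
                levels[u] = k
                for v in adj[u]:
                    adj[v].discard(u)
                del adj[u]
                removed = True
        k += 1
    return levels

adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
levels = k_core_levels(adj)
print(levels)
# → {'d': 2, 'a': 3, 'b': 3, 'c': 3}
```

The pendant node "d" is shed one round earlier than the triangle a-b-c, so a higher level corresponds to a more densely associated candidate word, as the disclosure states.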
Optionally, in the step S5, calculating, based on the K-nucleograms, hierarchical features and average information entropy features of candidate words in the electronic contract, including:
calculating to obtain hierarchical characteristics and average information entropy characteristics of candidate words in the electronic contract based on the extracted K nuclear graph, wherein the electronic archiveCandidate words->The calculation flow of the hierarchical features and the average information entropy features is as follows:
acquiring electronic filesCandidate words->Hierarchical features of the corresponding node in the K kernel sub-graph +.>
Wherein:
representation of word graph model->Chinese and candidate words->The corresponding node has a node set of edges,e represents node set +.>Any node in (a);
representing the number of layers of the computing node in the K kernel subgraph;
acquiring electronic filesCandidate words->Average information entropy feature->
Wherein:
representing candidate words->In electronic files->The number of occurrences in the sentence-word segmentation processing result;
n represents the total number of electronic files;
representing candidate words->The number of occurrences in all electronic archive clause word segmentation processing results.
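One plausible reading of the average-information-entropy feature (an assumption: the Shannon entropy of the word's occurrence distribution over the N archives, divided by N; a word that repeats heavily inside few archives scores near zero and is filtered out):

```python
import math

def avg_entropy(word, corpus):
    """Assumed form of the average-information-entropy feature:
    Shannon entropy of the word's occurrence distribution across all
    archives, averaged by the number of archives N."""
    counts = [doc.count(word) for doc in corpus]
    total = sum(counts)
    if total == 0:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c)
    return h / len(corpus)

corpus = [["archive", "archive", "system"], ["system", "design"]]
print(avg_entropy("system", corpus), avg_entropy("archive", corpus))
```

"archive" occurs often but only in one archive, so its feature is 0, while "system", spread across archives, scores higher.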
Optionally, in the step S6, determining the electronic archive keyword by fusing the hierarchical feature and the average information entropy feature based on the initial score of the candidate word includes:
determining an electronic archive keyword based on the candidate word initial score by fusing hierarchical features and average information entropy features, wherein the electronic archiveThe determining procedure of the medium keywords is as follows:
computing electronic filesKeyword score of candidate words:
wherein:
representing electronic archive->Candidate words->Keyword scores of (2);
representing electronic archive->Candidate words->Is a score of the initial score of (a);
representing electronic archive->Candidate words->Is a fusion hierarchy feature of (1);
representing electronic archive->Candidate words->Is characterized by the average information entropy;
selecting candidate words with highest keyword score as an electronic file5 candidate words with the highest keyword score are selected as the electronic archive +.>Is a key word of (a).
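Assuming a multiplicative fusion of the three quantities (the patent's exact fusion formula is not recoverable from this text), keyword selection can be sketched as:

```python
def top_keywords(initial, level, entropy, n=5):
    """Fuse initial score, hierarchy feature and average-information-
    entropy feature into a keyword score (multiplicative fusion is an
    assumption) and keep the n highest-scoring candidates."""
    score = {w: initial[w] * level[w] * entropy[w] for w in initial}
    return sorted(score, key=score.get, reverse=True)[:n]

initial = {"archive": 0.48, "system": 0.26, "design": 0.26}
level = {"archive": 3, "system": 3, "design": 2}
entropy = {"archive": 0.35, "system": 0.35, "design": 0.1}
print(top_keywords(initial, level, entropy, n=2))
# → ['archive', 'system']
```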
Optionally, in the step S7, file classification is performed according to the electronic file keywords, and file quick retrieval based on the keywords is performed, including:
electronic archive based on candidate word pairs with highest keyword scoresClassifying and adding electronic files->Keywords as electronic archive->And (5) searching the electronic file rapidly by inquiring the search term.
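A minimal inverted-index sketch of the keyword-based classification and fast retrieval described in S7:

```python
from collections import defaultdict

def build_index(archive_keywords):
    """Group (classify) archives by their extracted keywords and use the
    same keywords as query terms in an inverted index for fast lookup."""
    index = defaultdict(set)
    for archive_id, keywords in archive_keywords.items():
        for kw in keywords:
            index[kw].add(archive_id)
    return index

index = build_index({"D1": ["archive", "system"], "D2": ["system", "design"]})
print(sorted(index["system"]))
# → ['D1', 'D2']
```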
In order to solve the above-described problems, the present invention provides an electronic apparatus including:
a memory storing at least one instruction;
a communication interface, configured to realize communication of the electronic device; and
a processor that executes the instructions stored in the memory to implement the above archive intelligent classification and retrieval method.
In order to solve the above-mentioned problems, the present invention further provides a computer readable storage medium, where at least one instruction is stored, where the at least one instruction is executed by a processor in an electronic device to implement the above-mentioned archive intelligent classification and retrieval method.
Compared with the prior art, the invention provides an intelligent file classifying and searching method, which has the following advantages:
firstly, the scheme provides a word importance quantization mode, and a probability transition matrix among candidate words in an electronic file is calculated according to the TFIDF characteristics of the electronic file and the position characteristics of the candidate words, wherein the electronic fileThe probability transition matrix calculation flow of (1) is as follows:
calculating to obtain electronic fileCandidate words->Jump to candidate word +.>Jump probability>
Wherein:Representing candidate words->Jump weights of (2); generating an electronic archive according to the probability of jumping>Is a probability transition matrix of (a):
wherein:Representing electronic archive->Candidate words->Probability transition matrices of (a);Representing electronic archive->Is a probability transition matrix of (a). Iterative calculation is carried out on the probability transition matrix to obtain an initial score of the candidate word, wherein the initial score is electronic archive +.>The initial score calculation flow of the candidate words is as follows: according to electronic files->Probability transition matrix->Generating an electronic archive->Weights of each candidate word in (a) wherein the candidate word is +.>The weight of (2) is:
wherein:Representing electronic archive->Candidate words->Weights of (2);Represents the damping coefficient, will->Set to 0.8; form an electronic file->Middle->Weight matrix of individual candidate words +.>
Setting the current iteration times of the weight matrix as t, and setting the initial value of t as 0, wherein the t-th iteration result of the weight matrix is +.>The method comprises the steps of carrying out a first treatment on the surface of the Iterating the weight matrix, wherein the iteration formula of the weight matrix is as follows:
if->If the iteration number is smaller than the preset iteration threshold value, the iteration is stopped, and +.>As the final weight matrix, the element value of the mth column of the final weight matrix is the electronic file +.>Candidate words->Is +.>. According to the scheme, the position distance between candidate words is obtained through calculation according to the sentence numbers and the interval word numbers of different candidate words in an electronic file, the edge relation between the candidate words is established based on comparison of the position distance and a window threshold value, a word graph model is obtained through construction, candidate word position features representing the position importance and the word frequency importance of the candidate words are generated, TFIDF features are calculated according to the generated features, the jump probability between the candidate words is obtained through calculation, a probability transition matrix is initialized, the importance weight of the candidate words is determined according to the iterative calculation result of the probability transition matrix, and the importance quantization of the candidate words is calculated as the initial score of the candidate words.
Meanwhile, the scheme provides a method for determining the keywords of the electronic file, which extracts the K nuclear map from the constructed word map model, wherein the word map modelThe extraction flow of the K nuclear map in the (2) is as follows: initializing k=0, initializing set +.>Word graph model->Store to the collection->And initialize the set +.>Set->Is empty; for the collection->Traversing all the word graph model nodes in the model, screening to obtain word graph model nodes with node degree smaller than k, and storing the screened word graph model nodes into a set ∈>As the kth level node in the K kernel graph; selecting nodes of the word graph model and edges connected with the nodes from the set +.>Performing medium recursion deletion; if->Less than K, let->. Calculating to obtain the hierarchical characteristics and average information entropy characteristics of candidate words in the electronic contract based on the extracted K nuclear graph, wherein the electronic archive +.>Candidate words->The calculation flow of the hierarchical features and the average information entropy features is as follows:
acquiring electronic filesCandidate words->Hierarchical features of the corresponding node in the K kernel sub-graph +.>
Wherein:Representation of word graph model->Chinese and candidate wordsNode set with edges corresponding to the nodes, +.>E represents node set +.>Any node in (a);representing the number of layers of the computing node in the K kernel subgraph; obtain electronic archive->Candidate words->Average information entropy feature->
Wherein:Representing candidate words->In electronic files->The number of occurrences in the sentence-word segmentation processing result; n represents the total number of electronic files;representing candidate words->The number of occurrences in all electronic archive clause word segmentation processing results. According to the scheme, the belonging level of the candidate word in the K kernel subgraph is determined according to the node degree of the candidate word in the word graph model, wherein the higher the node degree is, the higher the association degree of the candidate word and other candidate words is, the higher the belonging level is, the candidate words with more occurrence times and less information quantity are filtered out by utilizing the average information entropy feature, the determination of the keywords in the electronic archive is realized, the archive classification is carried out according to the electronic archive keywords, and the quick archive retrieval based on the keywords is supported.
Drawings
FIG. 1 is a flowchart of an intelligent classification and retrieval method for files according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an electronic device for implementing the intelligent classification and retrieval method for files according to an embodiment of the invention.
In the figure: 1 an electronic device, 10 a processor, 11 a memory, 12 a program, 13 a communication interface.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides an intelligent file classification and retrieval method. The execution subject of the archive intelligent classification and retrieval method includes, but is not limited to, at least one of a server, a terminal, and the like, which can be configured to execute the method provided by the embodiment of the application. In other words, the archive intelligent classification and retrieval method may be performed by software or hardware installed in a terminal device or a server device, where the software may be a blockchain platform. The service end includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Example 1:
S1: construct the electronic archive as a word-graph model, and calculate the TFIDF feature and the candidate-word position feature of each candidate word in the electronic archive according to the word-graph model.
In step S1, the electronic archive is constructed into a word-graph model through the following steps:
constructing each electronic archive as a word-graph model, where the construction flow for the i-th electronic archive D_i is as follows:
S11: perform sentence segmentation on electronic archive D_i, with result D_i = {s_{i,1}, s_{i,2}, ..., s_{i,J_i}},
where:
s_{i,j} represents the word-segmentation result of the j-th sentence in D_i, and J_i represents the number of sentences in D_i;
s_{i,j} = {w_{j,1}, w_{j,2}, ..., w_{j,n_j}}, where w_{j,u} represents the u-th word in s_{i,j} and n_j represents the total number of words in s_{i,j};
S12: extract the nouns from the sentence-and-word segmentation result of D_i as the candidate words of D_i; the de-duplicated candidate word set of D_i is word_i = {word_{i,1}, word_{i,2}, ..., word_{i,m_i}},
where:
word_{i,g} represents the g-th candidate word of D_i, and m_i represents the number of candidate words of D_i;
S13: take the candidate words as the nodes of the word-graph model and calculate the position distance between different nodes; if the position distance between two nodes is smaller than a preset window threshold, an edge exists between the two nodes, otherwise no edge exists. The position distance between candidate words word_{i,g} and word_{i,h} is dist(word_{i,g}, word_{i,h}), computed as an exponential function with the natural constant e as its base of the number of words separating the two candidate words, adjusted by the numbers of sentences in which each candidate word appears,
where:
dist(word_{i,g}, word_{i,h}) represents the position distance between word_{i,g} and word_{i,h}, g, h ∈ {1, 2, ..., m_i};
d(word_{i,g}, word_{i,h}) represents the number of words between word_{i,g} and word_{i,h} in electronic archive D_i;
num(word_{i,g}) represents the number of times word_{i,g} occurs in the sentence-and-word segmentation result of D_i;
sen(word_{i,g}) represents the number of sentences of D_i in which word_{i,g} appears, sen(word_{i,g}) ≤ num(word_{i,g}).
S14: form the word-graph model from the nodes and the edge information between nodes; the word-graph model of electronic archive D_i is G_i = (V_i, E_i),
where:
G_i represents the word-graph model of D_i, containing the node information V_i of the m_i candidate words and the edge information E_i between candidate-word nodes;
E_i represents the edge information between different candidate words in the word-graph model;
if dist(word_{i,g}, word_{i,h}) is smaller than the preset window threshold, an edge between candidate words word_{i,g} and word_{i,h} exists in the word-graph model; otherwise no edge exists between them.
In step S1, calculating the TFIDF feature and the candidate-word position feature of each candidate word in the electronic archive according to the word-graph model comprises the following steps:
calculating the TFIDF feature and the candidate-word position feature of each candidate word from the constructed word-graph model, where for the g-th candidate word word_{i,g} of electronic archive D_i the TFIDF feature is

TFIDF(word_{i,g}) = (num(word_{i,g}) / num_i) × log(N / N(word_{i,g}))

where:
num(word_{i,g}) represents the number of times word_{i,g} occurs in the sentence-and-word segmentation result of D_i;
num_i represents the total number of words in the sentence-and-word segmentation result of D_i;
N represents the total number of electronic archives;
N(word_{i,g}) represents the number of electronic archives in which word_{i,g} appears;
TFIDF(word_{i,g}) represents the TFIDF feature of word_{i,g};
pos(word_{i,g}) represents the candidate-word position feature of word_{i,g}, which characterizes the positional importance and word-frequency importance of word_{i,g} in D_i.
S2: calculate the probability transition matrix among the candidate words in the electronic archive according to the TFIDF features and the candidate-word position features.
In step S2, calculating the probability transition matrix among the candidate words in the electronic archive comprises:
calculating the probability transition matrix among the candidate words in the electronic archive from the computed TFIDF features and candidate-word position features, where the calculation flow for electronic archive D_i is as follows:
compute the probability p_{g,h} that candidate word word_{i,g} jumps to candidate word word_{i,h}:

p_{g,h} = q(word_{i,h}) / Σ_{word_{i,v} ∈ adj(word_{i,g})} q(word_{i,v}) if an edge exists between word_{i,g} and word_{i,h}, and p_{g,h} = 0 otherwise,

where:
q(word_{i,h}) represents the jump weight of candidate word word_{i,h}, obtained from its TFIDF feature and candidate-word position feature, and adj(word_{i,g}) represents the set of candidate words connected to word_{i,g} by an edge in the word-graph model;
generate the probability transition matrix of electronic archive D_i from the jump probabilities:

P_i = [p_{g,h}], g, h ∈ {1, 2, ..., m_i},

where:
P_{i,g} = (p_{g,1}, p_{g,2}, ..., p_{g,m_i}) represents the transition probability vector of candidate word word_{i,g};
P_i represents the probability transition matrix of electronic archive D_i.
S3: iterate the probability transition matrix to obtain the initial score of each candidate word.
And in the step S3, iterative computation is carried out on the probability transition matrix to obtain an initial score of the candidate word, and the method comprises the following steps:
iterative computation is carried out on the probability transition matrix to obtain initial scores of candidate words, wherein the initial scores are electronic filesThe initial score calculation flow of the candidate words is as follows:
s31: according to the electronic fileProbability transition matrix->Generating an electronic archive->Weights of each candidate word in (a) wherein the candidate word is +.>The weight of (2) is:
wherein:
representing electronic archive->Candidate words->Weights of (2);
represents the damping coefficient, will->Set to 0.8;
s32: composing an electronic fileMiddle->Weight matrix of individual candidate words +.>
S33: setting the current iteration times of the weight matrix as t, and setting the initial value of t as 0, wherein the t-th iteration result of the weight matrix is that
S34: iterating the weight matrix, wherein the iteration formula of the weight matrix is as follows:
if->If the iteration number is smaller than the preset iteration threshold value, the iteration is stopped, and +.>As the final weight matrix, the element value of the mth column of the final weight matrix is the electronic file +.>Candidate words->Is +.>
No order of noThe process returns to step S34.
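Steps S31 to S34 amount to a damped PageRank/TextRank-style iteration. A compact, illustrative sketch follows, with the damping fixed at 0.8 as in the text and a simple maximum-difference stopping test standing in for the unspecified threshold check:

```python
def initial_scores(order, P, d=0.8, tol=1e-6, max_iter=100):
    # Iterate s <- (1-d)/M + d * P^T s until the largest per-word
    # change falls below `tol`, then report a score per word.
    m = len(order)
    s = [1.0 / m] * m
    for _ in range(max_iter):
        nxt = [(1 - d) / m + d * sum(P[j][i] * s[j] for j in range(m))
               for i in range(m)]
        done = max(abs(a - b) for a, b in zip(nxt, s)) < tol
        s = nxt
        if done:
            break
    return dict(zip(order, s))
```

Because the transition matrix is row-stochastic, the damped update keeps the total score mass stable, so the iteration converges for any starting vector.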
S4: extracting a K-core graph from the constructed word graph model.
In step S4, extracting the K-core graph from the word graph model comprises the following steps:
extracting the K-core graph from the constructed word graph model, wherein the extraction flow of the K-core graph from word graph model G_i is as follows:
S41: initializing k = 0, initializing a set C by storing word graph model G_i into set C, and initializing a set H_k to be empty;
S42: traversing all word graph model nodes in set C, screening out the word graph model nodes whose node degree is smaller than k, and storing the screened word graph model nodes into set H_k as the k-th level nodes in the K-core graph;
recursively deleting the screened word graph model nodes and the edges connected to them from set C;
S43: if k is less than K, letting k = k + 1 and returning to step S42.
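The peeling procedure of S41 to S43 is the classic k-core decomposition. A stdlib-only sketch is given below; naming is illustrative, and note that a node removed while testing degree < k is recorded at level k (its conventional coreness would be k - 1):

```python
def core_levels(nodes, edges, k_max):
    # For k = 0, 1, ..., k_max: repeatedly delete nodes of degree < k
    # (and their edges); record the k at which each node fell out.
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    level = {}
    for k in range(k_max + 1):
        while True:
            doomed = [v for v in adj if len(adj[v]) < k]
            if not doomed:
                break
            for v in doomed:
                level[v] = k
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
        if not adj:
            break
    return level
```

Words buried deep in the peeling (high level) sit in densely connected regions of the word graph, which is why the level is usable as a feature in step S5.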
S5: calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph.
In step S5, calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the K-core graph comprises the following steps:
calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph, wherein the calculation flow of the hierarchical feature and average information entropy feature of candidate word w_m in electronic archive d_i is as follows:
acquiring the hierarchical feature h_m of the node corresponding to candidate word w_m of electronic archive d_i in the K-core subgraph;
wherein:
E_m represents the set of nodes that share an edge with the node corresponding to candidate word w_m in word graph model G_i, and e represents any node in E_m;
layer(·) represents the function computing the layer number of a node in the K-core subgraph;
acquiring the average information entropy feature ent_m of candidate word w_m in electronic archive d_i;
wherein:
c_m represents the number of occurrences of candidate word w_m in the sentence-segmentation result of electronic archive d_i;
n represents the total number of electronic archives;
C_m represents the number of occurrences of candidate word w_m in the sentence-segmentation results of all electronic archives.
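The average information entropy feature measures how a candidate word's occurrences spread across the n archives. The exact formula in the text survives only as an image, so the sketch below uses a standard Shannon entropy over the per-archive occurrence distribution, normalised by the number of archives — an assumption for illustration, not the patent's literal expression:

```python
import math

def avg_entropy_feature(counts_per_archive, word):
    # counts_per_archive: one {word: count} dict per electronic archive.
    # Entropy of the word's occurrence distribution across archives,
    # divided by the number of archives.
    counts = [c.get(word, 0) for c in counts_per_archive]
    total = sum(counts)
    if total == 0:
        return 0.0
    ent = -sum((c / total) * math.log(c / total) for c in counts if c)
    return ent / len(counts_per_archive)
```

Under this reading, a word spread evenly over many archives scores a higher entropy than one concentrated in a single archive.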
S6: determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the initial scores of the candidate words.
In step S6, determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the candidate word initial scores comprises the following steps:
determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the candidate word initial scores, wherein the determination flow of the keywords in electronic archive d_i is as follows:
calculating the keyword score of each candidate word of electronic archive d_i;
wherein:
ks_m represents the keyword score of candidate word w_m in electronic archive d_i;
score_m represents the initial score of candidate word w_m in electronic archive d_i;
h_m represents the fused hierarchical feature of candidate word w_m in electronic archive d_i;
ent_m represents the average information entropy feature of candidate word w_m in electronic archive d_i;
selecting the candidate words with the highest keyword scores as the keywords of electronic archive d_i, namely selecting the 5 candidate words with the highest keyword scores as the keywords of electronic archive d_i.
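Fusing the three signals and keeping the five best-scoring words can be sketched as below. The unweighted sum is an illustrative assumption, since the fusion formula itself appears only as an image in the source:

```python
def top_keywords(initial, hier, ent, k=5):
    # Keyword score = initial score + hierarchical feature
    #                + average information entropy feature.
    score = {w: initial[w] + hier.get(w, 0.0) + ent.get(w, 0.0)
             for w in initial}
    ranked = sorted(score, key=score.get, reverse=True)
    return ranked[:k]
```

In practice the three terms would likely carry tunable coefficients so that no single feature dominates the ranking.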
S7: performing archive classification according to the electronic archive keywords and supporting quick keyword-based archive retrieval.
In step S7, performing archive classification according to the electronic archive keywords and performing quick keyword-based archive retrieval comprises:
classifying electronic archive d_i based on the candidate word with the highest keyword score, and adding the keywords of electronic archive d_i as query search terms of electronic archive d_i to realize quick retrieval of the electronic archive.
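A minimal sketch of step S7: file each archive under its top keyword and serve keyword queries from an inverted index. The class and method names are illustrative, not from the patent:

```python
from collections import defaultdict

class ArchiveIndex:
    def __init__(self):
        self.classes = defaultdict(list)   # top keyword -> archive ids
        self.inverted = defaultdict(set)   # any keyword -> archive ids

    def add(self, archive_id, keywords):
        # keywords: ranked list, best first; classify by the top one.
        if keywords:
            self.classes[keywords[0]].append(archive_id)
        for w in keywords:
            self.inverted[w].add(archive_id)

    def search(self, query_terms):
        # Return archives labelled with every queried keyword.
        hits = [self.inverted[t] for t in query_terms if t in self.inverted]
        return set.intersection(*hits) if hits else set()
```

Because lookups go through the inverted index rather than a scan of archive contents, retrieval cost depends only on the number of archives sharing the queried keywords.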
Example 2:
fig. 2 is a schematic structural diagram of an electronic device for implementing an intelligent classification and retrieval method for files according to an embodiment of the invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication interface 13 and a bus, and may further comprise a computer program, such as program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memories (e.g., SD or DX memory), magnetic memories, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the program 12, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a control unit (control unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules (programs 12 for implementing intelligent classification and retrieval of files, etc.) stored in the memory 11, and calling data stored in the memory 11.
The communication interface 13 may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the electronic device 1 and other electronic devices and to enable connection communication between internal components of the electronic device.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 2 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the described embodiments are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
constructing the electronic archive as a word graph model, and calculating TFIDF characteristics and candidate word position characteristics of each candidate word in the electronic archive according to the word graph model;
calculating a probability transition matrix among candidate words in the electronic file according to the TFIDF characteristics of the electronic file and the position characteristics of the candidate words;
performing iterative computation on the probability transition matrix to obtain an initial score of the candidate word;
extracting a K-core graph from the constructed word graph model;
calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph;
on the basis of the initial score of the candidate word, determining an electronic archive keyword by fusing the hierarchical characteristics and the average information entropy characteristics;
and classifying files according to the electronic file keywords and supporting quick file retrieval based on the keywords.
Specifically, the specific implementation method of the above instruction by the processor 10 may refer to descriptions of related steps in the corresponding embodiments of fig. 1 to 2, which are not repeated herein.
It should be noted that, the foregoing reference numerals of the embodiments of the present invention are merely for describing the embodiments, and do not represent the advantages and disadvantages of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (7)

1. An intelligent classification and retrieval method for files, characterized by comprising the following steps:
s1: constructing the electronic archive as a word graph model, and calculating the TFIDF feature and candidate word position feature of each candidate word in the electronic archive according to the word graph model;
s2: calculating a probability transition matrix among the candidate words in the electronic archive according to the TFIDF features and the candidate word position features of the electronic archive;
s3: performing iterative computation on the probability transition matrix to obtain the initial scores of the candidate words;
s4: extracting a K-core graph from the constructed word graph model;
s5: calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph;
s6: determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the initial scores of the candidate words;
s7: performing archive classification according to the electronic archive keywords and supporting quick keyword-based archive retrieval;
in the step S1, constructing the electronic archive as a word graph model comprises the following steps:
constructing the electronic archive as a word graph model, wherein the word graph model construction flow for the i-th electronic archive d_i is as follows:
s11: performing sentence segmentation and word segmentation on electronic archive d_i, wherein the processing result is as follows:
wherein:
S_j represents the word segmentation result of the j-th sentence in electronic archive d_i, and J represents the number of sentences in electronic archive d_i;
w_j^u represents the u-th word in word segmentation result S_j, and U_j represents the total number of words in word segmentation result S_j;
s12: extracting nouns from the sentence/word segmentation results of electronic archive d_i as candidate words of electronic archive d_i, wherein the de-duplicated candidate word set of electronic archive d_i is:
wherein:
w_m represents the m-th candidate word in electronic archive d_i, and M represents the number of candidate words of electronic archive d_i;
s13: taking the candidate words as nodes of the word graph model and calculating the position distance between different nodes; if the position distance between two nodes is smaller than a preset window threshold, an edge exists between the two nodes, otherwise no edge exists, wherein the position distance between candidate words w_m and w_n is:
wherein:
exp(·) represents an exponential function with the natural constant as its base;
dis(w_m, w_n) represents the position distance between candidate words w_m and w_n, with m ≠ n;
num(w_m, w_n) represents the number of words between candidate words w_m and w_n in electronic archive d_i;
c_m represents the number of occurrences of candidate word w_m in the sentence-segmentation result of electronic archive d_i; sen_m represents the number of sentences in the sentence-segmentation result of electronic archive d_i in which candidate word w_m appears;
s14: composing the nodes and the edge information between the nodes into the word graph model, the word graph model of electronic archive d_i being:
wherein:
G_i represents the word graph model of electronic archive d_i, which contains the M candidate word nodes and the edge information between the candidate word nodes;
E represents the edge information between different candidate words in the word graph model;
if dis(w_m, w_n) is smaller than the preset window threshold, an edge exists between candidate words w_m and w_n in the word graph model; otherwise, no edge exists between candidate words w_m and w_n in the word graph model;
in the step S1, calculating the TFIDF feature and candidate word position feature of each candidate word in the electronic archive according to the word graph model comprises the following steps:
calculating the TFIDF feature and candidate word position feature of each candidate word in the electronic archive according to the constructed word graph model, wherein the calculation formulas of the TFIDF feature and candidate word position feature of the m-th candidate word w_m in electronic archive d_i are:
wherein:
c_m represents the number of occurrences of candidate word w_m in the sentence-segmentation result of electronic archive d_i;
L_i represents the total number of words in the sentence-segmentation result of electronic archive d_i;
n represents the total number of electronic archives;
n_m represents the number of electronic archives that contain candidate word w_m;
tfidf_m represents the TFIDF feature of candidate word w_m;
loc_m represents the candidate word position feature of candidate word w_m.
2. The intelligent archive classification and retrieval method according to claim 1, wherein the step S2 of calculating the probability transition matrix among the candidate words in the electronic archive comprises:
calculating the probability transition matrix among the candidate words in the electronic archive according to the calculated TFIDF features and candidate word position features, wherein the calculation flow of the probability transition matrix of electronic archive d_i is as follows:
calculating the jump probability p(w_m, w_n) of jumping from candidate word w_m to candidate word w_n in electronic archive d_i;
wherein:
q_n represents the jump weight of candidate word w_n;
generating the probability transition matrix of electronic archive d_i according to the jump probabilities:
wherein:
P_m represents the transition-probability row vector of candidate word w_m in electronic archive d_i;
P_i represents the probability transition matrix of electronic archive d_i.
3. The intelligent archive classification and retrieval method according to claim 2, wherein in step S3, performing iterative computation on the probability transition matrix to obtain the initial scores of the candidate words comprises:
performing iterative computation on the probability transition matrix to obtain the initial scores of the candidate words, wherein the initial score calculation flow for the candidate words of electronic archive d_i is as follows:
s31: generating a weight for each candidate word in electronic archive d_i according to the probability transition matrix P_i, wherein the weight of candidate word w_m is:
wherein:
s_m represents the weight of candidate word w_m in electronic archive d_i;
d represents the damping coefficient, with d set to 0.8;
s32: composing the weights of the M candidate words in electronic archive d_i into a weight matrix S;
s33: setting the current iteration number of the weight matrix to t, with the initial value of t set to 0, wherein the result of the t-th iteration of the weight matrix is S_t;
s34: iterating the weight matrix, wherein the iteration formula updates the weight matrix through the damped probability transition matrix, i.e. S_{t+1} = (1-d)/M + d * P_i^T * S_t;
if the difference between S_{t+1} and S_t is smaller than a preset iteration threshold, the iteration is stopped and S_{t+1} is taken as the final weight matrix; the element value in the m-th column of the final weight matrix is the initial score score_m of candidate word w_m in electronic archive d_i;
otherwise, let t = t + 1 and return to step S34.
4. The intelligent archive classification and retrieval method according to claim 1, wherein the step S4 of extracting the K-core graph from the word graph model comprises:
extracting the K-core graph from the constructed word graph model, wherein the extraction flow of the K-core graph from word graph model G_i is as follows:
s41: initializing k = 0, initializing a set C by storing word graph model G_i into set C, and initializing a set H_k to be empty;
s42: traversing all word graph model nodes in set C, screening out the word graph model nodes whose node degree is smaller than k, and storing the screened word graph model nodes into set H_k as the k-th level nodes in the K-core graph;
recursively deleting the screened word graph model nodes and the edges connected to them from set C;
s43: if k is less than K, letting k = k + 1 and returning to step S42.
5. The intelligent archive classification and retrieval method according to claim 4, wherein in step S5, calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the K-core graph comprises the following steps:
calculating the hierarchical features and average information entropy features of the candidate words in the electronic archive based on the extracted K-core graph, wherein the calculation flow of the hierarchical feature and average information entropy feature of candidate word w_m in electronic archive d_i is as follows:
acquiring the hierarchical feature h_m of the node corresponding to candidate word w_m of electronic archive d_i in the K-core subgraph;
wherein:
E_m represents the set of nodes that share an edge with the node corresponding to candidate word w_m in word graph model G_i, and e represents any node in E_m;
layer(·) represents the function computing the layer number of a node in the K-core subgraph;
acquiring the average information entropy feature ent_m of candidate word w_m in electronic archive d_i;
wherein:
c_m represents the number of occurrences of candidate word w_m in the sentence-segmentation result of electronic archive d_i;
n represents the total number of electronic archives;
C_m represents the number of occurrences of candidate word w_m in the sentence-segmentation results of all electronic archives.
6. The intelligent archive classification and retrieval method according to claim 1, wherein in step S6, determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the initial scores of the candidate words comprises:
determining the electronic archive keywords by fusing the hierarchical features and the average information entropy features on the basis of the candidate word initial scores, wherein the determination flow of the keywords in electronic archive d_i is as follows:
calculating the keyword score of each candidate word of electronic archive d_i;
wherein:
ks_m represents the keyword score of candidate word w_m in electronic archive d_i;
score_m represents the initial score of candidate word w_m in electronic archive d_i;
h_m represents the fused hierarchical feature of candidate word w_m in electronic archive d_i;
ent_m represents the average information entropy feature of candidate word w_m in electronic archive d_i;
selecting the candidate words with the highest keyword scores as the keywords of electronic archive d_i, namely selecting the 5 candidate words with the highest keyword scores as the keywords of electronic archive d_i.
7. The intelligent archive classification and retrieval method according to claim 6, wherein in step S7, performing archive classification according to the electronic archive keywords and performing quick keyword-based archive retrieval comprises:
classifying electronic archive d_i based on the candidate word with the highest keyword score, and adding the keywords of electronic archive d_i as query search terms of electronic archive d_i to realize quick retrieval of the electronic archive.
CN202311204538.XA 2023-09-19 2023-09-19 Intelligent classification and retrieval method for files Active CN117216217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311204538.XA CN117216217B (en) 2023-09-19 2023-09-19 Intelligent classification and retrieval method for files

Publications (2)

Publication Number Publication Date
CN117216217A CN117216217A (en) 2023-12-12
CN117216217B true CN117216217B (en) 2024-03-22

Family

ID=89038572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311204538.XA Active CN117216217B (en) 2023-09-19 2023-09-19 Intelligent classification and retrieval method for files

Country Status (1)

Country Link
CN (1) CN117216217B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945228A (en) * 2012-10-29 2013-02-27 广西工学院 Multi-document summarization method based on text segmentation
CN111079145A (en) * 2019-12-04 2020-04-28 中南大学 Malicious program detection method based on graph processing
CN111460796A (en) * 2020-03-30 2020-07-28 北京航空航天大学 Accidental sensitive word discovery method based on word network
CN113656429A (en) * 2021-07-28 2021-11-16 广州荔支网络技术有限公司 Keyword extraction method and device, computer equipment and storage medium
CN113822072A (en) * 2021-09-24 2021-12-21 广州博冠信息科技有限公司 Keyword extraction method and device and electronic equipment
CN116089620A (en) * 2023-04-07 2023-05-09 日照蓝鸥信息科技有限公司 Electronic archive data management method and system
CN116662479A (en) * 2023-04-27 2023-08-29 浙江工业大学 Text matching method for medical insurance catalogs

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10366117B2 (en) * 2011-12-16 2019-07-30 Sas Institute Inc. Computer-implemented systems and methods for taxonomy development
US20210326525A1 (en) * 2020-04-16 2021-10-21 Pusan National University Industry-University Cooperation Foundation Device and method for correcting context sensitive spelling error using masked language model

Non-Patent Citations (2)

Title
Rudolf A. Braun et al. "A Comparison of Methods for OOV-Word Recognition on a New Public Dataset". ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1-4. *
Zhang Jie et al. "A Comparative Analysis of Domestic and Foreign Mobile Learning Research Hotspots Based on Word Frequency Analysis and Visualized Co-word Network Diagrams". Modern Distance Education. 2014, pp. 76-83. *

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
WO2022121171A1 (en) Similar text matching method and apparatus, and electronic device and computer storage medium
CN109446517B (en) Reference resolution method, electronic device and computer readable storage medium
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
WO2021052148A1 (en) Contract sensitive word checking method and apparatus based on artificial intelligence, computer device, and storage medium
CN113095076A (en) Sensitive word recognition method and device, electronic equipment and storage medium
WO2021051864A1 (en) Dictionary expansion method and apparatus, electronic device and storage medium
CN112633000B (en) Method and device for associating entities in text, electronic equipment and storage medium
CN114780746A (en) Knowledge graph-based document retrieval method and related equipment thereof
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN113660541B (en) Method and device for generating abstract of news video
CN112883730A (en) Similar text matching method and device, electronic equipment and storage medium
CN116797195A (en) Work order processing method, apparatus, computer device, and computer readable storage medium
CN113094538A (en) Image retrieval method, device and computer-readable storage medium
CN114461783A (en) Keyword generation method and device, computer equipment, storage medium and product
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN112579781B (en) Text classification method, device, electronic equipment and medium
CN112529743B (en) Contract element extraction method, device, electronic equipment and medium
CN117216217B (en) Intelligent classification and retrieval method for files
CN114969385B (en) Knowledge graph optimization method and device based on document attribute assignment entity weight
CN114416990B (en) Method and device for constructing object relation network and electronic equipment
CN113705201B (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN114741550A (en) Image searching method and device, electronic equipment and computer readable storage medium
CN114996400A (en) Referee document processing method and device, electronic equipment and storage medium
CN114416174A (en) Model reconstruction method and device based on metadata, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant