CN116662521B

CN116662521B - Electronic document screening and inquiring method and system

Info

Publication number: CN116662521B
Application number: CN202310920071.2A
Authority: CN
Inventors: 单良; 王亚平; 路阳; 江伟欢; 刘伟家; 郑楠
Original assignee: Guangdong Construction Project Quality Safety Inspection Station Co ltd
Current assignee: Guangdong Construction Project Quality Safety Inspection Station Co ltd
Priority date: 2023-07-26
Filing date: 2023-07-26
Publication date: 2023-11-14
Anticipated expiration: 2043-07-26
Also published as: CN116662521A

Abstract

An electronic document screening and query method and system, belonging to the field of information retrieval. The method comprises: connecting to a service management system and determining a search domain; acquiring a target document set based on the search domain, and constructing a search database through cross-domain association; determining a primary keyword set and a secondary keyword set based on the query requirement, and configuring a keyword matrix; traversing a plurality of search data sub-databases to match the keyword matrix, and generating a similarity matrix; setting a similarity threshold, judging the similarity matrix against it, and determining a single keyword matching result; summing the similarity matrix column by column based on the single keyword matching result to generate a similarity matching result; and performing document mapping based on the similarity matching result to determine a document query result. The application addresses the technical problem of low accuracy and efficiency of electronic document screening queries in the prior art, enables high-accuracy, dynamic and diversified querying of electronic documents, and achieves the technical effect of improving the accuracy and efficiency of electronic document screening.

Description

Electronic document screening and inquiring method and system

Technical Field

The application relates to the field of information retrieval, in particular to a method and a system for screening and inquiring electronic documents.

Background

With the development of information technology, organizations and enterprises have accumulated large amounts of electronic document data. These data are typically distributed across different business systems and are cross-domain and heterogeneous. At present, large-scale document retrieval is mainly realized in two ways: constructing a centralized index populated by crawler technology; or extracting document features from metadata and the like and searching on a feature index. However, these methods perform poorly on cross-domain, heterogeneous data, with low efficiency and low query accuracy.

Disclosure of Invention

The application provides a method and a system for screening and inquiring electronic documents, which aim to solve the technical problems of low accuracy and efficiency of screening and inquiring the electronic documents in the prior art.

In view of the above problems, the present application provides a method and a system for screening and querying electronic documents.

A first aspect of the present disclosure provides an electronic document screening and query method, comprising: connecting to a service management system and determining a search domain; acquiring a target document set based on the search domain, and constructing a search database through cross-domain association, wherein the search database is composed of a plurality of search data sub-databases, the sub-databases differ from one another in data type, and the search database is updated in real time; determining a primary keyword set and a secondary keyword set based on a query requirement, and configuring a keyword matrix, wherein the secondary keyword set is obtained by diversification processing of the primary keyword set; traversing the plurality of search data sub-databases with a similarity matching algorithm to match the keyword matrix, and generating a similarity matrix, wherein the occurrence frequency of each keyword is generated as additional information; setting a similarity threshold, judging the similarity matrix against the similarity threshold, and determining a single keyword matching result, wherein a successful match is marked 1 and a failed match is marked 0; summing the similarity matrix column by column based on the single keyword matching result to generate a similarity matching result, wherein the similarity matching result represents the comprehensive similarity between the matched keyword set and a single document in the search database; and performing document mapping based on the similarity matching result to determine a document query result.

Another aspect of the present disclosure provides an electronic document screening and query system, comprising: a search domain determining module, configured to connect to a service management system and determine a search domain; a search database construction module, configured to acquire a target document set based on the search domain and construct a search database through cross-domain association, wherein the search database is composed of a plurality of search data sub-databases, the sub-databases differ from one another in data type, and the search database is updated in real time; a keyword matrix module, configured to determine a primary keyword set and a secondary keyword set based on a query requirement and configure a keyword matrix, wherein the secondary keyword set is obtained by diversification processing of the primary keyword set; a similarity matrix module, configured to traverse the plurality of search data sub-databases with a similarity matching algorithm to match the keyword matrix and generate a similarity matrix, wherein the occurrence frequency of each keyword is generated as additional information; a keyword matching result module, configured to set a similarity threshold, judge the similarity matrix against the similarity threshold, and determine a single keyword matching result, wherein a successful match is marked 1 and a failed match is marked 0; a similarity matching result module, configured to sum the similarity matrix column by column based on the single keyword matching result to generate a similarity matching result, wherein the similarity matching result represents the comprehensive similarity between the matched keyword set and a single document in the search database; and a document query result module, configured to perform document mapping based on the similarity matching result and determine a document query result.

One or more technical schemes provided by the application have at least the following technical effects or advantages:

Because a service management system is connected and a search domain is determined, a data basis is provided for subsequently constructing the search database and performing queries. A target document set is acquired based on the search domain and a search database is constructed through cross-domain association, giving a unified search platform for large-scale documents and improving search efficiency. Rich primary and secondary keywords are obtained from the query requirement, supporting diversified query requirements. The plurality of search data sub-databases are traversed with a similarity matching algorithm to match the keyword matrix and generate a similarity matrix, and the occurrence frequency of the keywords is taken into account to achieve high-precision matching. A similarity threshold is set, the similarity matrix is judged against it, and a single keyword matching result is determined, so that only highly relevant matches are kept and query accuracy improves. Based on the single keyword matching result, the similarity matrix is summed column by column to generate a similarity matching result, which reduces the amount of calculation and improves efficiency. Document mapping is then performed based on the similarity matching result to determine a high-precision document query result. This technical scheme solves the technical problem of low accuracy and efficiency of electronic document screening in the prior art, enables high-precision, dynamic and diversified querying of electronic documents, and achieves the technical effect of improving the accuracy and efficiency of electronic document screening.

The foregoing description is only an overview of the technical scheme of the present application, given so that the technical means of the application can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the application are more readily apparent.

Drawings

FIG. 1 is a schematic diagram of a possible flow of an electronic document screening and querying method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a possible flow for obtaining a similarity matrix in an electronic document screening and querying method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a process for determining a possible document query result in an electronic document screening query method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a possible structure of an electronic document screening query system according to an embodiment of the present application.

Reference numerals: search domain determining module 11, search database construction module 12, keyword matrix module 13, similarity matrix module 14, keyword matching result module 15, similarity matching result module 16, and document query result module 17.

Detailed Description

The technical scheme provided by the application has the following overall thought:

The embodiments of the application provide an electronic document screening and query method and system. A cross-domain search database is constructed, and a keyword matrix and a similarity algorithm are used to perform matching step by step and to generate a similarity matrix and a query result. A threshold and priorities are set to improve precision and efficiency, ultimately achieving high-precision, dynamic and diversified querying of electronic documents and optimized document screening.

Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described in detail with reference to the accompanying drawings.

Example 1

As shown in fig. 1, an embodiment of the present application provides a method for screening and querying an electronic document, where the method includes:

Step S100: connecting to a service management system, and determining a search domain;

Specifically, the service management system is a system platform that stores and manages electronic documents. The search domain is the range within which electronic documents are searched; it is set according to the query requirement, for example by entering the document ID range, path keywords, or metadata attributes to be searched. Different service management systems, such as an OA system, an ERP system, or a CRM system, are connected, and the range of electronic document data in each system is obtained through technical means such as system interfaces, including Web services and remote calls. This determines the set of electronic documents to be searched and queried, and provides the data source and query range for subsequent steps. By connecting the service management system and determining the search domain, electronic document resources are acquired and the query target is determined, which lays the foundation for constructing a cross-domain search database and querying it efficiently.
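As an illustration of how a search domain might be expressed, the following minimal sketch models the three kinds of criteria named above (document ID range, path keyword, metadata attributes). The class name `RetrievalDomain`, its fields, and the sample values are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class RetrievalDomain:
    """Hypothetical search domain: every criterion that is set must hold."""
    id_range: Optional[Tuple[int, int]] = None      # inclusive document ID range
    path_keyword: Optional[str] = None              # substring required in the path
    metadata: Dict[str, str] = field(default_factory=dict)  # required attributes

    def contains(self, doc_id: int, path: str, attrs: Dict[str, str]) -> bool:
        """Return True if the document falls inside the search domain."""
        if self.id_range is not None:
            low, high = self.id_range
            if not (low <= doc_id <= high):
                return False
        if self.path_keyword is not None and self.path_keyword not in path:
            return False
        return all(attrs.get(key) == value for key, value in self.metadata.items())


domain = RetrievalDomain(id_range=(100, 199),
                         path_keyword="/inspection/",
                         metadata={"system": "OA"})
inside = domain.contains(150, "/oa/inspection/report.pdf", {"system": "OA"})
outside = domain.contains(250, "/oa/inspection/report.pdf", {"system": "OA"})
```

Documents pulled from each connected system would be filtered through `contains` before entering the target document set.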

Step S200: acquiring a target document set based on the search domain, and constructing a search database by cross-domain association, wherein the search database is composed of a plurality of search database sub-databases, each search database sub-database is different in data type, and the search database is updated in real time;

specifically, based on the determined search domain, a target document set is obtained by interfacing with a service system interface or through directory browsing, and the target document set is a set of all electronic documents in the search domain to be searched and queried. The cross-domain association construction of the retrieval database refers to the construction of a unified retrieval database platform by associating target document sets of different service systems and data sources. The database comprises a plurality of retrieval data sub-databases from different domains, and data type differences, such as differences of file formats and metadata expression modes, exist among the sub-databases.

The search database is constructed by adopting the technologies of document management, document index engine and the like. The target document set needs to be preprocessed such as format conversion, metadata extraction and the like, so that the target document set can be efficiently queried under the same retrieval platform. The real-time updating of the search database is to update the data in the database along with the updating of the service system so as to ensure the timeliness of the query result. The sub-libraries of each search database are built according to a service system, a document type and the like, and a unified document analysis and index building mode is adopted when the sub-libraries are built, and incremental updating is supported, so that document change of the service system is synchronized in time.

By constructing a cross-domain associated search database, the centralized management and unified search of heterogeneous electronic documents are realized, the problem of data islanding is solved, the database is logically integrated, the real-time performance of the query result is ensured by incremental update, and a technical foundation is provided for subsequent intelligent recommendation and efficient query.
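The structure described in step S200 can be sketched as a logical database made of typed sub-databases with incremental updates. This is a minimal in-memory sketch only; the class names, the `upsert` and `remove` methods, and the lower-casing used as a stand-in for preprocessing are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SubDatabase:
    """One search data sub-database, holding documents of a single data type."""
    data_type: str
    documents: Dict[str, str] = field(default_factory=dict)  # doc_id -> normalized text


@dataclass
class SearchDatabase:
    """Logical union of sub-databases built from several service systems."""
    sub_databases: Dict[str, SubDatabase] = field(default_factory=dict)

    def upsert(self, data_type: str, doc_id: str, text: str) -> None:
        """Incremental update: add a new document or overwrite a changed one."""
        sub = self.sub_databases.setdefault(data_type, SubDatabase(data_type))
        # Stand-in for real preprocessing (format conversion, metadata extraction).
        sub.documents[doc_id] = text.strip().lower()

    def remove(self, data_type: str, doc_id: str) -> None:
        """Drop a document that was deleted in the source system."""
        sub = self.sub_databases.get(data_type)
        if sub is not None:
            sub.documents.pop(doc_id, None)


db = SearchDatabase()
db.upsert("pdf", "oa-001", "Concrete strength inspection report ")
db.upsert("spreadsheet", "erp-007", "Steel bar purchase ledger")
db.upsert("pdf", "oa-001", "Concrete strength inspection report, revised")  # changed document
```

Keying sub-databases by data type keeps each one homogeneous, so a type-specific matching algorithm can later be chosen per sub-database.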

Step S300: determining a primary keyword set and a secondary keyword set based on query requirements, and configuring a keyword matrix, wherein the secondary keyword set is obtained by diversification processing of the primary keyword set;

Specifically, based on the querier's specific query requirement, the query conditions are parsed using techniques such as query analysis to obtain the main concept words and key terms as primary keywords; secondary keywords are then generated from the primary keywords using an expanded vocabulary, a semantic network, and the like. The primary keyword set represents the subject and focus of the query requirement, while the secondary keyword set represents terms related or semantically close to it. Configuring the keyword matrix means constructing a two-dimensional matrix in which the rows represent primary keywords, the columns represent secondary keywords, and the values can be initialized with co-occurrence counts or relatedness. Each cell of the matrix records the co-occurrence frequency or relatedness of a primary keyword and a secondary keyword in the target document set, and primary and secondary keywords with high relatedness are selected to improve subsequent matching precision. Diversification processing means comprehensively considering the various senses and related words of the query requirement to generate a rich secondary keyword set, for example by expanding the vocabulary from the primary keywords, extracting related words with semantic recognition techniques, or deriving new related words by calculation.

The primary keywords are obtained by parsing the query requirement, and the secondary keywords are obtained through diversification processing. The keyword matrix expresses the strength of association between primary and secondary keywords in a concise form and provides a reference for judging the relevance of keywords to documents. This gives an accurate understanding and rich expression of the query intent and lays the foundation for the subsequent intelligent matching of documents, and thus for accurate querying.
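A keyword matrix initialized with co-occurrence counts, as described for step S300, might look like the sketch below. The function name `cooccurrence_matrix`, the whitespace tokenization, and the sample keywords and documents are assumptions chosen for illustration.

```python
from typing import List


def cooccurrence_matrix(primary: List[str],
                        secondary: List[str],
                        documents: List[str]) -> List[List[int]]:
    """Rows are primary keywords, columns are secondary keywords.

    Cell (i, j) counts the documents in which primary[i] and secondary[j]
    both occur. Keywords are assumed to be single lower-case tokens.
    """
    matrix = [[0] * len(secondary) for _ in primary]
    for text in documents:
        tokens = set(text.lower().split())
        for i, p in enumerate(primary):
            if p not in tokens:
                continue
            for j, s in enumerate(secondary):
                if s in tokens:
                    matrix[i][j] += 1
    return matrix


primary = ["concrete", "steel"]
secondary = ["cement", "rebar", "strength"]
documents = [
    "concrete cement strength test",
    "steel rebar strength report",
    "concrete strength inspection",
]
keyword_matrix = cooccurrence_matrix(primary, secondary, documents)
```

Here the row for "concrete" is `[1, 0, 2]`: it co-occurs once with "cement", never with "rebar", and twice with "strength".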

Step S400: combining a similarity matching algorithm, traversing the plurality of search data sub-databases to match the keyword matrix, and generating a similarity matrix, wherein the occurrence frequency of the keywords is additional generation information;

Specifically, a similarity matching algorithm is used to match each primary and secondary keyword of the keyword matrix against each search data sub-database of the constructed search database, generating a similarity matrix. The similarity matching algorithm uses a vector space model, a bag-of-words model, or the like to calculate the similarity between the primary and secondary keywords of the keyword matrix and the documents of each sub-database. Traversing the sub-databases means performing the matching operation on each sub-database in turn; because the sub-databases differ in data type, a similarity algorithm suited to each type of data must be selected. The similarity matrix is a two-dimensional matrix obtained through retrieval: the rows represent the primary and secondary keywords, the columns represent the documents of the target document set, and each cell records the similarity value between a keyword and a document, indicating the degree of matching. The occurrence frequency of a keyword represents its importance in the document; it is stored as additional information and provides a reference for subsequent result judgment. For example, a vector space model can be used to calculate the similarity between the primary and secondary keywords and the documents; the sub-databases can be matched separately, or the documents can be preprocessed into a unified vector space and matched together; and the occurrence frequency of a keyword can be used as a weighting factor for the similarity.

Using the similarity algorithm achieves intelligent matching between the keyword matrix and massive numbers of documents. The similarity matrix expresses numerically the degree of correlation between the primary and secondary keywords and each document, and supports generation of the subsequent query result. Matching the sub-databases separately allows the most suitable algorithm to be chosen for each sub-database's data characteristics, improving efficiency. Storing the occurrence frequency provides a reference for result judgment and so improves the accuracy of screening queries.
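One way to realize the matching of step S400 is a bag-of-words cosine similarity, sketched below with keyword rows and document columns and with occurrence frequency kept alongside as additional information. Cosine similarity is one possible choice among the models the text mentions; the function names, single-token keywords, and whitespace tokenization are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_matrix(keywords: List[str],
                      documents: List[str]) -> Tuple[List[List[float]], List[List[int]]]:
    """Return (similarity, frequency); rows are keywords, columns are documents.

    Keywords are assumed to be single tokens so that frequency is a plain count.
    """
    doc_bags = [Counter(doc.lower().split()) for doc in documents]
    sim, freq = [], []
    for kw in keywords:
        token = kw.lower()
        kw_bag = Counter([token])
        sim.append([cosine(kw_bag, bag) for bag in doc_bags])
        freq.append([bag[token] for bag in doc_bags])  # additional information
    return sim, freq


sim, freq = similarity_matrix(
    ["concrete", "rebar"],
    ["concrete concrete strength", "rebar tensile test"],
)
```

The frequency matrix is returned separately so that it can later be used as a weighting factor or simply consulted when judging results.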

Step S500: setting a similarity threshold, judging the similarity matrix based on the similarity threshold, and determining a single keyword matching result, wherein a matching success mark is 1, and a matching failure mark is 0;

Specifically, the similarity threshold is the lower limit on the relatedness between the primary and secondary keywords and a document. It is set, through analysis of query logs, interactive feedback, or algorithmic optimization, according to the accuracy requirement of the query and the required recall, and serves as the judgment criterion. For example: a large number of historical query logs are analyzed to summarize the similarity interval of the results users selected under similar query conditions, and the upper limit of that interval is chosen as the threshold; or a relatively low threshold is initialized, matching results are obtained with it and judged by the user, and the threshold is raised step by step until the user is satisfied; or an objective function covering query precision and recall is constructed, and a machine learning algorithm is used to run repeated trials and find the threshold that optimizes the objective function. If the threshold is set too high, too many matches fail, so the results are accurate but recall is low; if it is set too low, recall is high but the results are less accurate.
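The objective-function approach can be sketched with a simple scan over candidate thresholds. The sketch below uses the F1 score as the objective and an exhaustive scan in place of the machine learning algorithm mentioned above; both choices, the function name `best_threshold`, and the labelled sample data are assumptions.

```python
from typing import List, Tuple


def best_threshold(scored: List[Tuple[float, bool]],
                   candidates: List[float]) -> float:
    """Pick the candidate threshold with the highest F1 score.

    `scored` holds (similarity, is_relevant) pairs, for example from query logs.
    """
    def f1(threshold: float) -> float:
        tp = sum(1 for s, rel in scored if s >= threshold and rel)
        fp = sum(1 for s, rel in scored if s >= threshold and not rel)
        fn = sum(1 for s, rel in scored if s < threshold and rel)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    return max(candidates, key=f1)


history = [(0.9, True), (0.8, True), (0.6, False),
           (0.5, True), (0.3, False), (0.1, False)]
chosen = best_threshold(history, [0.2, 0.4, 0.7])
```

With this sample, 0.2 admits two irrelevant results and 0.7 loses one relevant result, so the middle candidate balances precision and recall best.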

After the similarity threshold is obtained, each cell of the similarity matrix is judged to obtain the single keyword matching result. Judging the similarity matrix means comparing the similarity value of each cell with the threshold one by one: a value below the threshold is judged a failed match and is not considered further, and a value that reaches the threshold is judged a successful match. Each result is marked, 1 for success and 0 for failure, giving the single keyword matching result, that is, the matching result of each primary or secondary keyword with each document, which serves as the candidate set for the query result.

Setting the similarity threshold allows the cells of the similarity matrix to be judged accurately. The judgment result is the basis for generating the subsequent similarity matching result and query result.
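The cell-by-cell judgment of step S500 reduces to a comparison per cell. In this minimal sketch the function name `threshold_matches` and the sample values are hypothetical; a value equal to the threshold is treated as a success, following the description above.

```python
from typing import List


def threshold_matches(sim: List[List[float]], threshold: float) -> List[List[int]]:
    """Mark each cell 1 if it reaches the threshold, otherwise 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in sim]


sim_example = [[0.82, 0.10, 0.55],
               [0.40, 0.71, 0.05]]
marks = threshold_matches(sim_example, 0.5)
```

The resulting 0/1 matrix has the same shape as the similarity matrix, so it can be used directly as a mask in the column summation that follows.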

Step S600: summing the similarity matrixes by matrix columns based on the single keyword matching result to generate a similarity matching result, wherein the similarity matching result represents the comprehensive similarity between a matched keyword set and a single document in the search database;

Specifically, based on the single keyword matching result obtained for the primary and secondary keywords and each document, all cells in each column of the similarity matrix are summed to generate the similarity matching result. The single keyword matching result represents, as 1 or 0, whether each primary or secondary keyword matches each document. The similarity matrix records the similarity between each keyword and each document, and each matrix column corresponds to one document in the target document set. Column-by-column summation means summing all cells in each column of the matrix in turn to obtain the total similarity of that column. For example, only the similarities of cells whose single keyword matching result is 1 may be summed; alternatively, the similarity values of all cells may be summed and then multiplied, as a weight, by the proportion of primary and secondary keywords that matched successfully. The similarity matching result consists of the total similarity of each column of the similarity matrix and expresses the comprehensive similarity, that is, the degree of matching, between the primary and secondary keyword set and each document.

Summing the similarities of all cells in each column of the similarity matrix yields the comprehensive matching similarity between the primary and secondary keywords and each document. This expresses their overall degree of correlation as a single number, provides a basis for accurate judgment and ranking of subsequent results, accumulates and comprehensively evaluates the matching results, and improves overall matching precision.
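Both summation variants mentioned for step S600 can be sketched in a few lines. The function names and sample matrices below are hypothetical; the first function sums only matched cells, and the second sums every cell and weights the total by the share of keywords that matched.

```python
from typing import List


def column_scores(sim: List[List[float]], marks: List[List[int]]) -> List[float]:
    """Per document (column), sum the similarities of cells marked 1."""
    n_docs = len(sim[0]) if sim else 0
    return [sum(sim[i][j] * marks[i][j] for i in range(len(sim)))
            for j in range(n_docs)]


def weighted_column_scores(sim: List[List[float]], marks: List[List[int]]) -> List[float]:
    """Per document, sum all similarities and weight by the matched-keyword share."""
    n_keywords = len(sim)
    n_docs = len(sim[0]) if sim else 0
    scores = []
    for j in range(n_docs):
        total = sum(sim[i][j] for i in range(n_keywords))
        matched_share = sum(marks[i][j] for i in range(n_keywords)) / n_keywords
        scores.append(total * matched_share)
    return scores


sim_m = [[0.82, 0.10, 0.55],
         [0.40, 0.71, 0.05]]
marks_m = [[1, 0, 1],
           [0, 1, 0]]
masked = column_scores(sim_m, marks_m)
weighted = weighted_column_scores(sim_m, marks_m)
```

The masked variant ignores sub-threshold noise entirely, while the weighted variant keeps weak evidence but discounts documents matched by few keywords.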

Step S700: and carrying out document mapping based on the similarity matching result to determine a document query result.

Specifically, the similarity matching result records the comprehensive similarity between the master keyword and each document, and identifies the accuracy degree of matching. The document mapping means that the document set is reordered or screened according to the similarity matching result, the document which is most matched with the query requirement is selected, the mapping can be realized by adopting the modes of nearest neighbor ordering, similarity threshold screening and the like, namely, the document set can be subjected to nearest neighbor ordering from large to small according to the similarity, and the first N documents with higher similarity are selected as the result; a similarity selection threshold may be set, and documents whose similarity exceeds the threshold may be selected as a result. The document query result refers to that the first N documents which are most relevant to the query requirement are selected from the mapping results and returned to the user as final results, and the value of N can be set by the user during query or can be determined according to service scene experience.

By remapping the document set, document prioritization and accurate selection based on keyword matrix matching results are achieved. The selected result document is most matched with and related to the query requirement, so that the high-precision, dynamic and diversified query of the electronic document is realized, and the technical effects of improving the screening accuracy and the retrieval efficiency of the electronic document are achieved.
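The two mapping modes described for step S700, ranking with a top-N cut and screening by a selection threshold, can be combined in one small function. The name `map_documents`, its parameters, and the sample scores are assumptions for illustration.

```python
from typing import List, Optional


def map_documents(doc_ids: List[str],
                  scores: List[float],
                  top_n: Optional[int] = None,
                  min_score: Optional[float] = None) -> List[str]:
    """Rank documents by similarity (largest first), then screen and truncate."""
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        # Keep only documents whose similarity exceeds the selection threshold.
        ranked = [pair for pair in ranked if pair[1] > min_score]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [doc_id for doc_id, _ in ranked]


ids = ["a", "b", "c"]
totals = [0.55, 0.82, 0.71]
top_two = map_documents(ids, totals, top_n=2)
above = map_documents(ids, totals, min_score=0.8)
```

Leaving both parameters optional lets the caller use either mode alone or both together, matching the user-set or experience-based choice of N described above.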

Further, the embodiment of the application further comprises:

Step S310: refining a plurality of primary keywords from the query requirement as the primary keyword set;

Step S320: configuring a diversification processing amplitude modulation;

Step S330: performing upper-level processing on the primary keyword set based on the diversification processing amplitude modulation, and determining a first subordinate keyword set;

Step S330: performing lower-level processing on the primary keyword set based on the diversification processing amplitude modulation, and determining a second subordinate keyword set;

Step S340: performing conversion processing on the primary keyword set to determine a third subordinate keyword set;

Step S350: determining the secondary keyword set based on the first subordinate keyword set, the second subordinate keyword set and the third subordinate keyword set, wherein the secondary keyword set carries a main relatedness identifier;

Step S360: constructing the keyword matrix from the primary keyword set and the secondary keyword set, with the keyword sequence as matrix rows and the keyword category as matrix columns.

Specifically, the query requirement is parsed to understand the query intent, and several keywords expressing the query topic and its central terms are extracted as the primary keyword set. The primary keywords express the main subject and semantics of the query requirement; they are correlated with one another and jointly express the query intent. The query requirement may be given as query keywords, a query sentence, or a question entered by the user. Extracting the keywords requires analysis and understanding of the requirement; key terms and core concept words can be extracted through word frequency statistics, word frequency labeling, keyword extraction algorithms, and the like.

The configuration of the diversification processing amplitude modulation means that a ratio or a numerical value is set as an upper limit for generating a secondary keyword set, and the expansion adjustment of the primary keyword set is realized. The generation scale of the secondary keyword set is controlled by diversification processing amplitude modulation, the number and the types of the secondary keywords can be increased by up-regulating, the generation scale is reduced by down-regulating, and the secondary keyword set is determined according to specific application scenes and query types. On the word class or concept level, the upper level words or higher level concepts of the main key words are deduced and searched based on the main key word set, and the main key words are subjected to upper level processing expansion to generate a first subordinate key word set. On the word class or concept level, deducing and searching a hyponym or a low-level concept based on the main keyword set, performing hyponymy processing expansion on the main keyword, and generating a second subordinate keyword set. And generating a third subordinate keyword set by performing conversion processing expansion on the primary keywords in terms of vocabulary or semantics based on part-of-speech conversion, synonym replacement, related word expansion and the like of the primary keywords.

The three generated subordinate keyword sets are combined to form the secondary keyword set, and the relatedness between each secondary keyword and the primary keywords is marked as the main relatedness identifier. The main relatedness identifier represents the strength of association between a secondary keyword and the primary keyword set: the first subordinate keyword set has the highest main relatedness, and the third subordinate keyword set has the lowest.

The keyword sequence combines the primary keywords with the secondary keywords and represents the expanded understanding of the query requirement. The keyword category is a classification feature of the document set and allows different types of data to be handled separately. The keywords of the primary keyword set and the secondary keyword set are concatenated to form the rows of the matrix, and the keyword categories of the document set are taken as the columns, to construct the keyword matrix. The keyword matrix is a two-dimensional matrix whose rows are the keyword sequence and whose columns are the keyword categories; each cell at a row-column intersection represents the matching and association between a keyword in the sequence and a document category.

By setting the amplitude modulation and applying diversification processing, the query requirement is understood and expanded in depth, the matching between the query and the data features is modeled, and support is provided for high-precision intelligent querying.
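The diversification steps can be sketched with small lookup tables standing in for a real thesaurus or semantic network. Everything named here is hypothetical: the tables, the `diversify` function, treating the amplitude modulation as a simple upper limit on the number of secondary keywords, and the relatedness values 0.9, 0.7 and 0.5, which only preserve the stated ordering (first set highest, third set lowest).

```python
from typing import Dict, List, Tuple

# Hypothetical lookup tables; a real system would query a thesaurus or semantic network.
HYPERNYMS: Dict[str, List[str]] = {"concrete": ["building material"]}
HYPONYMS: Dict[str, List[str]] = {"concrete": ["reinforced concrete", "precast concrete"]}
CONVERSIONS: Dict[str, List[str]] = {"concrete": ["cement mix"]}

# Assumed main relatedness per source; only the ordering comes from the text.
RELATEDNESS = {"hypernym": 0.9, "hyponym": 0.7, "conversion": 0.5}


def diversify(primary: List[str], limit: int) -> List[Tuple[str, str, float]]:
    """Build the secondary keyword set as (term, source, main relatedness).

    `limit` plays the role of the amplitude modulation: an upper bound on
    how many secondary keywords are generated.
    """
    secondary = []
    sources = (("hypernym", HYPERNYMS), ("hyponym", HYPONYMS), ("conversion", CONVERSIONS))
    for kind, table in sources:
        for keyword in primary:
            for term in table.get(keyword, []):
                secondary.append((term, kind, RELATEDNESS[kind]))
    # Terms are generated in descending relatedness, so truncation drops the weakest.
    return secondary[:limit]


expanded = diversify(["concrete"], limit=3)
```

Raising `limit` admits more and weaker secondary keywords, and lowering it narrows the expansion, which is the up-regulation and down-regulation described above.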

Further, as shown in fig. 2, the embodiment of the present application further includes:

Step S410: extracting the primary keyword set based on the keyword matrix;

Step S420: traversing the plurality of search data sub-databases, performing similarity matching on the primary keyword set, and determining a one-term similarity matrix;

Step S430: if the one-term similarity matrix is empty, extracting the secondary keyword set, traversing the plurality of search data sub-databases to perform similarity matching, and determining a two-term similarity matrix;

Step S440: if the two-term similarity matrix is empty, traversing the plurality of search data sub-databases based on the primary keyword set to perform semantic recognition, and acquiring a three-term similarity matrix.

Specifically, a main key word set is directly extracted from the constructed key word matrix, and the main key word set is the core of the query requirement. Searching and matching the main key words in a plurality of search data sub-databases to obtain similarity values of each main key word and each document in the data sub-databases to form a similarity matrix, and calculating semantic similarity of the main key words and the documents in the data sub-databases by similarity matching. The method can be realized by adopting a vector space model, a BM25 algorithm, a word embedding model and the like, and a similarity matrix records similarity matching values of the primary key words and all documents in a database.

In cases such as a small search data sub-database or low vocabulary coverage, the one-term similarity matrix is empty. The secondary keyword set is then extracted from the keyword matrix by methods such as column condition filtering and K-Means clustering, the sub-databases are traversed for similarity matching, and the similarity value between each secondary keyword and each document is obtained to form the two-term similarity matrix. The two-term similarity matrix records the similarity matching values between the secondary keywords and all documents in the database.

In cases such as insufficient expanded vocabulary or low vocabulary coverage, the two-term similarity matrix is also empty. A semantic recognition technique is then used to traverse the database based on the primary keyword set and judge whether each primary keyword matches and is associated with each document at the semantic level, forming the three-term similarity matrix. The semantic recognition makes the matching judgment by calculating, for example, the degree of adjacency between the primary keywords and the documents on a semantic network. The three-term similarity matrix records the semantic relevance between the primary keywords and all documents in the database.

By establishing different similarity matrices with different technical means under different conditions, both similarity matching and semantic recognition between the primary keywords and the document set are realized. The generated similarity matrices provide a comprehensive basis, in both quantity and quality, for the subsequent result judgment, which avoids the limitation of a single matching mode and improves query precision.
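The fallback order of steps S420 to S440 can be sketched as a simple cascade. This is an illustrative sketch under stated assumptions: a matrix counts as "empty" when it has no positive entry (the text does not define emptiness), `lexical_match` is a toy substring matcher standing in for BM25 or a vector space model, and `semantic_match` with its `SYNONYMS` table is a hypothetical stand-in for semantic recognition on a semantic network.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
Matcher = Callable[[Sequence[str], Sequence[str]], Matrix]

# Hypothetical synonym table standing in for a semantic network.
SYNONYMS = {"steel bar": ["rebar"]}


def is_empty(matrix: Matrix) -> bool:
    """Assumption: a matrix is 'empty' when it contains no positive entry."""
    return not any(value > 0 for row in matrix for value in row)


def lexical_match(keywords: Sequence[str], documents: Sequence[str]) -> Matrix:
    """Toy similarity: 1.0 if the keyword occurs verbatim in the document."""
    return [
        [1.0 if kw.lower() in doc.lower() else 0.0 for doc in documents]
        for kw in keywords
    ]


def semantic_match(keywords: Sequence[str], documents: Sequence[str]) -> Matrix:
    """Toy semantic recognition: 1.0 if a listed synonym occurs in the document."""
    return [
        [
            1.0 if any(s in doc.lower() for s in SYNONYMS.get(kw, [])) else 0.0
            for doc in documents
        ]
        for kw in keywords
    ]


def cascade_match(
    primary: Sequence[str],
    secondary: Sequence[str],
    documents: Sequence[str],
    lexical: Matcher = lexical_match,
    semantic: Matcher = semantic_match,
) -> Tuple[str, Matrix]:
    """Return the first non-empty matrix together with the stage that produced it."""
    one_term = lexical(primary, documents)  # step S420
    if not is_empty(one_term):
        return "one-term", one_term
    two_term = lexical(secondary, documents)  # step S430
    if not is_empty(two_term):
        return "two-term", two_term
    return "three-term", semantic(primary, documents)  # step S440


docs = ["Rebar tensile test report", "Pile foundation survey"]
stage, matrix = cascade_match(["steel bar"], ["reinforcing steel"], docs)
```

Passing the matchers as parameters keeps the cascade independent of the concrete similarity algorithm, so a BM25 or embedding-based matcher could be substituted without changing the control flow.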

Further, the embodiment of the application further comprises:

the similarity matrix calculation formula is:

；

wherein α represents the keyword occurrence frequency, β represents the main relevance, A is the keyword matrix to be matched and is a column matrix, B is the document matrix in the search database and is a row matrix, the element in row M and column N of the similarity matrix represents the similarity between the M-th keyword and the N-th document, M and N are magnitudes representing the number of keywords and the number of documents, and β = 1 for the one-term similarity matrix.

Specifically, after the primary keyword set and the secondary keyword set are determined and each search data sub-database is constructed, a document matrix B is built for each sub-database. B represents the classified and organized document set in that sub-database; its number of rows corresponds to the number of documents and its number of columns to the feature dimension of the documents. The meaning of the column matrix A differs with the matching condition: when the one-term similarity matrix is constructed, A is the primary keyword matrix built from the primary keyword set; when the two-term similarity matrix is constructed, A is the secondary keyword matrix built from the secondary keyword set; when the three-term similarity matrix is constructed, A is a keyword matrix built from the primary and secondary keyword sets. M and N are the number of keywords and the number of documents, respectively, and characterize the sizes of matrices A and B.

α represents the keyword occurrence frequency and measures the importance of a keyword: the higher the frequency, the greater the importance. β represents the main relevance and measures the strength of association between a secondary keyword and its corresponding primary keyword: the larger the main relevance, the closer the association between the keyword and the query requirement. Because the one-term similarity matrix is constructed from the primary keyword set, and the primary keywords have the greatest strength of association with the query topic, the main relevance is β = 1. α and β are set according to the keywords themselves and their relevance to the query topic.

The similarity matrix calculation formula takes the form of vector multiplication: each column vector of matrix A is multiplied with each row vector of matrix B, and the resulting product represents the matching score of the corresponding keyword and document. The matching score divided by the product of the vector lengths yields the similarity value, and these values form the similarity matrix.

A matching result matrix is obtained by applying the similarity matrix calculation formula directly to the two matrices to be matched, which enables high-precision query and improves the efficiency of electronic document screening and query.
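One way to read the calculation described above is as a weighted cosine similarity. The sketch below assumes exactly that: each similarity value is the dot product of a keyword vector and a document vector divided by the product of their lengths, scaled by α and β as per-keyword multiplicative weights. The placement of α and β, the per-keyword weighting, and the function name `similarity_matrix` are assumptions for illustration; the formula itself is not reproduced in this text.

```python
import math
from typing import List, Sequence


def similarity_matrix(
    keyword_vecs: Sequence[Sequence[float]],
    doc_vecs: Sequence[Sequence[float]],
    alpha: Sequence[float],
    beta: Sequence[float],
) -> List[List[float]]:
    """Return an M x N matrix of weighted cosine similarities.

    Assumed form: s[i][j] = alpha[i] * beta[i] * cos(keyword_vecs[i], doc_vecs[j]).
    """
    result = []
    for kw_vec, a, b in zip(keyword_vecs, alpha, beta):
        kw_norm = math.sqrt(sum(x * x for x in kw_vec))
        row = []
        for doc_vec in doc_vecs:
            doc_norm = math.sqrt(sum(x * x for x in doc_vec))
            # Matching score: dot product of the keyword and document vectors.
            score = sum(x * y for x, y in zip(kw_vec, doc_vec))
            denom = kw_norm * doc_norm
            # Guard against zero-length vectors instead of dividing by zero.
            row.append(a * b * score / denom if denom else 0.0)
        result.append(row)
    return result


# Two keywords, two documents, two feature dimensions (made-up numbers).
S = similarity_matrix(
    keyword_vecs=[[1.0, 0.0], [1.0, 1.0]],
    doc_vecs=[[1.0, 0.0], [0.0, 2.0]],
    alpha=[1.0, 0.5],
    beta=[1.0, 1.0],  # beta = 1 for primary keywords, as stated in the text
)
```

With β fixed at 1 for every row, the sketch corresponds to the one-term case; for the two-term case, `beta` would carry the main relevance of each secondary keyword.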

Further, the embodiment of the application further comprises:

the matrix summation formula is:

$$S_j = \sum_{i=1}^{M} s_{ij}$$

wherein $S_j$ is the similarity matching result of the M keywords and the j-th document, $s_{ij}$ represents the similarity between the i-th keyword and the j-th document, i = 1, …, M, and j = 1, …, N.

Specifically, after the similarity matrix is obtained, the values in each column of the similarity matrix are summed to obtain a vector of column sums. Each column sum represents the degree of matching between the keyword set and the document of that column: the sum of the j-th column represents the total matching degree between the keyword set and the j-th document, where i = 1, …, M and j = 1, …, N. The larger the column sum, the higher the matching degree between the corresponding document and the keyword set, and the greater the importance of the document.

The total matching degree of the keyword set and each document is obtained by summing the values of each column of the similarity matrix, so that support is provided for realizing importance judgment and sorting of the documents, and a basis is provided for selecting the document which is most matched with the keywords according to the matching degree.
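The column summation, together with the 1/0 single keyword matching result produced by the similarity threshold, can be sketched as below. The comparison `>=`, the use of the 1/0 flags as a mask on the sum, and the names `threshold_flags` and `column_sums` are assumptions for illustration; the text states only that the summation is based on the single keyword matching result.

```python
from typing import List, Optional, Sequence


def threshold_flags(sim: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Mark each keyword/document pair 1 (match) or 0 (no match)."""
    return [[1 if value >= threshold else 0 for value in row] for row in sim]


def column_sums(
    sim: Sequence[Sequence[float]],
    flags: Optional[Sequence[Sequence[int]]] = None,
) -> List[float]:
    """Sum each column of the similarity matrix.

    When flags are given, only entries flagged 1 contribute to the sum
    (an assumed reading of "based on the single keyword matching result").
    """
    n_cols = len(sim[0]) if sim else 0
    totals = [0.0] * n_cols
    for i, row in enumerate(sim):
        for j, value in enumerate(row):
            if flags is None or flags[i][j] == 1:
                totals[j] += value
    return totals


sim = [[0.9, 0.2], [0.4, 0.6]]        # made-up similarity values
flags = threshold_flags(sim, 0.5)     # single keyword matching result
plain = column_sums(sim)              # unmasked column sums
masked = column_sums(sim, flags)      # only matches above the threshold count
```

The masked variant drops weak matches before they can accumulate, which is one plausible reason for applying the threshold ahead of the summation.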

Further, the embodiment of the application further comprises:

step S810: configuring multi-source data processing rules;

step S820: based on the data processing rules, performing rule matching and data preprocessing on the plurality of search data sub-databases to determine a preprocessing database;

step S830: executing the matching of the keyword matrix based on the preprocessing database.

Specifically, data processing rules are configured for each data source according to its characteristics, taking into account factors such as source data format, quality, and classification granularity; the rules guide data extraction, cleaning, conversion, and screening. The multi-source data processing rules need to be configured from the macroscopic level down to the microscopic level, covering within-source normalization, cross-source normalization, and mutual coordination, with attention to the characteristic elements between and within data sources. Manual judgment is combined with technical algorithms such as the Apriori algorithm, the Word2Vec algorithm, and a Bayesian classifier.

The preprocessing is executed on the different data sources, namely the plurality of search data sub-databases, according to the configured rules; it includes data cleaning, extraction, conversion, and screening and yields a structured preprocessing database. The preprocessing database captures the characteristics of the data sources, and its data structure meets the requirements of the subsequent matrix matching operation. A document matrix is constructed from the preprocessing database, a keyword matrix is constructed from the keyword set, and the two matrices are multiplied to obtain the matching result between the keywords and the database.

By configuring data processing rules, the different data sources are preprocessed and their data extracted to form a structured database that meets the matrix matching requirement; multiplication matching is then performed between this database and the matrix constructed from the keywords. This achieves rapid and accurate matching between the keywords and massive data sources: the data sources are simplified and standardized, and document matching is carried out through matrices, which improves both the efficiency of document matching and the accuracy of the final result.
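The rule-driven preprocessing of steps S810 to S830 can be sketched as a per-source pipeline of cleaning functions. Everything concrete here is hypothetical: the sub-database names `html_reports` and `plain_records`, the three sample rules, and the decision to drop documents that are empty after cleaning are illustrative choices, not rules taken from the patent.

```python
import re
from typing import Callable, Dict, List

Rule = Callable[[str], str]


def strip_markup(text: str) -> str:
    """Cleaning rule: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def normalize_whitespace(text: str) -> str:
    """Conversion rule: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def lowercase(text: str) -> str:
    """Conversion rule: case-fold so matching is case-insensitive."""
    return text.lower()


# Hypothetical rule configuration: one ordered pipeline per sub-database.
RULES: Dict[str, List[Rule]] = {
    "html_reports": [strip_markup, normalize_whitespace, lowercase],
    "plain_records": [normalize_whitespace, lowercase],
}


def preprocess(
    sub_databases: Dict[str, List[str]],
    rules: Dict[str, List[Rule]] = RULES,
) -> Dict[str, List[str]]:
    """Apply each sub-database's rule pipeline and screen out empty documents."""
    cleaned: Dict[str, List[str]] = {}
    for name, docs in sub_databases.items():
        # Sources without a configured pipeline get whitespace cleanup only.
        pipeline = rules.get(name, [normalize_whitespace])
        kept = []
        for doc in docs:
            for rule in pipeline:
                doc = rule(doc)
            if doc:  # screening: drop documents that are empty after cleaning
                kept.append(doc)
        cleaned[name] = kept
    return cleaned


result = preprocess({
    "html_reports": ["<p>Pile  Test</p>", "<br>"],
    "plain_records": ["  Rebar\tLog "],
})
```

Keeping the rules as data (a mapping from source name to an ordered list of functions) makes it straightforward to add a source or reorder steps without touching the loop that applies them.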

Further, as shown in fig. 3, the embodiment of the present application further includes:

step S710: sorting the similarity matching result to generate a similarity sequence, wherein the similarity sequence is arranged from large to small;

step S720: acquiring the number of query requirement items;

step S730: truncating the similarity sequence based on the number of query requirement items, mapping the retained entries back to their documents, and integrating the documents as a query document set;

step S740: based on the set of query documents, the document query results are determined, the document query results having document priorities.

Specifically, the similarity matching result gives the similarity score between the keywords and each document. The similarity scores of all documents are sorted from large to small to generate the similarity sequence; each element in the sequence corresponds to one document, and its value represents the similarity between the keywords and that document. The number N of documents required by the query is the number of documents the user needs to retrieve; N can be set by the user at query time or taken from a system default. The first N elements are selected from the head of the similarity sequence, and their corresponding documents form the query document set, which contains the N documents most relevant to the keywords. The priority of each document in the query document set is determined by its position in the similarity sequence: the earlier the position, the higher the priority.

By truncating the similarity sequence generated from the similarity matching result, a document set with a high degree of matching to the keywords is obtained and the document query result and its order are determined, so that the most relevant document information is provided according to the query requirement and the efficiency and accuracy of electronic document screening are improved.
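Steps S710 to S740 amount to a descending sort followed by a top-N cut. A minimal sketch follows, assuming the similarity matching result is available as a mapping from a document identifier to its column sum; the identifiers, the function name `top_n_documents`, and the 1-based priority are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def top_n_documents(scores: Dict[str, float], n: int) -> List[Tuple[int, str, float]]:
    """Return (priority, document_id, score) for the n best-matching documents.

    Priority 1 is the highest. Documents with equal scores keep their
    original relative order because Python's sort is stable.
    """
    # Step S710: arrange the similarity values from large to small.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Steps S730 and S740: keep the first n entries and attach priorities.
    return [
        (priority, doc_id, score)
        for priority, (doc_id, score) in enumerate(ranked[:n], start=1)
    ]


# Made-up scores for three hypothetical documents.
query_result = top_n_documents({"doc-a": 0.4, "doc-b": 1.3, "doc-c": 0.9}, n=2)
```

If `n` exceeds the number of scored documents, the slice simply returns every document, so no separate bounds check is needed.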

In summary, the method for screening and querying the electronic document provided by the embodiment of the application has the following technical effects:

A service management system is connected and a search domain is determined, providing a data basis for the subsequent construction of the search database and for querying. A target document set is acquired based on the search domain and a search database is constructed by cross-domain association, wherein the search database is composed of a plurality of search database sub-databases that differ in data type and the search database is updated in real time; this provides a unified search platform for large-scale documents, improves search efficiency, and supports dynamic query through real-time updating. Based on the query requirement, a primary keyword set and a secondary keyword set are determined and a keyword matrix is configured, wherein the secondary keyword set is obtained by diversification processing of the primary keyword set; the diversification processing yields rich primary and secondary keywords and supports diversified query requirements. In combination with a similarity matching algorithm, the plurality of search data sub-databases are traversed to match the keyword matrix and generate a similarity matrix, wherein the keyword occurrence frequency is additional generation information; taking the keyword occurrence frequency into account enables high-precision matching. A similarity threshold is set, the similarity matrix is judged based on the similarity threshold, and a single keyword matching result is determined, so that an accurate matching result between each keyword and each document is determined, only highly relevant matches are retained, and query accuracy is improved. Based on the single keyword matching result, the similarity matrix is summed column by column to generate a similarity matching result, which represents the comprehensive similarity between the matched keyword set and a single document in the search database, so that the amount of calculation is reduced and efficiency is improved. Document mapping is performed based on the similarity matching result and a document query result is determined, thereby realizing high-precision, dynamic, and diversified query of electronic documents and achieving the technical effect of improving the accuracy and efficiency of electronic document screening.

Example two

Based on the same inventive concept as the method for screening and querying electronic documents in the foregoing embodiments, as shown in fig. 4, an embodiment of the present application provides an electronic document screening and querying system, which is characterized in that the system includes:

a search domain determining module 11, configured to connect to a service management system, and determine a search domain;

a search database construction module 12, configured to acquire a target document set based on the search domain and construct a search database by cross-domain association, wherein the search database is composed of a plurality of search database sub-databases, the search database sub-databases differ in data type, and the search database is updated in real time;

a keyword matrix module 13, configured to determine a primary keyword set and a secondary keyword set based on a query requirement, and configure a keyword matrix, where the secondary keyword set is obtained by diversification processing of the primary keyword set;

a similarity matrix module 14, configured to, in combination with a similarity matching algorithm, traverse the plurality of search data sub-databases to match the keyword matrix and generate a similarity matrix, where the keyword occurrence frequency is additional generation information;

a keyword matching result module 15, configured to set a similarity threshold, judge the similarity matrix based on the similarity threshold, and determine a single keyword matching result, where a matching success identifier is 1 and a matching failure identifier is 0;

a similarity matching result module 16, configured to sum the similarity matrix column by column based on the single keyword matching result to generate a similarity matching result, wherein the similarity matching result represents the comprehensive similarity between the matched keyword set and a single document in the search database;

a document query result module 17, configured to perform document mapping based on the similarity matching result to determine a document query result.

Further, the embodiment of the application further comprises:

a primary keyword set module, configured to refine a plurality of primary keywords as the primary keyword set based on the query requirement;

the diversification processing amplitude modulation module is used for configuring diversification processing amplitude modulation;

the first subordinate keyword set module is used for performing upper level processing on the main keyword set based on the diversification processing amplitude modulation to determine a first subordinate keyword set;

the second subordinate keyword set module is used for performing lower-level processing on the main keyword set based on the diversification processing amplitude modulation to determine a second subordinate keyword set;

the third subordinate keyword set module is used for carrying out conversion processing on the main keyword set and determining a third subordinate keyword set;

a secondary keyword set module, configured to determine the secondary keyword set based on the first subordinate keyword set, the second subordinate keyword set, and the third subordinate keyword set, where the secondary keyword set has a main relevance identifier;

a keyword matrix building module, configured to take the keyword sequence as matrix rows and the keyword categories as matrix columns, and build the keyword matrix based on the primary keyword set and the secondary keyword set.

Further, the embodiment of the application further comprises:

a primary keyword set extraction module, configured to extract the primary keyword set based on the keyword matrix;

a one-term similarity matrix module, configured to traverse the plurality of search data sub-databases, perform similarity matching on the primary keyword set, and determine a one-term similarity matrix;

the two-term similarity matrix module is used for extracting the secondary keyword set and traversing the plurality of search data sub-databases to carry out similarity matching if the one-term similarity matrix is empty, and determining the two-term similarity matrix;

and the three-term similarity matrix module is used for traversing the plurality of search data sub-databases to perform semantic recognition based on the primary key word set if the two-term similarity matrix is empty, and acquiring the three-term similarity matrix.

Further, the embodiment of the application further comprises:

the matrix calculation formula module is used for performing the similarity matrix calculation, and the formula is as follows:

；

wherein α characterizes the keyword occurrence frequency, β characterizes the main relevance, A is the keyword matrix to be matched and is a column matrix, B is the document matrix in the search database and is a row matrix, the element in row M and column N of the similarity matrix represents the similarity between the M-th keyword and the N-th document, M and N are magnitudes representing the number of keywords and the number of documents, and β = 1 for the one-term similarity matrix.

Further, the embodiment of the application further comprises:

the matrix summation module is used for performing matrix summation, and the formula is as follows:

$$S_j = \sum_{i=1}^{M} s_{ij}$$

wherein $S_j$ is the similarity matching result of the M keywords and the j-th document, $s_{ij}$ represents the similarity between the i-th keyword and the j-th document, i = 1, …, M, and j = 1, …, N.

Further, the embodiment of the application further comprises:

the multi-source data processing rule module is used for configuring multi-source data processing rules;

the preprocessing database module is used for executing rule matching and data preprocessing on the plurality of search database sub-databases based on the data processing rules to determine a preprocessing database;

and the matching execution module is used for executing the matching of the keyword matrix based on the preprocessing database.

Further, the embodiment of the application further comprises:

the similarity sequence module is used for sorting the similarity matching result to generate a similarity sequence, and the similarity sequence is arranged from large to small;

the query requirement item number module is used for acquiring the number of query requirement items;

the query document set module is used for truncating the similarity sequence based on the number of query requirement items, mapping the retained entries back to their documents, and integrating the documents as a query document set;

and the document query result module is used for determining the document query result based on the query document set, wherein the document query result has the document priority.

Any of the steps of the methods described above may be stored as computer instructions or programs in a computer memory, without limitation, and may be invoked by a computer processor, without limitation, to implement any of the methods of the embodiments of the present application.

Further, the first or second element may not only represent a sequential relationship, but may also represent a particular concept, and/or may be selected individually or in whole among a plurality of elements. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, the present application is intended to include such modifications and alterations insofar as they come within the scope of the application or the equivalents thereof.

Claims

1. An electronic document screening and querying method, comprising:

connecting a service management system, and determining a retrieval domain;

acquiring a target document set based on the search domain, and constructing a search database by cross-domain association, wherein the search database is composed of a plurality of search database sub-databases, each search database sub-database is different in data type, and the search database is updated in real time;

determining a primary keyword set and a secondary keyword set based on query requirements, and configuring a keyword matrix, wherein the secondary keyword set is obtained by diversification processing of the primary keyword set;

the method for configuring the keyword matrix comprises the following steps:

refining a plurality of primary keywords as the set of primary keywords based on the query requirement;

configuring diversification processing amplitude modulation;

performing upper level processing on the primary keyword set based on the diversification processing amplitude modulation, and determining a first subordinate keyword set;

performing lower level processing on the primary keyword set based on the diversification processing amplitude modulation, and determining a second subordinate keyword set;

performing conversion processing on the primary keyword set to determine a third subordinate keyword set;

determining the secondary keyword set based on the first subordinate keyword set, the second subordinate keyword set and the third subordinate keyword set, wherein the secondary keyword set is provided with a main relevance identifier;

setting a keyword sequence as a matrix row, setting a keyword category as a matrix column, and setting up the keyword matrix based on the primary keyword set and the secondary keyword set;

combining a similarity matching algorithm, traversing the plurality of search data sub-databases to match the keyword matrix, and generating a similarity matrix, wherein the occurrence frequency of the keywords is additional generation information;

the method for generating the similarity matrix comprises the following steps:

extracting the primary keyword set based on the keyword matrix;

traversing the plurality of search data sub-databases, performing similarity matching on the primary keyword set, and determining a one-term similarity matrix;

if the one-term similarity matrix is empty, extracting the secondary keyword set, traversing the plurality of search data sub-databases to perform similarity matching, and determining a two-term similarity matrix;

if the two-term similarity matrix is empty, traversing the plurality of search data sub-databases to perform semantic recognition based on the primary keyword set, and acquiring a three-term similarity matrix;

wherein the similarity matrix calculation formula is:

；

wherein α characterizes the keyword occurrence frequency, β characterizes the main relevance, A is the keyword matrix to be matched and is a column matrix, B is the document matrix in the search database and is a row matrix, the element in row i and column j of the similarity matrix represents the similarity between the i-th keyword and the j-th document, M and N are magnitudes characterizing the number of keywords and the number of documents, and β = 1 for the one-term similarity matrix;

setting a similarity threshold, judging the similarity matrix based on the similarity threshold, and determining a single keyword matching result, wherein a matching success mark is 1, and a matching failure mark is 0;

summing the similarity matrix column by column based on the single keyword matching result to generate a similarity matching result, wherein the similarity matching result represents the comprehensive similarity between the matched keyword set and a single document in the search database;

and carrying out document mapping based on the similarity matching result to determine a document query result.

2. The method of claim 1, wherein the matrix summation formula is:

$$S_j = \sum_{i=1}^{M} s_{ij}$$

wherein $S_j$ is the similarity matching result of the M keywords and the j-th document, and $s_{ij}$ represents the similarity between the i-th keyword and the j-th document.

3. The method of claim 1, wherein prior to matching the keyword matrix in the plurality of search databases, the method comprises:

configuring multi-source data processing rules;

based on the data processing rules, performing rule matching and data preprocessing on the plurality of search data sub-databases to determine a preprocessing database;

and executing the matching of the keyword matrix based on the preprocessing database.

4. The method of claim 1, wherein determining the document query result comprises:

sorting the similarity matching result to generate a similarity sequence, wherein the similarity sequence is arranged from large to small;

acquiring the number of query requirement items;

truncating the similarity sequence based on the number of query requirement items, mapping the retained entries back to their documents, and integrating the documents as a query document set;

based on the set of query documents, the document query results are determined, the document query results having document priorities.

5. An electronic document screening query system, the system comprising:

The search domain determining module is used for connecting a service management system and determining a search domain;

the search database construction module acquires a target document set based on the search domain, and constructs a search database in a cross-domain association mode, wherein the search database is composed of a plurality of search database sub-databases, each search database sub-database is different in data type, and the search database is updated in real time;

the keyword matrix module is used for determining a main keyword set and a secondary keyword set based on query requirements, and configuring a keyword matrix, wherein the secondary keyword set is obtained by the main keyword set diversification;

wherein, the keyword matrix module further comprises:

a keyword matrix building module, which is used for taking a keyword sequence as matrix rows, taking a keyword category as matrix columns, and building the keyword matrix based on the primary keyword set and the secondary keyword set;

the similarity matrix module is used for combining a similarity matching algorithm, traversing the plurality of search data sub-databases to match the keyword matrix, and generating a similarity matrix, wherein the occurrence frequency of the keywords is additional generation information;

wherein, the similarity matrix module further includes:

the three-term similarity matrix module is used for traversing the plurality of search data sub-databases to perform semantic recognition based on the primary key word set if the two-term similarity matrix is empty, so as to acquire the three-term similarity matrix;

wherein the similarity matrix module further performs calculation according to the similarity matrix calculation formula:

；

wherein α characterizes the keyword occurrence frequency, β characterizes the main relevance, A is the keyword matrix to be matched and is a column matrix, B is the document matrix in the search database and is a row matrix, the element in row i and column j of the similarity matrix represents the similarity between the i-th keyword and the j-th document, M and N are magnitudes characterizing the number of keywords and the number of documents, and β = 1 for the one-term similarity matrix;

the keyword matching result module is used for setting a similarity threshold, judging the similarity matrix based on the similarity threshold, and determining a single keyword matching result, wherein the matching success mark is 1, and the matching failure mark is 0;

the similarity matching result module is used for summing the similarity matrix column by column based on the single keyword matching result to generate a similarity matching result, and the similarity matching result represents the comprehensive similarity between the matched keyword set and a single document in the search database;

and the document query result module is used for carrying out document mapping based on the similarity matching result to determine a document query result.