CN115544225A - Digital archive information association retrieval method based on semantics - Google Patents
Digital archive information association retrieval method based on semantics
- Publication number
- CN115544225A (application CN202211047113.8A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- retrieval
- sim
- file
- knowledge
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of information retrieval applications, and particularly relates to a digital archive information association retrieval method based on semantics. The invention adds semantic similarity retrieval on top of traditional keyword retrieval, which widens the retrieval scope and makes the returned data more comprehensive and accurate, thereby addressing the retrieval problems in existing archive management.
Description
Technical Field
The invention belongs to the technical field of information retrieval application, and particularly relates to a digital archive information association retrieval method based on semantics.
Background
With the rapid development of information technology and artificial intelligence, society has moved from the industrial era into the digital era. In the digital era, the conversion of traditional information resources into digital form is growing explosively, and digital archive information, as a special type of digital information resource, has become the mainstream choice of archive management organizations for describing and recording information about persons and events.
This change in content form changes how archive information is stored and used, and it raises the following problem: how can the required target content be obtained accurately and quickly from massive digital archive information? This places higher demands on retrieval technology for digital archive information.
An analysis of existing retrieval methods for digital archives shows that they are limited in kind: they simply match and query the natural language entered by the user, for example through directory retrieval or content matching retrieval. When faced with massive information, such retrieval is inefficient, incomplete, and inaccurate: either the retrieved content contains a large amount of irrelevant information, or the results are limited to content that contains the keywords, so comprehensive and accurate archive information cannot be provided to users.
Disclosure of Invention
In view of the above technical problems in digital archive retrieval, the invention provides a semantic-based digital archive information association retrieval method that is reasonable in design, simple to carry out, and comprehensive and accurate in retrieval.
To achieve the above object, the invention adopts the following technical solution: a semantic-based digital archive information association retrieval method, characterized by comprising the following steps:
a. first, digitizing the archive information resources;
b. classifying the archive information resources digitized in step a into elements according to key elements of the archives such as events and units, correctly extracting the knowledge related to the archive body, determining the recognized basic vocabulary, and specifying the semantic relations among the knowledge items to construct RDF triples;
c. performing synonym expansion on the RDF triples constructed in step b, and storing the expanded knowledge in the corresponding triples;
d. then, performing keyword association matching between different archive information resources according to the event topic, or nodes extended from the event topic, to form a semantic knowledge graph model;
e. then, reading the keyword information entered by the user during retrieval, and ranking and outputting the resources obtained through semantic analysis and retrieval using the semantic knowledge graph model established in step d;
f. returning the final retrieval result to the user;
in step e, the semantic analysis and retrieval comprises direct matching retrieval and semantic similarity calculation matching retrieval, wherein the semantic similarity calculation matching retrieval obtains, by calculation, the data content that best matches the keywords together with the associated information content corresponding to the query on the retrieved keyword subject, and ranks the whole by relevance and outputs it; the semantic similarity is calculated as:
$$\mathrm{sim}_S(m,n) = \alpha \cdot \mathrm{sim}_A(m,n) + \beta \cdot \mathrm{sim}_C(m,n) + \gamma \cdot \mathrm{sim}_L(m,n)$$
wherein m and n are two different archives; $\alpha$, $\beta$ and $\gamma$ are adjusting parameters with values in the range 0 to 1; $\mathrm{sim}_A(m,n)$ is the attribute correlation between archives m and n; $\mathrm{sim}_C(m,n)$ is the maximum semantic cosine distance between archives m and n; and $\mathrm{sim}_L(m,n)$ is the path distance between archives m and n.
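The weighted combination above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function name `combined_similarity` and the default weights (0.4, 0.4, 0.2) are assumptions chosen only for the example; the source states only that each weight lies between 0 and 1.

```python
def combined_similarity(sim_a: float, sim_c: float, sim_l: float,
                        alpha: float = 0.4, beta: float = 0.4,
                        gamma: float = 0.2) -> float:
    """Weighted sum sim_S = alpha*sim_A + beta*sim_C + gamma*sim_L.

    The default weights are illustrative assumptions, not values from the source.
    """
    for name, weight in (("alpha", alpha), ("beta", beta), ("gamma", gamma)):
        # The source restricts each adjusting parameter to the range 0..1.
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {weight}")
    return alpha * sim_a + beta * sim_c + gamma * sim_l


# Example: attribute similarity 0.5, cosine term 0.8, path term 0.25.
score = combined_similarity(0.5, 0.8, 0.25)
```

Because the three components are combined linearly, raising one weight relative to the others shifts the ranking toward that signal without changing how the components themselves are computed.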
Preferably, in step e, $\mathrm{sim}_A(m,n)$ is calculated as:
where $f(m \cap n)$ is the similarity between the attributes shared by archives m and n, $f(m-n)$ is the number of attributes that archive m contains but archive n does not, and $f(n-m)$ is the number of attributes that archive n contains but archive m does not.
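The source text does not reproduce the formula for the attribute term, only its three ingredients. One common way to combine exactly these ingredients is a Tversky-style ratio, shared / (shared + a·only_m + b·only_n). The sketch below uses that form purely as an assumption; the function name, the weights `a` and `b`, and the example attribute names are hypothetical.

```python
from __future__ import annotations


def attribute_similarity(attrs_m: set[str], attrs_n: set[str],
                         a: float = 1.0, b: float = 1.0) -> float:
    """Tversky-style ratio over attribute sets (an assumed form, see lead-in)."""
    shared = len(attrs_m & attrs_n)   # stands in for f(m ∩ n)
    only_m = len(attrs_m - attrs_n)   # f(m - n): attributes only m has
    only_n = len(attrs_n - attrs_m)   # f(n - m): attributes only n has
    denominator = shared + a * only_m + b * only_n
    # Two archives with no attributes at all are treated as dissimilar.
    return shared / denominator if denominator else 0.0


m_attrs = {"event", "unit", "date", "person"}
n_attrs = {"event", "unit", "location"}
sim_a = attribute_similarity(m_attrs, n_attrs)  # 2 shared, 2 only in m, 1 only in n
```

With `a = b = 1` this reduces to the Jaccard index; other weightings make the measure asymmetric, which may or may not match the intended formula.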
Preferably, in step e, $\mathrm{sim}_C(m,n)$ is calculated as:
$$\mathrm{sim}_C(m,n) = \cos(m,n)$$
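As an illustration of the cosine term, the sketch below computes the cosine of the angle between two term-frequency vectors. How the archives are actually vectorized is not stated in the source, so the bag-of-words representation, the function name, and the sample tokens are all assumptions.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(tokens_m: Iterable[str], tokens_n: Iterable[str]) -> float:
    """Cosine between term-frequency vectors built from two token lists."""
    vec_m, vec_n = Counter(tokens_m), Counter(tokens_n)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(vec_m[t] * vec_n[t] for t in vec_m.keys() & vec_n.keys())
    norm_m = math.sqrt(sum(v * v for v in vec_m.values()))
    norm_n = math.sqrt(sum(v * v for v in vec_n.values()))
    if norm_m == 0.0 or norm_n == 0.0:
        return 0.0  # an empty archive has no direction to compare
    return dot / (norm_m * norm_n)


sim_c = cosine_similarity(["land", "transfer", "contract"],
                          ["land", "contract", "dispute"])
```

Here two of the three terms overlap and every term occurs once, so the value is 2/3. A production system would more likely use embeddings or TF-IDF weights, but the cosine computation itself is the same.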
Preferably, $\mathrm{sim}_L(m,n)$ is calculated as:
wherein length(m, n) is the path distance parameter between archive m and archive n, and the adjusting parameter takes the value 1.
Preferably, the semantic knowledge graph model further comprises an archive knowledge extraction module and a knowledge storage module, wherein the archive knowledge extraction module extracts the atomic information elements of the archives and the RDF triples of the archives.
Preferably, the knowledge storage module stores the semantic knowledge graph model in a Neo4j database.
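To make the storage step concrete, the following sketch turns one triple into a parameterized Cypher `MERGE` statement of the kind Neo4j accepts. It only builds the query text and parameters; it does not connect to a database. The node label `Entity`, the relationship type `RELATED`, and the function name are hypothetical choices, since the source does not describe the graph schema.

```python
from typing import Dict, Tuple


def triple_to_cypher(subject: str, predicate: str, obj: str) -> Tuple[str, Dict[str, str]]:
    """Build a parameterized Cypher statement that upserts one triple.

    Cypher cannot parameterize relationship types, so the predicate is
    stored as a property on a generic relationship (an assumed schema).
    """
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        "MERGE (s)-[:RELATED {predicate: $predicate}]->(o)"
    )
    params = {"subject": subject, "predicate": predicate, "object": obj}
    return query, params


query, params = triple_to_cypher("archive-001", "event", "land transfer")
```

Using `MERGE` rather than `CREATE` keeps the load idempotent, so re-importing the same archive does not duplicate nodes or relationships. Executing the statement would require a Neo4j client library, which is outside this sketch.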
Compared with the prior art, the invention has the following advantages and positive effects:
1. The invention provides a semantic-based digital archive information association retrieval method that adds semantic similarity retrieval on top of traditional keyword retrieval, which widens the retrieval scope, makes the returned data more comprehensive and accurate, and thereby addresses the retrieval problems in existing archive management.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the invention; those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of the semantic-based digital archive information association retrieval method provided in Embodiment 1;
FIG. 2 is a flow chart of the semantic knowledge graph model provided in Embodiment 1.
Detailed Description
So that the above objects, features, and advantages of the invention can be understood more clearly, the invention is further described below with reference to the drawings and embodiments. It should be noted that the embodiments of the present application, and the features of those embodiments, may be combined with each other where no conflict arises.
Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. The invention may, however, be practiced in ways other than those described here, and it is therefore not limited to the specific embodiments disclosed below.
Embodiment 1: as shown in FIG. 1 and FIG. 2, this embodiment aims to provide a retrieval method with a wider retrieval scope and more comprehensive and accurate results. To this end, the semantic-based digital archive information association retrieval method of this embodiment comprises the following steps.
First, the archive information resources are digitized. With the rapid development of informatization, part of the existing archive data is already digital and needs no further processing; the digitization therefore mainly targets paper archives, which are converted by scanning and similar means to facilitate subsequent computation.
Finding the right archive accurately requires some understanding of its content. The archive information resources are therefore classified into elements according to key elements of the archives such as events and units, the knowledge related to the archive body is correctly extracted, the recognized basic vocabulary is determined, and the semantic relations among the knowledge items are specified to construct RDF triples. The RDF triples mainly assign attributes and values to the relevant knowledge and basic vocabulary in the archives, which facilitates later computation.
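The triple construction described above can be pictured with a small data structure. The sketch assumes that element classification has already produced a mapping from element type to value for each archive; the `Triple` type, the `build_triples` helper, and the sample record are hypothetical and only illustrate the subject-predicate-object shape of RDF data.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement in the style of RDF."""
    subject: str
    predicate: str
    obj: str


def build_triples(archive_id: str, elements: Dict[str, str]) -> List[Triple]:
    """Attach each extracted element to its archive as (archive, type, value)."""
    return [Triple(archive_id, key, value) for key, value in elements.items()]


# Hypothetical output of the element-classification step for one archive.
record = {"event": "land transfer", "unit": "county land bureau", "date": "1998-05-12"}
triples = build_triples("archive-001", record)
```

Each triple gives one attribute and its value for the archive, which is the form the later similarity calculations operate on. A real system would typically use IRIs and an RDF library rather than plain strings.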
Retrieving related archives more accurately requires knowing more about them: taking the event as the core and expanding outward to related concepts associates more archive information and forms an association network. Without storage limits, such a network could in theory be extended indefinitely to cover all related information. This embodiment considers the two key factors that affect searching, namely synonyms and the associations between keywords. First, synonym expansion is performed on the RDF triples and the expanded knowledge is stored in the corresponding triples; then keyword association matching between different archive information resources is performed according to the event topic, or nodes extended from it, to form the semantic knowledge graph model.
Entity extraction mainly obtains the atomic information elements of the archive information, including personal names, organization names, geographic locations, dates, and the like. Triple extraction is mainly carried out according to the single rule that holds between elements at writing time, so as to obtain the association relations between atomic information elements and form triple instance data. The knowledge storage module stores the semantic knowledge graph model in a Neo4j database. This yields a complete semantic knowledge graph model.
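Synonym expansion followed by keyword association can be sketched as two small functions: one adds a copy of each triple for every synonym of its value, the other links any two archives that share a value after expansion. The synonym table, archive identifiers, and function names below are hypothetical, and linking on exact shared values is a simplifying assumption about how association matching works.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

TripleT = Tuple[str, str, str]  # (archive id, element type, value)


def expand_with_synonyms(triples: List[TripleT],
                         synonyms: Dict[str, List[str]]) -> List[TripleT]:
    """Return the original triples plus one extra triple per synonym of each value."""
    expanded = list(triples)
    for subject, predicate, value in triples:
        for synonym in synonyms.get(value, []):
            expanded.append((subject, predicate, synonym))
    return expanded


def link_archives(triples: List[TripleT]) -> Set[Tuple[str, str]]:
    """Connect every pair of archives that share at least one value."""
    archives_by_value: Dict[str, Set[str]] = defaultdict(set)
    for subject, _predicate, value in triples:
        archives_by_value[value].add(subject)
    edges: Set[Tuple[str, str]] = set()
    for archives in archives_by_value.values():
        edges.update(combinations(sorted(archives), 2))
    return edges


synonym_table = {"land transfer": ["land conveyance"]}  # hypothetical thesaurus
raw = [
    ("archive-001", "event", "land transfer"),
    ("archive-002", "event", "land conveyance"),
    ("archive-003", "event", "tax audit"),
]
edges = link_archives(expand_with_synonyms(raw, synonym_table))
```

Without the expansion step, archive-001 and archive-002 would share no value and stay unconnected; with it, they are linked through the synonym, which is the effect the paragraph above describes.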
Then the keyword information entered by the user is read, and the resources obtained through semantic analysis and retrieval over the semantic knowledge graph model are ranked and output. The semantic analysis and retrieval comprises direct matching retrieval and semantic similarity calculation matching retrieval. In direct matching, the keywords or their synonyms are matched in the retrieval, and the results are ranked by calculated relevance and output. If the database contains no content matching the search term, semantic expansion is performed: semantic similarity calculation matching retrieval obtains, by calculation, the data content that best matches the keyword together with the associated information content corresponding to the query on the retrieved keyword subject, and the whole is ranked by relevance and output. The semantic similarity is calculated as:
$$\mathrm{sim}_S(m,n) = \alpha \cdot \mathrm{sim}_A(m,n) + \beta \cdot \mathrm{sim}_C(m,n) + \gamma \cdot \mathrm{sim}_L(m,n)$$
wherein m and n are two different archives; $\alpha$, $\beta$ and $\gamma$ are adjusting parameters with values in the range 0 to 1; $\mathrm{sim}_A(m,n)$ is the attribute correlation between archives m and n; $\mathrm{sim}_C(m,n)$ is the maximum semantic cosine distance between archives m and n; and $\mathrm{sim}_L(m,n)$ is the path distance between archives m and n.
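The two-stage retrieval flow (direct matching first, similarity-based matching only when nothing matches) can be sketched as follows. The scoring callable stands in for the combined semantic similarity; the toy word-overlap scorer, the `retrieve` function, and the sample index are assumptions made only so the example runs on its own.

```python
from typing import Callable, Dict, List, Set


def retrieve(keyword: str,
             archives: Dict[str, Set[str]],
             synonyms: Dict[str, List[str]],
             similarity: Callable[[str, Set[str]], float],
             top_k: int = 3) -> List[str]:
    """Direct match on the keyword or its synonyms; otherwise rank by similarity."""
    terms = {keyword, *synonyms.get(keyword, [])}
    direct = [aid for aid, keywords in archives.items() if terms & keywords]
    if direct:
        # Rank direct hits by how many query terms they contain.
        return sorted(direct, key=lambda aid: len(terms & archives[aid]), reverse=True)
    # Fallback: no literal match, so score every archive and keep the best.
    ranked = sorted(archives, key=lambda aid: similarity(keyword, archives[aid]),
                    reverse=True)
    return ranked[:top_k]


def word_overlap(keyword: str, keywords: Set[str]) -> float:
    """Toy stand-in for sim_S: best Jaccard overlap of words with any stored keyword."""
    query_words = set(keyword.split())
    best = 0.0
    for stored in keywords:
        stored_words = set(stored.split())
        union = query_words | stored_words
        if union:
            best = max(best, len(query_words & stored_words) / len(union))
    return best


index = {
    "archive-001": {"land transfer", "1998"},
    "archive-002": {"tax audit"},
    "archive-003": {"land survey report"},
}
direct_hits = retrieve("land transfer", index, {}, word_overlap)
fallback_hits = retrieve("land survey", index, {}, word_overlap, top_k=2)
```

In the second query no archive contains "land survey" literally, so the fallback ranks archive-003 first (two of three words overlap) and archive-001 second.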
$\mathrm{sim}_A(m,n)$ is calculated as:
where $f(m \cap n)$ is the similarity between the attributes shared by archives m and n, $f(m-n)$ is the number of attributes that archive m contains but archive n does not, and $f(n-m)$ is the number of attributes that archive n contains but archive m does not.
$\mathrm{sim}_C(m,n)$ is calculated as:
$$\mathrm{sim}_C(m,n) = \cos(m,n)$$
$\mathrm{sim}_L(m,n)$ is calculated as:
wherein length(m, n) is the path distance parameter for jumping from archive m to archive n, and the adjusting parameter takes the value 1.
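The path term depends on how many hops separate two archives in the knowledge graph. The source text does not reproduce the formula, so the sketch below assumes a common path-based form, c / (length + c), with the adjusting parameter c set to 1 as stated. The adjacency-list graph, the node names, and both function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def path_length(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[int]:
    """Number of edges on a shortest path, found by breadth-first search."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # the two nodes are not connected


def path_similarity(graph: Dict[str, List[str]], m: str, n: str, c: float = 1.0) -> float:
    """Assumed form c / (length + c): 1.0 for the same node, smaller as hops grow."""
    length = path_length(graph, m, n)
    return 0.0 if length is None else c / (length + c)


graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
sim_l = path_similarity(graph, "A", "C")  # two hops: A -> B -> C
```

Under this assumed form, closer archives score higher and unreachable archives score zero, which keeps the term on the same 0 to 1 scale as the other two components.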
Semantic similarity calculation thus widens the retrieval scope so that archives meeting the conditions can be retrieved, making the results more accurate and comprehensive.
Finally, the final retrieval result is returned to the user.
The above is only a preferred embodiment of the invention and does not limit the invention in any other form. Any person skilled in the art may use the technical content disclosed above to make changes or modifications that yield equivalent embodiments. Any simple modification, equivalent change, or adaptation made to the above embodiment in accordance with the technical essence of the invention, without departing from the technical content of the invention, still falls within the protection scope of the technical solution of the invention.
Claims (6)
1. A semantic-based digital archive information association retrieval method, characterized by comprising the following steps:
a. first, digitizing the archive information resources;
b. classifying the archive information resources digitized in step a into elements according to key elements of the archives such as events and units, correctly extracting the knowledge related to the archive body, determining the recognized basic vocabulary, and specifying the semantic relations among the knowledge items to construct RDF triples;
c. performing synonym expansion on the RDF triples constructed in step b, and storing the expanded knowledge in the corresponding triples;
d. then, performing keyword association matching between different archive information resources according to the event topic, or nodes extended from the event topic, to form a semantic knowledge graph model;
e. then, reading the keyword information entered by the user during retrieval, and ranking and outputting the resources obtained through semantic analysis and retrieval using the semantic knowledge graph model established in step d;
f. returning the final retrieval result to the user;
in step e, the semantic analysis and retrieval comprises direct matching retrieval and semantic similarity calculation matching retrieval, wherein the semantic similarity calculation matching retrieval obtains, by calculation, the data content that best matches the keywords together with the associated information content corresponding to the query on the retrieved keyword subject, and ranks the whole by relevance and outputs it; the semantic similarity is calculated as:
$$\mathrm{sim}_S(m,n) = \alpha \cdot \mathrm{sim}_A(m,n) + \beta \cdot \mathrm{sim}_C(m,n) + \gamma \cdot \mathrm{sim}_L(m,n)$$
wherein m and n are two different archives; $\alpha$, $\beta$ and $\gamma$ are adjusting parameters with values in the range 0 to 1; $\mathrm{sim}_A(m,n)$ is the attribute correlation between archives m and n; $\mathrm{sim}_C(m,n)$ is the maximum semantic cosine distance between archives m and n; and $\mathrm{sim}_L(m,n)$ is the path distance between archives m and n.
2. The semantic-based digital archive information association retrieval method of claim 1, wherein in step e, $\mathrm{sim}_A(m,n)$ is calculated as:
where $f(m \cap n)$ is the similarity between the attributes shared by archives m and n, $f(m-n)$ is the number of attributes that archive m contains but archive n does not, and $f(n-m)$ is the number of attributes that archive n contains but archive m does not.
3. The semantic-based digital archive information association retrieval method of claim 1, wherein in step e, $\mathrm{sim}_C(m,n)$ is calculated as:
$$\mathrm{sim}_C(m,n) = \cos(m,n)$$
4. The semantic-based digital archive information association retrieval method of claim 3, wherein $\mathrm{sim}_L(m,n)$ is calculated as:
5. The semantic-based digital archive information association retrieval method of claim 4, wherein the semantic knowledge graph model further comprises an archive knowledge extraction module and a knowledge storage module, wherein the archive knowledge extraction module extracts the atomic information elements of the archives and the RDF triples of the archives.
6. The semantic-based digital archive information association retrieval method of claim 5, wherein the knowledge storage module stores the semantic knowledge graph model in a Neo4j database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211047113.8A CN115544225A (en) | 2022-08-30 | 2022-08-30 | Digital archive information association retrieval method based on semantics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211047113.8A CN115544225A (en) | 2022-08-30 | 2022-08-30 | Digital archive information association retrieval method based on semantics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115544225A (en) | 2022-12-30 |
Family
ID=84724959
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211047113.8A Pending CN115544225A (en) | 2022-08-30 | 2022-08-30 | Digital archive information association retrieval method based on semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115544225A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116756396A (en) * | 2023-06-29 | 2023-09-15 | 广东齐峰信息科技有限公司 | Digital archive management system and method based on knowledge graph |
- 2022-08-30: CN application CN202211047113.8A filed, published as CN115544225A, status: Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116756396A (en) * | 2023-06-29 | 2023-09-15 | 广东齐峰信息科技有限公司 | Digital archive management system and method based on knowledge graph |
CN116756396B (en) * | 2023-06-29 | 2023-12-22 | 广东齐峰信息科技有限公司 | Digital archive management system and method based on knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||