CN115544225A - Digital archive information association retrieval method based on semantics - Google Patents

Publication number
CN115544225A
Authority
CN
China
Prior art keywords
semantic
retrieval
sim
file
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211047113.8A
Other languages
Chinese (zh)
Inventor
冯炫
马林聪
曹豪
潘冬
苗思宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Zhiyin Technology Co ltd
Original Assignee
Shaanxi Zhiyin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Zhiyin Technology Co ltd filed Critical Shaanxi Zhiyin Technology Co ltd
Priority to CN202211047113.8A priority Critical patent/CN115544225A/en
Publication of CN115544225A publication Critical patent/CN115544225A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of information retrieval applications, and in particular relates to a semantics-based digital archive information association retrieval method. The invention adds semantic similarity retrieval on top of traditional keyword retrieval, which widens the retrieval range and makes the returned data more comprehensive and accurate, thereby addressing the technical problems in existing archive management.

Description

Digital archive information association retrieval method based on semantics
Technical Field
The invention belongs to the technical field of information retrieval application, and particularly relates to a digital archive information association retrieval method based on semantics.
Background
With the rapid development of information technology and artificial intelligence, society has moved from the industrial era into the digital era. In the digital era, the conversion of traditional information resources into digital form is growing explosively, and digital archive information, as a special type of digital information resource, has become the mainstream choice for archive management organizations to describe and record information about persons and events.
This change in the form of content has changed how archive information is stored and used, which raises the question: how can the required target content be obtained accurately and quickly from massive digital archive information? This places higher demands on retrieval technology for digital archive information.
An analysis of retrieval methods for digital archives shows that existing methods are limited: they simply match and query the natural language input by the user, for example through directory retrieval or content matching retrieval. When faced with massive information, such retrieval is inefficient, incomplete and inaccurate; that is, the retrieved content contains a large amount of irrelevant information, or the results are limited to content that contains the keywords, so comprehensive and accurate archive information cannot be provided to users.
Disclosure of Invention
To address the above technical problems of digital archive retrieval, the invention provides a semantics-based digital archive information association retrieval method that is reasonable in design, simple, and comprehensive and accurate in retrieval.
To achieve the above object, the technical solution of the invention is a semantics-based digital archive information association retrieval method, characterized by comprising the following steps:
a. first, digitizing the archive information resources;
b. classifying the archive information resources digitized in step a by element, according to key elements of the files such as events and units, correctly extracting the knowledge related to the file body, determining the recognized basic vocabulary, and specifying the semantic relations among the knowledge so as to construct RDF triples;
c. performing synonym expansion on the RDF triples constructed in step b, and storing the expanded knowledge in the corresponding triples;
d. then, performing keyword association matching between different archive information resources according to the event topic or nodes extended from the event topic, so as to form a semantic knowledge graph model;
e. then, reading the keyword information input by a user during retrieval, and, using the semantic knowledge graph model established in step d, ranking and outputting the obtained resources through semantic analysis and retrieval;
f. returning the final retrieval result to the user;
in step e, the semantic analysis and retrieval includes direct matching retrieval and semantic similarity calculation matching retrieval, wherein the semantic similarity calculation matching retrieval computes the data content that best matches the keywords together with the associated information content corresponding to the query on the retrieved keyword subject, and ranks the whole by relevance for output; the semantic similarity is calculated as:
sim_S(m, n) = α·sim_A(m, n) + β·sim_C(m, n) + γ·sim_L(m, n)
wherein m and n are two different files; α, β and γ are adjusting parameters, each with a value in the range 0 to 1; sim_A(m, n) is the attribute correlation between files m and n; sim_C(m, n) is the maximum semantic cosine distance between files m and n; and sim_L(m, n) is the path distance between files m and n.
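The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `combined_similarity` and the default weights are hypothetical, and the source does not state whether α, β and γ must sum to 1, so the sketch only checks that each lies in the stated range.

```python
def combined_similarity(sim_a, sim_c, sim_l, alpha=0.4, beta=0.4, gamma=0.2):
    """Return sim_S = alpha*sim_A + beta*sim_C + gamma*sim_L.

    The default weights are illustrative assumptions, not values from the source.
    """
    for name, weight in (("alpha", alpha), ("beta", beta), ("gamma", gamma)):
        # The text only requires each adjusting parameter to lie between 0 and 1.
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {weight}")
    return alpha * sim_a + beta * sim_c + gamma * sim_l


# Example with made-up component scores for two files m and n.
score = combined_similarity(sim_a=0.5, sim_c=0.8, sim_l=0.25)
```

Raising one weight relative to the others shifts the ranking toward that component, which is presumably why the parameters are described as adjustable.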
Preferably, in step e, sim_A(m, n) is calculated as:
Figure BDA0003822783980000021
where f(m∩n) is the similarity between the attributes shared by files m and n, f(m-n) is the number of attributes that file m contains but file n does not, and f(n-m) is the number of attributes that file n contains but file m does not.
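The formula image for sim_A(m, n) is not reproduced in this text, so its exact form is unknown here. As a sketch only, the three quantities defined above fit a Tversky-style ratio model, which is assumed below; the weights `w_m` and `w_n`, the use of a simple count for the shared-attribute term, and the function name are all assumptions rather than the patent's formula.

```python
def attribute_similarity(attrs_m, attrs_n, w_m=1.0, w_n=1.0):
    """Tversky-style ratio (assumed form, not confirmed by the source).

    attrs_m, attrs_n: sets of attribute names for files m and n.
    """
    shared = len(attrs_m & attrs_n)   # stands in for f(m∩n)
    only_m = len(attrs_m - attrs_n)   # f(m-n): attributes in m but not in n
    only_n = len(attrs_n - attrs_m)   # f(n-m): attributes in n but not in m
    denominator = shared + w_m * only_m + w_n * only_n
    return shared / denominator if denominator else 0.0


# Hypothetical attribute sets for two files.
m_attrs = {"event", "unit", "date"}
n_attrs = {"event", "unit", "person"}
sim_a = attribute_similarity(m_attrs, n_attrs)
```

With both weights at 1 the ratio reduces to the Jaccard index of the two attribute sets.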
Preferably, in step e, sim_C(m, n) is calculated as:
sim_C(m, n) = cos(m, n).
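A minimal sketch of cos(m, n), assuming each file has already been mapped to a numeric feature vector of equal length. The source does not say how those vectors are produced, so the vector inputs and the function name are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors for files m and n.
sim_c = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```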
Preferably, sim_L(m, n) is calculated as:
Figure BDA0003822783980000022
wherein length(m, n) is the path distance parameter between file m and file n, and
Figure BDA0003822783980000023
is an adjusting parameter whose value is 1.
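The formula image for sim_L(m, n) is likewise not reproduced here. The sketch below assumes one common path-based form, theta / (length + theta) with theta = 1, and computes length(m, n) as the number of hops on the shortest path in an undirected graph. Both the formula and the adjacency-dictionary graph representation are assumptions made for illustration.

```python
from collections import deque


def path_length(graph, start, goal):
    """Breadth-first search; returns the hop count or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(length, theta=1.0):
    """Assumed form: similarity decays as the path between files grows."""
    if length is None:
        return 0.0
    return theta / (length + theta)


# Hypothetical graph: files m and n are linked through a shared event node.
graph = {"m": ["event"], "event": ["m", "n"], "n": ["event"]}
hops = path_length(graph, "m", "n")
sim_l = path_similarity(hops)
```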
Preferably, the semantic knowledge graph model further comprises an archive knowledge extraction module and a knowledge storage module, wherein the archive knowledge extraction module performs extraction of the atomic information elements of the archive and extraction of the RDF triples of the archive.
Preferably, the knowledge storage module stores the semantic knowledge graph model in a Neo4j database.
Compared with the prior art, the invention has the following advantages and positive effects:
1. The invention provides a semantics-based digital archive information association retrieval method that adds semantic similarity retrieval on top of traditional keyword retrieval, which widens the retrieval range and makes the provided data more comprehensive and accurate, thereby addressing the technical problems in existing archive management.
Drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of the semantics-based digital archive information association retrieval method provided in embodiment 1;
FIG. 2 is a flow chart of the semantic knowledge graph model provided in embodiment 1.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, the present invention will be further described with reference to the accompanying drawings and examples. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments of the present disclosure.
Embodiment 1. As shown in FIG. 1 and FIG. 2, this embodiment aims to provide a retrieval method with a wider retrieval range and more comprehensive and accurate results. To this end, the semantics-based digital archive information association retrieval method provided by this embodiment comprises the following steps.
First, the archive information resources are digitized. With the rapid development of informatization, part of the existing archive data is already digital and needs no processing; digitization, by scanning and similar means, is therefore applied mainly to paper archives so as to facilitate subsequent computation.
To find the corresponding archive accurately, its content must be understood to some extent. The archive information resources are therefore classified by element according to key elements of the archives such as events and units; the knowledge related to the archive body is correctly extracted, the recognized basic vocabulary is determined, and the semantic relations among the knowledge are specified so as to construct RDF triples. The RDF triples mainly serve to assign attributes and values to the related knowledge and basic vocabulary in the archive, which facilitates later computation.
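As a rough illustration of how attributes and values can be attached to an archive as subject-predicate-object triples, the sketch below uses plain Python tuples rather than an RDF library. The record fields, the archive identifier and the helper name are hypothetical; a real system would presumably use proper RDF identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the archive the statement is about
    predicate: str  # the attribute (key element) being described
    obj: str        # the value of that attribute


def triples_from_record(archive_id, record):
    """Turn one classified archive record into a list of triples."""
    return [Triple(archive_id, key, value) for key, value in record.items()]


# Hypothetical record after element classification by event, unit and date.
record = {
    "event": "bridge opening",
    "unit": "city archive office",
    "date": "1998-05-01",
}
triples = triples_from_record("archive-001", record)
```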
To retrieve related archives more accurately, more must be known about each archive: if an event is taken as the core and its concepts are extended outward, more archive information becomes associated and an association network is formed. Without storage limits, such a network could in theory be extended indefinitely to cover all related information. This embodiment considers two key factors that influence searching, namely synonyms and the associations between keywords. Accordingly, synonym expansion is first performed on the RDF triples and the expanded knowledge is stored in the corresponding triples; keyword association matching between different archive information resources is then performed according to the event topic or nodes extended from the event topic, forming a semantic knowledge graph model. Entity extraction mainly acquires the atomic information elements of the archive information, including names of persons, names of organizations, geographic locations, dates and the like. Triple extraction is performed mainly according to the rules that relate the elements in the written text, so as to obtain the association relations between the atomic information elements and form triple instance data. The knowledge storage module stores the semantic knowledge graph model in a Neo4j database. A complete semantic knowledge graph model is thus formed.
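The synonym expansion and keyword association steps can be sketched as follows. This is an illustration under stated assumptions: the synonym table is a hand-written hypothetical dictionary, and a plain in-memory index stands in for the Neo4j store named in the text.

```python
def expand_synonyms(triples, synonyms):
    """Add one extra triple per synonym of each object value."""
    expanded = list(triples)
    for subject, predicate, obj in triples:
        for alternative in synonyms.get(obj, ()):
            expanded.append((subject, predicate, alternative))
    return expanded


def build_index(triples):
    """Map each object value (keyword node) to the archives that mention it."""
    index = {}
    for subject, _predicate, obj in triples:
        index.setdefault(obj, set()).add(subject)
    return index


# Hypothetical triples and synonym table.
triples = [("A1", "event", "flood"), ("A2", "event", "inundation")]
synonyms = {"flood": ["inundation"]}
index = build_index(expand_synonyms(triples, synonyms))
```

After expansion, archives A1 and A2 share the node "inundation", so a keyword lookup on either term can reach both; this is the association effect the paragraph describes.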
Then, the keyword information input by the user during retrieval is read, and the obtained resources are ranked and output through semantic analysis and retrieval based on the semantic knowledge graph model. The semantic analysis and retrieval includes direct matching retrieval and semantic similarity calculation matching retrieval. In direct matching, the keywords or their synonyms are matched during retrieval, and the results are ranked by computed relevance and output. If the database contains no content matching the search term, semantic expansion is performed: semantic similarity calculation matching retrieval computes the data content that best matches the keywords together with the associated information content corresponding to the query on the retrieved keyword subject, and the whole is ranked by relevance and output. The semantic similarity is calculated as:
sim_S(m, n) = α·sim_A(m, n) + β·sim_C(m, n) + γ·sim_L(m, n)
wherein m and n are two different files; α, β and γ are adjusting parameters, each with a value in the range 0 to 1; sim_A(m, n) is the attribute correlation between files m and n; sim_C(m, n) is the maximum semantic cosine distance between files m and n; and sim_L(m, n) is the path distance between files m and n.
sim_A(m, n) is calculated as:
Figure BDA0003822783980000051
where f(m∩n) is the similarity between the attributes shared by files m and n, f(m-n) is the number of attributes that file m contains but file n does not, and f(n-m) is the number of attributes that file n contains but file m does not.
sim_C(m, n) is calculated as:
sim_C(m, n) = cos(m, n).
sim_L(m, n) is calculated as:
Figure BDA0003822783980000052
wherein length(m, n) is the path distance parameter for jumping from file m to file n, and
Figure BDA0003822783980000053
is an adjusting parameter whose value is 1.
Semantic similarity calculation thus enlarges the retrieval range, so that archives meeting the conditions can be retrieved and the retrieval result is more accurate and comprehensive.
Finally, the final retrieval result is returned to the user.
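The two-stage flow of this embodiment (direct matching first, similarity-based matching only when nothing matches) can be sketched as below. The sketch simplifies in two labeled ways: direct hits are returned in identifier order rather than by a relevance calculation, and `score_fn` is a hypothetical stand-in for the semantic similarity sim_S, supplied by the caller.

```python
def retrieve(keyword, index, synonyms, score_fn, top_k=5):
    """Return archive ids: direct matches if any, else similarity-ranked ones.

    index: maps a keyword node to the set of archive ids that mention it.
    score_fn(query, term): assumed similarity in [0, 1]; not from the source.
    """
    # Stage 1: direct matching on the keyword and its synonyms.
    terms = {keyword, *synonyms.get(keyword, ())}
    direct = set()
    for term in terms:
        direct |= index.get(term, set())
    if direct:
        return sorted(direct)

    # Stage 2: no direct hit, so rank every archive by its best-scoring term.
    best = {}
    for term, archives in index.items():
        score = score_fn(keyword, term)
        for archive in archives:
            best[archive] = max(score, best.get(archive, 0.0))
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [archive for archive, _score in ranked[:top_k]]


def char_overlap(query, term):
    """Toy scorer (illustrative only): Jaccard overlap of the characters."""
    a, b = set(query), set(term)
    return len(a & b) / len(a | b) if a | b else 0.0


index = {"flood": {"A1"}, "drought": {"A2"}}
direct_hits = retrieve("flood", index, {}, char_overlap)
fallback_hits = retrieve("floods", index, {}, char_overlap)
```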
The above is only a preferred embodiment of the invention and does not limit the invention to other forms. Any person skilled in the art may use the technical content disclosed above to produce equivalent embodiments through modification or variation. However, any simple modification, equivalent change or variation made to the above embodiment according to the technical essence of the invention, without departing from the content of the technical solution of the invention, still falls within the protection scope of the technical solution of the invention.

Claims (6)

1. A semantics-based digital archive information association retrieval method, characterized by comprising the following steps:
a. first, digitizing the archive information resources;
b. classifying the archive information resources digitized in step a by element, according to key elements of the files such as events and units, correctly extracting the knowledge related to the file body, determining the recognized basic vocabulary, and specifying the semantic relations among the knowledge so as to construct RDF triples;
c. performing synonym expansion on the RDF triples constructed in step b, and storing the expanded knowledge in the corresponding triples;
d. then, performing keyword association matching between different archive information resources according to the event topic or nodes extended from the event topic, so as to form a semantic knowledge graph model;
e. then, reading the keyword information input by a user during retrieval, and, using the semantic knowledge graph model established in step d, ranking and outputting the obtained resources through semantic analysis and retrieval;
f. returning the final retrieval result to the user;
in step e, the semantic analysis and retrieval includes direct matching retrieval and semantic similarity calculation matching retrieval, wherein the semantic similarity calculation matching retrieval computes the data content that best matches the keywords together with the associated information content corresponding to the query on the retrieved keyword subject, and ranks the whole by relevance for output; the semantic similarity is calculated as:
sim_S(m, n) = α·sim_A(m, n) + β·sim_C(m, n) + γ·sim_L(m, n)
wherein m and n are two different files; α, β and γ are adjusting parameters, each with a value in the range 0 to 1; sim_A(m, n) is the attribute correlation between files m and n; sim_C(m, n) is the maximum semantic cosine distance between files m and n; and sim_L(m, n) is the path distance between files m and n.
2. The semantics-based digital archive information association retrieval method of claim 1, wherein in step e, sim_A(m, n) is calculated as:
Figure FDA0003822783970000011
where f(m∩n) is the similarity between the attributes shared by files m and n, f(m-n) is the number of attributes that file m contains but file n does not, and f(n-m) is the number of attributes that file n contains but file m does not.
3. The semantics-based digital archive information association retrieval method of claim 1, wherein in step e, sim_C(m, n) is calculated as:
sim_C(m, n) = cos(m, n).
4. The semantics-based digital archive information association retrieval method of claim 3, wherein sim_L(m, n) is calculated as:
Figure FDA0003822783970000021
wherein length(m, n) is the path distance parameter for jumping from file m to file n, and
Figure FDA0003822783970000022
is an adjusting parameter whose value is 1.
5. The semantics-based digital archive information association retrieval method of claim 4, wherein the semantic knowledge graph model further comprises an archive knowledge extraction module and a knowledge storage module, wherein the archive knowledge extraction module performs extraction of the atomic information elements of the archive and extraction of the RDF triples of the archive.
6. The semantics-based digital archive information association retrieval method of claim 5, wherein the knowledge storage module stores the semantic knowledge graph model in a Neo4j database.
CN202211047113.8A 2022-08-30 2022-08-30 Digital archive information association retrieval method based on semantics Pending CN115544225A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211047113.8A CN115544225A (en) 2022-08-30 2022-08-30 Digital archive information association retrieval method based on semantics

Publications (1)

Publication Number Publication Date
CN115544225A true CN115544225A (en) 2022-12-30

Family

ID=84724959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211047113.8A Pending CN115544225A (en) 2022-08-30 2022-08-30 Digital archive information association retrieval method based on semantics

Country Status (1)

Country Link
CN (1) CN115544225A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116756396A (en) * 2023-06-29 2023-09-15 广东齐峰信息科技有限公司 Digital archive management system and method based on knowledge graph
CN116756396B (en) * 2023-06-29 2023-12-22 广东齐峰信息科技有限公司 Digital archive management system and method based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination