CN115544225A

CN115544225A - Digital archive information association retrieval method based on semantics

Info

Publication number: CN115544225A
Application number: CN202211047113.8A
Authority: CN
Inventors: 冯炫; 马林聪; 曹豪; 潘冬; 苗思宇
Original assignee: Shaanxi Zhiyin Technology Co ltd
Current assignee: Shaanxi Zhiyin Technology Co ltd
Priority date: 2022-08-30
Filing date: 2022-08-30
Publication date: 2022-12-30

Abstract

The invention belongs to the technical field of information retrieval application, and particularly relates to a digital archive information association retrieval method based on semantics. The invention adds semantic similarity retrieval on the basis of the traditional keyword retrieval, so that the retrieval range is wider and more comprehensive, and the provided data is comprehensive and accurate, thereby solving the technical problems in the existing file management.

Description

Digital archive information association retrieval method based on semantics

Technical Field

The invention belongs to the technical field of information retrieval application, and particularly relates to a digital archive information association retrieval method based on semantics.

Background

With the rapid development of information technology and artificial intelligence, society has moved from the industrial era into the digital era. In this era, the conversion of traditional information resources into digital form is growing explosively, and digital archive information, as a special type of digital information resource, has become the mainstream choice of archive management organizations for describing and recording information about people and events.

This change in the form of content changes how archive information is stored and used, and raises the following problem: how can the required target content be obtained accurately and quickly from massive digital archive information? This places higher demands on retrieval technology for digital archive information.

An analysis of retrieval methods for digital archives shows that the existing methods are limited in kind: they simply match and query the natural language input by the user, and include directory retrieval, content matching retrieval and the like. When faced with massive information, such retrieval is inefficient, incomplete and inaccurate; that is, the retrieved content contains a large amount of irrelevant information, or the results are limited to content that contains the keywords, so that comprehensive and accurate archive information cannot be provided to users.

Disclosure of Invention

In view of the technical problems of digital archive retrieval described above, the invention provides a semantic-based digital archive information association retrieval method that is reasonable in design, simple in method, and comprehensive and accurate in retrieval.

To achieve the above object, the invention provides a semantic-based digital archive information association retrieval method, characterized by comprising the following steps:

a. firstly, carrying out digital processing on archive information resources;

b. performing element classification on the archive information resources digitized in step a according to key elements of the archives such as events and units, correctly extracting the knowledge related to the archive body, determining the recognized basic vocabulary, and specifying the semantic relations among the knowledge items to construct RDF triples;

c. carrying out synonym expansion on the RDF triples constructed in the step b, and storing the expanded knowledge in corresponding triples;

d. then, realizing keyword association matching between different archive information resources according to the event topic or nodes extended from the event topic to form a semantic knowledge graph model;

e. then, reading the keyword information input by the user during retrieval, and ranking and outputting the resources obtained through semantic analysis and retrieval using the semantic knowledge graph model established in step d;

f. returning the final retrieval result to the user;

in step e, the semantic analysis and retrieval includes direct matching retrieval and semantic similarity calculation matching retrieval, wherein the semantic similarity calculation matching retrieval obtains, by calculation, the data content that best matches the keywords together with the associated information content found by querying on the subject of the retrieved keywords, and ranks the whole by relevance and outputs it; the semantic similarity calculation formula is:

$$\mathrm{sim}_S(m,n) = \alpha \cdot \mathrm{sim}_A(m,n) + \beta \cdot \mathrm{sim}_C(m,n) + \gamma \cdot \mathrm{sim}_L(m,n)$$

where m and n are two different files; α, β and γ are adjustment parameters in the range 0 to 1; sim_A(m, n) is the attribute correlation between files m and n; sim_C(m, n) is the maximum semantic cosine distance between files m and n; and sim_L(m, n) is the path distance between files m and n.
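The weighted combination above can be sketched in Python as follows. This is only an illustration: the function name `combined_similarity` and the default weights are hypothetical, since the text states only that α, β and γ lie between 0 and 1 and gives no concrete values.

```python
def combined_similarity(sim_a: float, sim_c: float, sim_l: float,
                        alpha: float = 0.4, beta: float = 0.4,
                        gamma: float = 0.2) -> float:
    """Weighted sum sim_S = alpha*sim_A + beta*sim_C + gamma*sim_L.

    The default weights are placeholders chosen for illustration only.
    """
    for name, weight in (("alpha", alpha), ("beta", beta), ("gamma", gamma)):
        # The text restricts each adjustment parameter to the range 0..1.
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {weight}")
    return alpha * sim_a + beta * sim_c + gamma * sim_l


# Example with explicit weights: 0.5*1.0 + 0.5*0.5 + 0.0*0.0
score = combined_similarity(1.0, 0.5, 0.0, alpha=0.5, beta=0.5, gamma=0.0)
```

The three component scores are passed in as plain numbers so that each can be computed, and tuned, independently.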

Preferably, in step e, sim_A(m, n) is calculated as:

[formula not reproduced in the source text]

where f(m ∩ n) is the similarity between the attributes shared by files m and n, f(m − n) is the number of attributes that file m contains but file n does not, and f(n − m) is the number of attributes that file n contains but file m does not.
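Because the formula for sim_A is not available in this text, the following Python sketch rests on an assumption: it combines the three quantities defined above in an unweighted ratio, shared / (shared + only in m + only in n), and treats the shared-attribute term as a simple count. The actual formula may weight these terms differently. The function name `attribute_similarity` is hypothetical.

```python
from typing import Set


def attribute_similarity(attrs_m: Set[str], attrs_n: Set[str]) -> float:
    """Assumed ratio form: shared / (shared + only_m + only_n)."""
    shared = len(attrs_m & attrs_n)   # stands in for f(m ∩ n), taken as a count
    only_m = len(attrs_m - attrs_n)   # f(m − n): attributes only file m has
    only_n = len(attrs_n - attrs_m)   # f(n − m): attributes only file n has
    total = shared + only_m + only_n
    # Two files with no attributes at all are treated as unrelated.
    return shared / total if total else 0.0


# Two shared attributes, one unique to each side: 2 / (2 + 1 + 1)
example = attribute_similarity({"event", "unit", "date"},
                               {"unit", "date", "place"})
```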

Preferably, in step e, sim_C(m, n) is calculated as:

$$\mathrm{sim}_C(m,n) = \cos(m,n)$$
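As an illustration of the cosine term, the sketch below computes the cosine of the angle between two numeric vectors. It assumes each file has already been turned into a fixed-length vector; the text does not say how files are vectorized, so that step and the function name `cosine_similarity` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """cos(u, v) = (u . v) / (|u| * |v|) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction, so it is treated as unrelated.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```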

Preferably, sim_L(m, n) is calculated as:

[formula not reproduced in the source text]

where length(m, n) is the path distance between file m and file n, and the adjustment parameter in the formula takes the value 1.
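The path term can be illustrated with a breadth-first search over the knowledge graph. Two points in this Python sketch are assumptions rather than statements of the method: length(m, n) is taken to be the number of edges on the shortest path, and the conversion δ / (length + δ) with δ = 1 is one plausible way to turn a distance into a similarity, since the formula itself is not available in this text. The names `path_length` and `path_similarity` are hypothetical.

```python
from collections import deque
from typing import Dict, Optional, Set


def path_length(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[int]:
    """Number of edges on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(length: Optional[int], delta: float = 1.0) -> float:
    """ASSUMED form delta / (length + delta); unreachable nodes score 0."""
    if length is None:
        return 0.0
    return delta / (length + delta)


# Tiny undirected example graph: a - b - c
demo_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
demo_score = path_similarity(path_length(demo_graph, "a", "c"))
```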

Preferably, the semantic knowledge graph model further comprises an archive knowledge extraction module and a knowledge storage module, wherein the archive knowledge extraction module covers extraction of the atomic information elements of the archive and extraction of the RDF triples of the archive.

Preferably, the knowledge storage module stores the semantic knowledge graph model in a Neo4j database.

Compared with the prior art, the invention has the following advantages and positive effects:

1. The invention provides a semantic-based digital archive information association retrieval method that adds semantic similarity retrieval to traditional keyword retrieval, which widens the retrieval range and makes it more comprehensive, so that the data provided are comprehensive and accurate, thereby solving the technical problems in existing archive management.

Drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a semantic-based digital archive information association retrieval method provided in embodiment 1;

FIG. 2 is a flowchart of the semantic knowledge graph model provided in embodiment 1.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, the present invention will be further described with reference to the accompanying drawings and examples. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments of the present disclosure.

Embodiment 1. As shown in FIG. 1 and FIG. 2, this embodiment aims to provide a retrieval method with a wider retrieval range and more comprehensive and accurate retrieval results. To this end, the semantic-based digital archive information association retrieval method provided by this embodiment includes the following steps.

First, the archive information resources are digitized. With the rapid development of informatization, part of the existing archive data has already been digitized and needs no further processing; digitization, by scanning and other means, is mainly applied to paper archives so as to facilitate subsequent computation.

To find the corresponding archive accurately, its content must be understood to some extent. The archive information resources are therefore classified by element according to key elements of the archive such as events and units; the knowledge related to the archive body is correctly extracted, the recognized basic vocabulary is determined, and the semantic relations among the knowledge items are specified to construct RDF triples. The RDF triples mainly serve to assign attributes and values to the related knowledge and basic vocabulary in the archive, which facilitates later computation.
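A minimal Python sketch of how extracted elements might be represented as subject-predicate-object triples is given below. The archive identifier, element names and values are invented for illustration, and the `Triple` type and `record_to_triples` helper are hypothetical; the text does not prescribe a data layout.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def record_to_triples(archive_id: str, elements: Dict[str, str]) -> List[Triple]:
    """Turn each classified element of one archive into a triple."""
    return [Triple(archive_id, name, value) for name, value in elements.items()]


# Hypothetical archive record with its key elements already extracted.
triples = record_to_triples(
    "archive-001",
    {"event": "bridge inspection", "unit": "municipal works office", "date": "2020-05-01"},
)
```

Each triple attaches one attribute (the predicate) and its value (the object) to an archive, which matches the role the paragraph above gives to RDF triples.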

Obtaining related archives more accurately requires knowing more about them. If the event is taken as the core and expanded outward to related concepts, more archive information is associated with it and an association network is formed. Without storage limits, such a network could in theory be extended indefinitely to cover all related information. This embodiment considers two key factors that affect searching, namely synonyms and keyword relations. To this end, synonym expansion is first performed on the RDF triples and the expanded knowledge is stored in the corresponding triples; keyword association matching between different archive information resources is then realized according to the event topic or the nodes extended from it, forming the semantic knowledge graph model. Entity extraction mainly acquires the atomic information elements of the archive information, including names of persons, names of organizations, geographic locations, dates and the like. Triple extraction is mainly carried out according to a single rule between elements in the written text, so as to obtain the association relations between the atomic information elements and form triple instance data. The knowledge storage module stores the semantic knowledge graph model in a Neo4j database. A complete semantic knowledge graph model is thus formed.
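The synonym expansion and graph construction described above can be sketched in Python with an in-memory structure. The sketch makes two assumptions: each synonym is stored as an extra triple with a `"synonym"` predicate, and the graph is a plain undirected adjacency map rather than a Neo4j database. The sample data and the names `expand_synonyms` and `build_graph` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def expand_synonyms(triples: Iterable[Triple],
                    synonyms: Dict[str, List[str]]) -> List[Triple]:
    """Append one (term, "synonym", alternative) triple per known synonym."""
    expanded = list(triples)
    seen = set(expanded)
    for _subject, _predicate, obj in list(expanded):
        for alternative in synonyms.get(obj, []):
            extra = (obj, "synonym", alternative)
            if extra not in seen:  # avoid duplicates for repeated objects
                seen.add(extra)
                expanded.append(extra)
    return expanded


def build_graph(triples: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Undirected adjacency map linking every subject with its object."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


base = [("archive-001", "event", "bridge inspection"),
        ("archive-002", "event", "bridge survey")]
expanded = expand_synonyms(base, {"bridge inspection": ["bridge survey"]})
graph = build_graph(expanded)
```

In this example the synonym edge connects the two event nodes, so `archive-001` and `archive-002` become reachable from each other even though their event descriptions differ.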

Then, the keyword information input by the user during retrieval is read, and the resources obtained through semantic analysis and retrieval against the semantic knowledge graph model are ranked and output. The semantic analysis and retrieval includes direct matching retrieval and semantic similarity calculation matching retrieval. In direct matching, the keywords or their synonyms are matched during retrieval, and the results are ranked by calculated relevance and output. If the database contains no content matching the search term, semantic expansion is performed: semantic similarity calculation matching retrieval obtains, by calculation, the data content that best matches the keywords together with the associated information content found by querying on the subject of the retrieved keywords, and the whole is ranked by relevance and output. The semantic similarity calculation formula is:

$$\mathrm{sim}_S(m,n) = \alpha \cdot \mathrm{sim}_A(m,n) + \beta \cdot \mathrm{sim}_C(m,n) + \gamma \cdot \mathrm{sim}_L(m,n)$$

sim_A(m, n) is calculated as:

[formula not reproduced in the source text]

where f(m ∩ n) is the similarity between the attributes shared by files m and n, f(m − n) is the number of attributes that file m contains but file n does not, and f(n − m) is the number of attributes that file n contains but file m does not.

sim_C(m, n) is calculated as:

$$\mathrm{sim}_C(m,n) = \cos(m,n)$$

sim_L(m, n) is calculated as:

[formula not reproduced in the source text]

where length(m, n) is the path distance traversed in jumping from file m to file n, and the adjustment parameter in the formula takes the value 1.

Thus, semantic similarity calculation widens the retrieval range so that archives meeting the conditions can be retrieved, making the retrieval results more accurate and comprehensive.

Finally, the final retrieval result is returned to the user.
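The two-stage retrieval flow of this embodiment, direct matching first and semantic similarity matching only when nothing matches, can be sketched in Python as follows. The keyword index, the ranking of direct hits by number of matched terms, and the `similarity` callback (standing in for the combined semantic similarity) are assumptions made for illustration; the name `retrieve` is hypothetical.

```python
from typing import Callable, Dict, List, Set


def retrieve(query: str,
             index: Dict[str, Set[str]],
             synonyms: Dict[str, List[str]],
             similarity: Callable[[str, str], float],
             top_k: int = 5) -> List[str]:
    """Return archive ids: direct keyword/synonym hits, else similarity ranking."""
    terms = {query, *synonyms.get(query, [])}

    # Stage 1: direct matching on the keyword or any of its synonyms.
    direct = [doc for doc, keywords in index.items() if terms & keywords]
    if direct:
        direct.sort(key=lambda doc: len(terms & index[doc]), reverse=True)
        return direct[:top_k]

    # Stage 2: nothing matched, so fall back to semantic similarity scores.
    scored = sorted(((similarity(query, doc), doc) for doc in index), reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0.0]


index = {"a1": {"flood", "levee"}, "a2": {"drought"}}
synonyms = {"flood": ["inundation"]}
hits = retrieve("flood", index, synonyms, lambda q, d: 0.0)
fallback = retrieve("storm", index, synonyms,
                    lambda q, d: 0.9 if d == "a2" else 0.1)
```

Keeping the similarity function as a parameter lets the fallback stage use whatever combination of attribute, cosine and path terms is chosen, without changing the control flow.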

The above is only a preferred embodiment of the invention and does not limit the invention to other forms. Any person skilled in the art may use the technical content disclosed above to make changes or modifications yielding equivalent embodiments. Any simple modification, equivalent change or adaptation made to the above embodiment according to the technical essence of the invention, without departing from the technical content of the invention, still falls within the protection scope of the technical solution of the invention.

Claims

1. A digital archive information correlation retrieval method based on semantics is characterized by comprising the following steps:

a. firstly, carrying out digital processing on archive information resources;

d. then, realizing keyword correlation matching between different archive information resources according to the event theme or nodes extended from the event theme to form a semantic knowledge graph model;

f. returning the finally inquired retrieval result to the user;

$$\mathrm{sim}_S(m,n) = \alpha \cdot \mathrm{sim}_A(m,n) + \beta \cdot \mathrm{sim}_C(m,n) + \gamma \cdot \mathrm{sim}_L(m,n)$$

wherein m and n are two different files; α, β and γ are adjustment parameters in the range 0 to 1; sim_A(m, n) is the attribute correlation between files m and n; sim_C(m, n) is the maximum semantic cosine distance between files m and n; and sim_L(m, n) is the path distance between files m and n.

2. The digital archive information correlation retrieval method based on semantics as claimed in claim 1, wherein in step e, sim_A(m, n) is calculated as:

[formula not reproduced in the source text]

3. The digital archive information correlation retrieval method based on semantics of claim 1, wherein in step e, sim_C(m, n) is calculated as:

$$\mathrm{sim}_C(m,n) = \cos(m,n)$$

4. The semantic-based digital archive information association retrieval method of claim 3, wherein sim_L(m, n) is calculated as:

[formula not reproduced in the source text]

wherein the adjustment parameter in the formula takes the value 1.

5. The semantic-based digital archive information association retrieval method of claim 4, wherein the semantic knowledge graph model further comprises an archive knowledge extraction module and a knowledge storage module, wherein the archive knowledge extraction module covers extraction of the atomic information elements of the archive and extraction of the RDF triples of the archive.

6. The semantic-based digital archive information association retrieval method of claim 5, wherein the knowledge storage module stores the semantic knowledge graph model in a Neo4j database.