CN107480195B

CN107480195B - Indexing relation-based ancient literature unified logic retrieval method

Info

Publication number: CN107480195B
Application number: CN201710574556.5A
Authority: CN
Inventors: 邵玉斌; 朱小妮; 杨美菊; 王逍翔; 曹云
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-07-14
Filing date: 2017-07-14
Publication date: 2020-07-10
Anticipated expiration: 2037-07-14
Also published as: CN107480195A

Abstract

The invention relates to an index-relation-based unified logic retrieval method for ancient literature, in particular to a method that extracts whatever logical relations are contained in the text string entered for a text search and combines them logically, and belongs to the technical field of literature retrieval. The method comprises six technical steps: constructing an index system; counting the number of occurrences of sentences with a fixed sentence length; establishing corresponding rules for the logical relations; extracting the logical relations contained in the input text string; combining the rules; and displaying and outputting the result. Because it is based on an understanding of logical relations, the method can meet users' differing retrieval requirements and greatly improves the user experience.

Description

Indexing relation-based ancient literature unified logic retrieval method

Technical Field

The invention relates to an indexing-relation-based ancient literature uniform logic retrieval method, and belongs to the technical field of literature retrieval.

Background

Ancient literature is a store of massive amounts of information. The open questions are how to obtain information that meets users' needs through reasonable and fast retrieval, and how to study the various objects of ancient literature automatically by computer so as to discover changes and acquire valuable knowledge. Because language differs greatly between countries and cultures, a retrieval method designed specifically for ancient Chinese documents is very important and lays a foundation for knowledge discovery.

Existing patents related to retrieval mostly focus on fast retrieval of information on the Internet, and there is less research on the retrieval of ancient documents. For example, application publication No. CN 105989030 A proposes a text retrieval method and device in which the text entered by the user is segmented into words, each keyword is displayed, and the user then selects keywords to search. That patent only retrieves Internet information quickly; it cannot retrieve ancient documents quickly, nor does it support research and analysis of the various objects in them.

As another example, application publication No. CN 105354325 A proposes a document retrieval and analysis system comprising: a basic retrieval module, which searches a structured database; an extended retrieval module, which searches in combination with natural language processing according to the user's request; and a multi-source integrated retrieval module, which integrates multiple patent database sources and performs cross-database retrieval for users. Although that patent outputs results that better match the user's requirements through multi-aspect association, it cannot perform statistical analysis of the various objects in ancient documents.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an indexing relation-based ancient document unified logic retrieval method, which is mainly used for solving the retrieval problem of ancient documents and laying a foundation for knowledge discovery.

The technical scheme adopted by the invention is as follows: an index-relation-based unified logic retrieval method for ancient literature, comprising the following steps:

1) constructing an index system:

reading a text;

establishing a first index table, wherein the first index table comprises a document number and a document name corresponding to the document number;

establishing a second index table, wherein the second index table comprises different characters in all documents and the documents in which the characters appear;

establishing a third index table, wherein the third index table comprises all different characters in each document and the positions of the characters;

writing the first index table, the second index table and the third index table into an index file for storage;
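The three index tables of step 1) can be pictured with a minimal Python sketch. The function name `build_index`, the `DocID_n` numbering scheme and the in-memory dictionaries are illustrative assumptions; the text above does not specify a data layout or the format of the index file, so persistence is omitted here.

```python
from collections import defaultdict

def build_index(documents):
    """Build the three index tables from {document name: full text} (assumed input format)."""
    doc_table = {}                 # first table: document number -> document name
    char_docs = defaultdict(set)   # second table: character -> documents containing it
    char_positions = {}            # third table: document number -> {character: [positions]}
    for number, (name, text) in enumerate(documents.items()):
        doc_id = f"DocID_{number}"  # hypothetical numbering scheme
        doc_table[doc_id] = name
        positions = defaultdict(list)
        for offset, ch in enumerate(text):
            char_docs[ch].add(doc_id)
            positions[ch].append(offset)
        char_positions[doc_id] = dict(positions)
    return doc_table, dict(char_docs), char_positions

doc_table, char_docs, char_positions = build_index({"a.txt": "天地。天！", "b.txt": "地？"})
```

Because positions are appended while scanning the text from left to right, every position list in the third table is already in ascending order, which is what step 2) relies on.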

2) counting the occurrence times of sentences with fixed sentence length:

reading a third index table;

because the period, question mark and exclamation mark each mark the pause at the end of a sentence, the index information of the period, question mark and exclamation mark in each document is obtained by reading the third index table and is denoted A, B and C respectively, where: A [a1, a2, a3, ……, an], B [b1, b2, b3, ……, bn], C [c1, c2, c3, ……, cn], with a1＜a2＜a3＜……＜an, b1＜b2＜b3＜……＜bn, c1＜c2＜c3＜……＜cn, and the values (a1……an), (b1……bn), (c1……cn) mutually different; A, B and C represent the period, the question mark and the exclamation mark respectively; a1 to an are the positions at which periods appear in the third index table, b1 to bn the positions at which question marks appear, and c1 to cn the positions at which exclamation marks appear;

the already sorted sequences A, B and C are merged, defining the sets D and E:

A and B are merged first. Each sequence maintains a position pointer, and the two pointers advance through the two sequences. The first elements a1 and b1 are compared; if a1＜b1, a1 is placed before b1, giving D [a1, b1], and the pointers advance to a2 and b2. These are compared in turn; if b2＜a2, then D [a1, b1, b2], and the pointer of the sequence that supplied the smaller value advances by one position, so that b3 is compared with a2 next. Elements continue to be taken in ascending order until all the numbers in A and B have been taken out. The numbers in sequence C are then merged with those in D by the same principle and stored in set E, so that A, B and C are merged into a single set E arranged in ascending order;
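The merging of A, B and C can be sketched as a standard two-pointer merge of sorted lists. This is an illustration under the assumption that the procedure described above is the usual merge, in which only the pointer of the sequence that supplied the smaller element advances; the function name `merge_sorted` and the sample positions are hypothetical.

```python
def merge_sorted(left, right):
    """Merge two ascending lists into one ascending list using two position pointers."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            merged.append(left[i])
            i += 1          # advance only the pointer of the smaller element
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # one of these two tails is empty
    merged.extend(right[j:])
    return merged

A = [5, 20, 41]   # positions of periods (made-up values)
B = [12, 30]      # positions of question marks
C = [8, 50]       # positions of exclamation marks
D = merge_sorted(A, B)
E = merge_sorted(D, C)
```

Since the three position lists are mutually different and already sorted, two merges are enough and no general-purpose sort is required.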

for the set E [e1, e2, e3, ……, en], where e1＜e2＜e3＜……＜en, the set F is defined as: F [e2-e1, e3-e2, e4-e3, ……, en-e(n-1)];

counting the occurrence times of the same numerical values in the set F;
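A short sketch of how F and the occurrence counts could be computed from E. The helper name `sentence_length_counts` and the sample values are assumptions for illustration only.

```python
from collections import Counter

def sentence_length_counts(end_positions):
    """Return F (differences of consecutive sentence-end positions) and a count per value."""
    lengths = [later - earlier for earlier, later in zip(end_positions, end_positions[1:])]
    return lengths, Counter(lengths)

E = [4, 9, 14, 21, 26]          # made-up sorted sentence-end positions
F, counts = sentence_length_counts(E)
```

Here each entry of F is the distance between two consecutive sentence-ending marks, so `counts` gives the number of sentences of each fixed length.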

3) establishing a corresponding rule for the logic relation:

establishing intersection, and for the character x and the character y, setting the interval set of x as follows:

{x1∈{a1＜x1＜b1},x2∈{a2＜x2＜b2},x3∈{a3＜x3＜b3},······,xn∈{an＜xn＜bn}}

and the interval set of y is:

{y1∈{c1＜y1＜d1},y2∈{c2＜y2＜d2},y3∈{c3＜y3＜d3},······,yn∈{cn＜yn＜dn}}

let a2＝c2, b2＝d2; a3＝c3, b3＝d3; a5＝c5, b5＝d5;

then x∩y＝{{a2＜x＜b2},{a3＜x＜b3},{a5＜x＜b5}}

or x∩y＝{{c2＜y＜d2},{c3＜y＜d3},{c5＜y＜d5}};
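The intersection rule can be illustrated by mapping every occurrence of a character to the sentence interval that encloses it and then keeping the intervals shared by x and y. This is a sketch, not the patented implementation: the helper names are hypothetical, and a sentinel value of -1 is assumed at the front of the sentence-end list so that the first sentence also has a left boundary.

```python
from bisect import bisect_left

def intervals_of(char_positions, sentence_ends):
    """Map each enclosing sentence interval (start, end) to the character positions inside it."""
    found = {}
    for pos in char_positions:
        k = bisect_left(sentence_ends, pos)   # index of the first sentence end after pos
        if 0 < k < len(sentence_ends):
            interval = (sentence_ends[k - 1], sentence_ends[k])
            found.setdefault(interval, []).append(pos)
    return found

def intersection(x_intervals, y_intervals):
    """Intervals (sentences) that contain both characters."""
    return sorted(set(x_intervals) & set(y_intervals))

ends = [-1, 9, 19, 29, 39]                 # sentinel plus made-up sentence-end positions
x_iv = intervals_of([3, 12, 33], ends)     # occurrences of character x
y_iv = intervals_of([15, 25, 36], ends)    # occurrences of character y
shared = intersection(x_iv, y_iv)
```

Two intervals count as equal only when both bounds coincide, which mirrors the conditions a2＝c2, b2＝d2 and so on in the rule above.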

intersection of the intersections: given the established intersection, it is known that

z∈{y2-x2,y3-x3,y5-x5} and y2-x2＝y5-x5＝c, where z represents the set of position differences between the characters in the same interval,

x∩y＝{{a2＜x＜b2},{a3＜x＜b3},{a5＜x＜b5}}∩{z∈{y2-x2,y5-x5}}，x∈{a＜x＜b}∩y∈{c＜y＜d}∩{b-a＝c}；
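The further restriction to a fixed position difference can be sketched as a filter over the shared intervals. The function name `same_sentence_with_gap`, the dictionary layout (interval mapped to the positions found inside it) and the sample values are assumptions made for illustration.

```python
def same_sentence_with_gap(x_by_interval, y_by_interval, gap):
    """Keep (interval, x position, y position) triples where y - x equals the given constant."""
    hits = []
    for interval in sorted(set(x_by_interval) & set(y_by_interval)):
        for x in x_by_interval[interval]:
            for y in y_by_interval[interval]:
                if y - x == gap:
                    hits.append((interval, x, y))
    return hits

x_by_interval = {(-1, 9): [3], (9, 19): [12], (29, 39): [33]}
y_by_interval = {(9, 19): [15], (19, 29): [25], (29, 39): [36]}
hits = same_sentence_with_gap(x_by_interval, y_by_interval, 3)
```

In this made-up data both shared sentences have y exactly three positions after x, so both survive the filter; a different constant would return an empty list.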

difference set 1: knowing the established intersection, then

{x1∈{a1＜x1＜b1},x2∈{a2＜x2＜b2},x3∈{a3＜x3＜b3},······,xn∈{an＜xn＜bn}}- x∩y＝{{a2＜x2＜b2},{a3＜x3＜b3},{a5＜x5＜b5}}＝ {x1∈{a1＜x1＜b1},x4∈{a4＜x4＜b4},x6∈{a6＜x6＜b6},······,xn∈{an＜xn＜bn}}；

Difference set 2: knowing the established intersection, then

{y1∈{c1＜y1＜d1},y2∈{c2＜y2＜d2},y3∈{c3＜y3＜d3},······,yn∈{cn＜yn＜dn}}- x∩y＝{{c2＜y2＜d2},{c3＜y3＜d3},{c5＜y5＜d5}}＝ {y1∈{c1＜y1＜d1},y4∈{c4＜y4＜d4},y6∈{c6＜y6＜d6},······,yn∈{cn＜yn＜dn}}；
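The two difference sets amount to removing the shared intervals from each character's own interval set. A minimal sketch follows; the function name `difference` and the sample intervals are hypothetical.

```python
def difference(own_intervals, shared_intervals):
    """Intervals that contain one character but are not in the intersection."""
    shared = set(shared_intervals)
    return [interval for interval in own_intervals if interval not in shared]

x_intervals = [(-1, 9), (9, 19), (29, 39)]    # sentences containing x (made-up values)
y_intervals = [(9, 19), (19, 29), (29, 39)]   # sentences containing y
shared = [interval for interval in x_intervals if interval in set(y_intervals)]
only_x = difference(x_intervals, shared)      # difference set 1: x without y
only_y = difference(y_intervals, shared)      # difference set 2: y without x
```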

4) Extracting the logic relation contained in the input text string:

from step 2), 3):

x∧y, which indicates that x and y both exist in the same sentence;

indicates that x is present or not in the same sentence;

indicating that y is present or not in the same sentence;

yi-xi = p, which indicates that the difference between the positions of y and x in the same sentence equals a constant p;

yi-xi > p, which indicates that the difference between the positions of y and x in the same sentence is greater than a constant p;

yi-xi < p, which indicates that the difference between the positions of y and x in the same sentence is less than a constant p;

bi-ai = Q, which indicates that the sentence length equals a constant Q;

bi-ai > Q, which indicates that the sentence length is greater than a constant Q;

bi-ai < Q, which indicates that the sentence length is less than a constant Q;
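How the relations are written in the input text string is not specified here, so the following is only a sketch under an assumed query syntax that reuses the notation of the rules above, for example `(yi-xi) < 5 ^ (bi-ai) = 20`. The pattern `RULE_PATTERN` and the function `extract_relations` are hypothetical names.

```python
import re

# Assumed syntax: an optional pair of parentheses around "yi-xi" or "bi-ai",
# then one of = < >, then a non-negative integer constant.
RULE_PATTERN = re.compile(r"\(?\s*(yi-xi|bi-ai)\s*\)?\s*([=<>])\s*(\d+)")

def extract_relations(query):
    """Return a list of (quantity, comparison symbol, constant) found in the query string."""
    return [(quantity, symbol, int(constant))
            for quantity, symbol, constant in RULE_PATTERN.findall(query)]

relations = extract_relations("(yi-xi) < 5 ^ (bi-ai) = 20")
```

Each extracted triple corresponds to one of the six distance or length relations listed above and can then be handed to the combination step.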

5) combining the rules:

indicates that within the same sentence there is both x and y and no z;

(yi-xi) = P ∧ (bi-ai) = Q, which indicates that in the same sentence the difference between y and x is P and the sentence length is Q;

(yi-xi) = P ∧ (bi-ai) > Q, which indicates that in the same sentence the difference between y and x is P and the sentence length is greater than Q;

(yi-xi) = P ∧ (bi-ai) < Q, which indicates that in the same sentence the difference between y and x is P and the sentence length is less than Q;

(yi-xi) > P ∧ (bi-ai) = Q, which indicates that in the same sentence the difference between y and x is greater than P and the sentence length is Q;

(yi-xi) > P ∧ (bi-ai) > Q, which indicates that in the same sentence the difference between y and x is greater than P and the sentence length is greater than Q;

(yi-xi) > P ∧ (bi-ai) < Q, which indicates that in the same sentence the difference between y and x is greater than P and the sentence length is less than Q;

(yi-xi) < P ∧ (bi-ai) = Q, which indicates that in the same sentence the difference between y and x is less than P and the sentence length is Q;

(yi-xi) < P ∧ (bi-ai) > Q, which indicates that in the same sentence the difference between y and x is less than P and the sentence length is greater than Q;

(yi-xi) < P ∧ (bi-ai) < Q, which indicates that in the same sentence the difference between y and x is less than P and the sentence length is less than Q;
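The nine combinations above are all conjunctions of one distance rule and one length rule, so they can be sketched as composable predicates. The record layout (keys `a`, `b`, `x`, `y` for the sentence bounds and the two character positions), the function names and the sample sentences are assumptions for illustration.

```python
import operator

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def rule(quantity, symbol, constant):
    """Build a predicate for 'distance' (yi - xi) or 'length' (bi - ai)."""
    compare = OPS[symbol]
    def check(sentence):
        if quantity == "distance":
            value = sentence["y"] - sentence["x"]
        else:
            value = sentence["b"] - sentence["a"]
        return compare(value, constant)
    return check

def conjunction(*rules):
    """Combine rules so that a sentence must satisfy all of them."""
    return lambda sentence: all(r(sentence) for r in rules)

sentences = [                                # made-up sentence records
    {"a": 9, "b": 19, "x": 12, "y": 15},     # distance 3, length 10
    {"a": 29, "b": 39, "x": 31, "y": 37},    # distance 6, length 10
    {"a": 39, "b": 60, "x": 41, "y": 44},    # distance 3, length 21
]
query = conjunction(rule("distance", "=", 3), rule("length", ">", 10))
matches = [s for s in sentences if query(s)]
```

Only the third record satisfies both conditions. Building the rules as small functions keeps each of the nine combinations a one-line expression rather than a separate code path.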

6) displaying and outputting the result:

extracting the logic relations according to the step 4), combining the logic relations according to the step 5), inquiring in the index table according to the step 1), and displaying the inquiring result.

The beneficial effects of the invention are as follows: the invention provides an efficient and reasonable calculation method for the retrieval of ancient documents. It not only enables retrieval of ancient Chinese documents but can also count certain information automatically, and by defining rules for logical relations and combining those rules it lays a foundation for knowledge discovery.

Drawings

FIG. 1 is a flow chart of the construction of the index system in the present invention;

FIG. 2 is the overall flow chart of the present invention.

Detailed Description

The invention is further described with reference to the following drawings and detailed description.

Example 1: as shown in fig. 1 and 2, a method for unified logical search of ancient documents based on index relationship includes the following steps:

1) constructing an index system:

reading a text;

2) counting the occurrence times of sentences with fixed sentence length:

reading a third index table;

because the period, question mark and exclamation mark each mark the pause at the end of a sentence, the index information of the period, question mark and exclamation mark in each document is obtained by reading the third index table and is denoted A, B and C respectively, where: A [a1, a2, a3, ……, an], B [b1, b2, b3, ……, bn], C [c1, c2, c3, ……, cn], with a1＜a2＜a3＜……＜an, b1＜b2＜b3＜……＜bn, c1＜c2＜c3＜……＜cn, and the values (a1……an), (b1……bn), (c1……cn) mutually different; A, B and C represent the period, the question mark and the exclamation mark respectively; a1 to an are the positions at which periods appear in the third index table, b1 to bn the positions at which question marks appear, and c1 to cn the positions at which exclamation marks appear;

the already ordered A, B, C are merged to define D, E set:

for the set E [e1, e2, e3, ……, en], where e1＜e2＜e3＜……＜en, the set F is defined as: F [e2-e1, e3-e2, e4-e3, ……, en-e(n-1)];

counting the occurrence times of the same numerical values in the set F;

3) establishing a corresponding rule for the logic relation:

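establishing intersection: for the character x and the character y, the interval set of x is:

{x1∈{a1＜x1＜b1},x2∈{a2＜x2＜b2},x3∈{a3＜x3＜b3},······,xn∈{an＜xn＜bn}}

and the interval set of y is:

{y1∈{c1＜y1＜d1},y2∈{c2＜y2＜d2},y3∈{c3＜y3＜d3},······,yn∈{cn＜yn＜dn}}

let a2＝c2, b2＝d2; a3＝c3, b3＝d3; a5＝c5, b5＝d5;

then x∩y＝{{a2＜x＜b2},{a3＜x＜b3},{a5＜x＜b5}}

or x∩y＝{{c2＜y＜d2},{c3＜y＜d3},{c5＜y＜d5}};

intersection of the intersections:

x∩y＝{{a2＜x＜b2},{a3＜x＜b3},{a5＜x＜b5}}∩{z∈{y2-x2,y5-x5}}， x∈{a＜x＜b}∩y∈{c＜y＜d}∩{b-a＝c}；

difference set 1: knowing the established intersection, then

{x1∈{a1＜x1＜b1},x2∈{a2＜x2＜b2},x3∈{a3＜x3＜b3},······,xn∈{an＜xn＜bn}}- x∩y＝{{a2＜x2＜b2},{a3＜x3＜b3},{a5＜x5＜b5}}＝ {x1∈{a1＜x1＜b1},x4∈{a4＜x4＜b4},x6∈{a6＜x6＜b6},······,xn∈{an＜xn＜bn}}；

difference set 2: knowing the established intersection, then

{y1∈{c1＜y1＜d1},y2∈{c2＜y2＜d2},y3∈{c3＜y3＜d3},······,yn∈{cn＜yn＜dn}}- x∩y＝{{c2＜y2＜d2},{c3＜y3＜d3},{c5＜y5＜d5}}＝ {y1∈{c1＜y1＜d1},y4∈{c4＜y4＜d4},y6∈{c6＜y6＜d6},······,yn∈{cn＜yn＜dn}}；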

4) Extracting the logic relation contained in the input text string:

from step 2), 3):

x∧y, which indicates that x and y both exist in the same sentence;

indicates that x is present or not in the same sentence;

indicating that y is present or not in the same sentence;

bi-ai = Q, which indicates that the sentence length equals a constant Q;

bi-ai > Q, which indicates that the sentence length is greater than a constant Q;

bi-ai < Q, which indicates that the sentence length is less than a constant Q;

5) combining the rules:

indicates that within the same sentence there is both x and y and no z;

6) displaying and outputting the result:

By way of example, as shown in FIG. 1, the Four Great Classical Novels are taken as the corpus. An index system is constructed: the texts are read and a first index table is established, which contains the document numbers and the document name corresponding to each number. The first index table is shown in Table 1:

Table 1:

Document number	Document name
DocID_0	Romance of the Three Kingdoms.txt
DocID_1	Water Margin.txt
DocID_2	Dream of the Red Chamber.txt
DocID_3	Journey to the West.txt
……	……

Based on the first index table, a second index table is established by the method in Example 1. The second index table contains the distinct characters in all documents and the documents in which each character appears, and the number of occurrences of each character can also be counted. Taking some of the characters as an example, the second index table is shown in Table 2:

Table 2:

Based on the second index table, a third index table is established by the method in Example 1. The third index table contains all the distinct characters in each document and the positions of those characters; for the document Dream of the Red Chamber, the third index table is shown in Table 3:

Table 3:

The method in Example 1 is used to count the number of occurrences of sentences with a fixed sentence length. The index information of the period, question mark and exclamation mark is obtained by reading the third index table, as shown in Table 4:

Table 4:

The positions of the punctuation marks, 34390 positions in total, are sorted in ascending order by the method in Example 1 to obtain Table 5:

Table 5:

The sentence lengths are computed by the method in Example 1, as shown in Table 6:

Table 6:

The number of occurrences of each fixed sentence length is counted by the method in Example 1, as shown in Table 7:

Table 7:

The method in Example 1 is used to establish the corresponding rules for the logical relations: intersection, intersection of intersections, difference set 1 and difference set 2. The interval set of the character "because" is shown in Table 8, the interval set of the character "so" is shown in Table 9, and the set of intervals in which both "because" and "so" occur in the same sentence is shown in Table 10:

table 8:

Interval	"because" position
[122314,122338]	[122317]
[123276,123335]	[123307]
[253308,253339]	[253331]
[255769,255802]	[255784]
……	……
[91142,91171]	[91158]

Table 9:

Interval	"so" position
[101437,101480]	[101471]
[108878,108926]	[108918]
[111389,111416]	[111372]
[255769,255802]	[255794]
……	……
[99754,99836]	[99829]

Table 10:

Intersection interval	"because" position	"so" position
[255769,255802]	[255784]	[255794]
[398953,399003]	[398956]	[398996]
[459956,460018]	[459964]	[459988]
[515751,515794]	[515755]	[515780]
[66039,66085]	[66001]	[66028]
[749675,749746]	[749685]	[749697]
[91142,91171]	[91158]	[91166]
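As a worked check, the intersection rule can be applied to the rows that are actually listed in Tables 8 and 9. The sketch below is illustrative only: the variable names are hypothetical, and the rows elided by "……" in the tables are not included, so only one of the seven intersection intervals can be recovered from the listed data.

```python
# Listed rows of Table 8: sentence interval -> position of "because"
because = {
    (122314, 122338): 122317,
    (123276, 123335): 123307,
    (253308, 253339): 253331,
    (255769, 255802): 255784,
    (91142, 91171): 91158,
}
# Listed rows of Table 9: sentence interval -> position of "so"
so = {
    (101437, 101480): 101471,
    (108878, 108926): 108918,
    (111389, 111416): 111372,
    (255769, 255802): 255794,
    (99754, 99836): 99829,
}

shared = sorted(because.keys() & so.keys())   # intervals present in both tables
rows = [(iv, because[iv], so[iv], so[iv] - because[iv]) for iv in shared]
```

The single shared interval [255769,255802] matches the first intersection row, and the distance between the two characters in that sentence is 255794 - 255784 = 10, the kind of value the yi-xi rules operate on.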

The logical relations contained in the input text string are extracted by the method in Example 1, as shown in Table 11:

Table 11:

The rules are combined by the method in Example 1, as shown in Table 12:

Table 12:

The query result is displayed and output by the method in Example 1; the results matching the query are shown in Table 13:

Table 13:

As Table 13 shows, the input character string is split into substrings that satisfy the rules, and the results that meet the conditions are finally displayed and output.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims

1. A method for unified logical retrieval of ancient documents based on index relationship is characterized in that: the method comprises the following steps:

1) constructing an index system:

reading a text;

2) counting the occurrence times of sentences with fixed sentence length:

reading a third index table;

because the period, question mark and exclamation mark each mark the pause at the end of a sentence, the index information of the period, question mark and exclamation mark in each document is obtained by reading the third index table and is denoted A, B and C respectively, where: A [a1, a2, a3, ……, an], B [b1, b2, b3, ……, bn], C [c1, c2, c3, ……, cn], with a1＜a2＜a3＜……＜an, b1＜b2＜b3＜……＜bn, c1＜c2＜c3＜……＜cn, and the values (a1……an), (b1……bn), (c1……cn) mutually different; A, B and C represent the period, the question mark and the exclamation mark respectively; a1 to an are the positions at which periods appear in the third index table, b1 to bn the positions at which question marks appear in the third index table, and c1 to cn the positions at which exclamation marks appear in the third index table;

the already sorted sequences A, B and C are merged, defining the sets D and E:

A and B are merged first. Each sequence maintains a position pointer, and the two pointers advance through the two sequences. The first elements a1 and b1 are compared; if a1＜b1, a1 is placed before b1, giving D [a1, b1], and the pointers advance to a2 and b2. These are compared in turn; if b2＜a2, then D [a1, b1, b2], and the pointer of the sequence that supplied the smaller value advances by one position, so that b3 is compared with a2 next. Elements continue to be taken in ascending order until all the numbers in A and B have been taken out. The numbers in sequence C are then merged with those in D by the same principle by which A and B were merged into D, and stored in set E, so that A, B and C are merged into a single set E arranged in ascending order;

for the set E [e1, e2, e3, ……, en], where e1＜e2＜e3＜……＜en, the set F is defined as: F [e2-e1, e3-e2, e4-e3, ……, en-e(n-1)];

counting the occurrence times of the same numerical values in the set F;

3) establishing a corresponding rule for the logic relation:

establishing intersection: for the character x and the character y, the interval set of x is:

{x1∈{a1＜x1＜b1},x2∈{a2＜x2＜b2},x3∈{a3＜x3＜b3},……,xn∈{an＜xn＜bn}}

and the interval set of y is:

{y1∈{c1＜y1＜d1},y2∈{c2＜y2＜d2},y3∈{c3＜y3＜d3},……,yn∈{cn＜yn＜dn}}

let a2＝c2, b2＝d2; a3＝c3, b3＝d3; a5＝c5, b5＝d5;

then x∩y＝{{a2＜x＜b2},{a3＜x＜b3},{a5＜x＜b5}}

or x∩y＝{{c2＜y＜d2},{c3＜y＜d3},{c5＜y＜d5}};

difference set 1: knowing the established intersection, then

{x1∈{a1＜x1＜b1},x2∈{a2＜x2＜b2},x3∈{a3＜x3＜b3},……,xn∈{an＜xn＜bn}}-x∩y＝{{a2＜x2＜b2},{a3＜x3＜b3},{a5＜x5＜b5}}＝{x1∈{a1＜x1＜b1},x4∈{a4＜x4＜b4},x6∈{a6＜x6＜b6},……,xn∈{an＜xn＜bn}}；

Difference set 2: knowing the established intersection, then

{y1∈{c1＜y1＜d1},y2∈{c2＜y2＜d2},y3∈{c3＜y3＜d3},……,yn∈{cn＜yn＜dn}}-x∩y＝{{c2＜y2＜d2},{c3＜y3＜d3},{c5＜y5＜d5}}＝{y1∈{c1＜y1＜d1},y4∈{c4＜y4＜d4},y6∈{c6＜y6＜d6},……,yn∈{cn＜yn＜dn}}；

4) Extracting the logic relation contained in the input text string:

from step 2), 3):

x∧y, which indicates that x and y both exist in the same sentence;

indicates that x is present or not in the same sentence;

indicating that y is present or not in the same sentence;

bi-ai = Q, which indicates that the sentence length equals a constant Q;

bi-ai > Q, which indicates that the sentence length is greater than a constant Q;

bi-ai < Q, which indicates that the sentence length is less than a constant Q;

5) combining the rules:

indicates that within the same sentence there is both x and y and no z;

(yi-xi) < P ∧ (bi-ai) = Q, which indicates that in the same sentence the difference between y and x is less than P and the sentence length is Q;

6) displaying and outputting the result: