CN107480195A

CN107480195A - A unified logical retrieval method for ancient documents based on index relations

Info

Publication number: CN107480195A
Application number: CN201710574556.5A
Authority: CN
Inventors: 邵玉斌; 朱小妮; 杨美菊; 王逍翔; 曹云
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-07-14
Filing date: 2017-07-14
Publication date: 2017-12-15
Anticipated expiration: 2037-07-14
Also published as: CN107480195B

Abstract

The present invention relates to a unified logical retrieval method for ancient documents based on index relations. Specifically, it extracts whatever logical relations are contained in the text string entered for a text search and combines them logically. It belongs to the technical field of literature retrieval. The method comprises six technical steps: building an index system; counting the number of occurrences of sentences of a fixed length; establishing corresponding rules for the logical relations; extracting the logical relations contained in the input text string; combining the rules; and displaying the results. Because it works from an understanding of logical relations, the method can meet users' differing search needs and substantially improve the user experience.

Description

A unified logical retrieval method for ancient documents based on index relations

Technical field

The present invention relates to a unified logical retrieval method for ancient documents based on index relations, and belongs to the technical field of literature retrieval.

Background technology

Ancient documents constitute a massive store of information. The open questions are how to retrieve, quickly and sensibly, the information a user needs, and how to use computers to study the various objects in ancient documents automatically, detect changes, and thereby obtain valuable knowledge. Because languages differ greatly across countries and cultures, a retrieval method designed specifically for Chinese ancient documents is essential and lays the foundation for knowledge discovery.

Existing patents on retrieval mostly target fast retrieval of information on the Internet; research on retrieving ancient documents is comparatively scarce. For example, application publication No. CN 105989030 A proposes a text retrieval method and device. In that patent the text entered by the user is segmented into words, each keyword is displayed, and the user then selects keywords to search with. It only achieves fast retrieval of Internet information and supports neither fast retrieval of ancient documents nor research and analysis of the various objects in them.

As another example, application publication No. CN 105354325 A proposes a literature retrieval and analysis system. That patent provides a basic retrieval module, which searches a structured database; an extended retrieval module, which searches according to the user's request in combination with natural language processing; and a multi-source aggregated retrieval module, which integrates multiple data sources and lets the user cross-search patent databases. Although the patent combines several kinds of association to output results that better match the user's requirements, it cannot perform statistical research and analysis on the various objects in ancient documents.

Summary of the invention

The technical problem to be solved by the present invention is to provide a unified logical retrieval method for ancient documents based on index relations. It is mainly intended to solve the retrieval problem for ancient documents and to lay a foundation for knowledge discovery.

The technical solution adopted by the present invention is a unified logical retrieval method for ancient documents based on index relations, comprising the following steps:

1) Build the index system:

Read the text;

Establish the first index table, which contains each document number and the corresponding document title;

Establish the second index table, which contains every distinct character in all the documents and the documents in which that character appears;

Establish the third index table, which contains every distinct character and the positions of that character in each document;

Write the first, second and third index tables to an index file and save it;
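
The three index tables of step 1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary layouts, the function name `build_indexes` and the sample texts are assumptions, and writing the tables to an index file is omitted. Only the `DocID_n` numbering style is taken from Table 1.

```python
from collections import defaultdict

def build_indexes(documents):
    """Build the three index tables from a {title: text} mapping.

    The table layouts are assumptions; the source only names their contents.
    """
    first = {}                                      # document number -> title
    second = defaultdict(set)                       # character -> document numbers
    third = defaultdict(lambda: defaultdict(list))  # document number -> character -> positions
    for n, (title, text) in enumerate(documents.items()):
        doc_id = f"DocID_{n}"
        first[doc_id] = title
        for pos, ch in enumerate(text):
            second[ch].add(doc_id)
            third[doc_id][ch].append(pos)
    return first, second, third

# Hypothetical two-document corpus.
docs = {"a.txt": "天地。人！", "b.txt": "天下？"}
first, second, third = build_indexes(docs)
```

With this layout, `third[doc_id]["。"]` gives the full-stop positions that step 2) reads back.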

2) Count the number of occurrences of sentences of a fixed length:

Read the third index table;

Because a full stop, question mark or exclamation mark marks the pause at the end of a sentence, reading the third index table yields the index information of the full stops, question marks and exclamation marks in each document, denoted A, B and C respectively: $A = [a_1, a_2, a_3, \ldots, a_n]$, $B = [b_1, b_2, b_3, \ldots, b_n]$, $C = [c_1, c_2, c_3, \ldots, c_n]$, with $a_1 < a_2 < a_3 < \cdots < a_n$, $b_1 < b_2 < b_3 < \cdots < b_n$, $c_1 < c_2 < c_3 < \cdots < c_n$, and the values $a_1, \ldots, a_n$, $b_1, \ldots, b_n$, $c_1, \ldots, c_n$ all mutually distinct. A, B and C stand for the punctuation marks full stop, question mark and exclamation mark; $a_1$ to $a_n$ are the positions at which the full stop occurs in the third index table, $b_1$ to $b_n$ the positions at which the question mark occurs, and $c_1$ to $c_n$ the positions at which the exclamation mark occurs;

Merge the sorted A, B and C, defining the sets D and E:

First merge A and B. Each sequence maintains one position pointer, and the two pointers move backward through their respective lists. Compare the heads $a_1$ and $b_1$ of the two sequences; if $a_1 < b_1$, D becomes $[a_1, b_1]$ and both pointers advance by one. Then compare $a_2$ with $b_2$; if $b_2 < a_2$, D becomes $[a_1, b_1, b_2]$, and the pointer of the sequence holding the smaller value advances by one, so that $b_3$ is next compared with $a_2$. The values are placed in ascending order in this way until every number in A and B has been taken. The numbers in sequence C are then compared with those in sequence D by the same principle and stored in set E, so that A, B and C are merged into a single set E arranged in ascending order;
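
The merge of A, B and C into E can be sketched with the standard two-pointer merge of sorted lists, which advances only the pointer of the list that supplied the smaller value. The function name `merge_sorted` and the sample positions are assumptions made for illustration.

```python
def merge_sorted(left, right):
    """Merge two ascending lists, keeping one position pointer per list."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # whatever remains is already sorted
    merged.extend(right[j:])
    return merged

A = [5, 20, 41]   # hypothetical full-stop positions
B = [12, 33]      # hypothetical question-mark positions
C = [27]          # hypothetical exclamation-mark positions
D = merge_sorted(A, B)
E = merge_sorted(D, C)
```

Because the positions in A, B and C are mutually distinct, ties never occur and the choice between `<` and `<=` does not affect the result.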

Let $E = [e_1, e_2, e_3, \ldots, e_n]$ with $e_1 < e_2 < e_3 < \cdots < e_n$, and define the set F as $F = [e_2 - e_1,\ e_3 - e_2,\ e_4 - e_3,\ \ldots,\ e_n - e_{n-1}]$;

Count the number of times each identical value occurs in F;
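
Building F and counting its values can be sketched as below, assuming E is already a sorted list of sentence-boundary positions. The function name and the sample values are hypothetical; `collections.Counter` does the tallying.

```python
from collections import Counter

def sentence_length_counts(boundaries):
    """Return F (gaps between consecutive boundaries) and the tally of each gap."""
    F = [later - earlier for earlier, later in zip(boundaries, boundaries[1:])]
    return F, Counter(F)

E = [5, 12, 20, 27, 33, 41]   # hypothetical sorted boundary positions
F, counts = sentence_length_counts(E)
```

Each entry of `counts` maps a sentence length to the number of sentences having that length.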

3) Establish corresponding rules for the logical relations:

Intersection: for character x and character y, the interval set of x is

$$\{x_1 \in \{a_1 < x_1 < b_1\},\ x_2 \in \{a_2 < x_2 < b_2\},\ x_3 \in \{a_3 < x_3 < b_3\},\ \ldots,\ x_n \in \{a_n < x_n < b_n\}\}$$

and the interval set of y is

$$\{y_1 \in \{c_1 < y_1 < d_1\},\ y_2 \in \{c_2 < y_2 < d_2\},\ y_3 \in \{c_3 < y_3 < d_3\},\ \ldots,\ y_n \in \{c_n < y_n < d_n\}\}$$

Suppose $a_2 = c_2$, $b_2 = d_2$; $a_3 = c_3$, $b_3 = d_3$; $a_5 = c_5$, $b_5 = d_5$.

Then $x \cap y = \{\{a_2 < x < b_2\},\ \{a_3 < x < b_3\},\ \{a_5 < x < b_5\}\}$

or, equivalently, $x \cap y = \{\{c_2 < y < d_2\},\ \{c_3 < y < d_3\},\ \{c_5 < y < d_5\}\}$;

Intersection of the intersection: given the intersection established above, let

$z \in \{y_2 - x_2,\ y_3 - x_3,\ y_5 - x_5\}$ with $y_2 - x_2 = y_5 - x_5 = c$, where z denotes the set of position differences between the characters within the same interval. Then

$x \cap y = \{\{a_2 < x < b_2\},\ \{a_3 < x < b_3\},\ \{a_5 < x < b_5\}\} \cap \{z \in \{y_2 - x_2,\ y_5 - x_5\}\}$, that is, $x \in \{a < x < b\} \cap y \in \{c < y < d\} \cap \{b - a = c\}$;

Difference set 1: given the intersection established above, with $x \cap y = \{\{a_2 < x_2 < b_2\},\ \{a_3 < x_3 < b_3\},\ \{a_5 < x_5 < b_5\}\}$,

$$\{x_1 \in \{a_1 < x_1 < b_1\},\ x_2 \in \{a_2 < x_2 < b_2\},\ x_3 \in \{a_3 < x_3 < b_3\},\ \ldots,\ x_n \in \{a_n < x_n < b_n\}\} - x \cap y = \{x_1 \in \{a_1 < x_1 < b_1\},\ x_4 \in \{a_4 < x_4 < b_4\},\ x_6 \in \{a_6 < x_6 < b_6\},\ \ldots,\ x_n \in \{a_n < x_n < b_n\}\}$$

Difference set 2: given the intersection established above, with $x \cap y = \{\{c_2 < y_2 < d_2\},\ \{c_3 < y_3 < d_3\},\ \{c_5 < y_5 < d_5\}\}$,

$$\{y_1 \in \{c_1 < y_1 < d_1\},\ y_2 \in \{c_2 < y_2 < d_2\},\ y_3 \in \{c_3 < y_3 < d_3\},\ \ldots,\ y_n \in \{c_n < y_n < d_n\}\} - x \cap y = \{y_1 \in \{c_1 < y_1 < d_1\},\ y_4 \in \{c_4 < y_4 < d_4\},\ y_6 \in \{c_6 < y_6 < d_6\},\ \ldots,\ y_n \in \{c_n < y_n < d_n\}\}$$
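
One way to picture the intersection and the two difference sets is to store each character's interval set as a mapping from a sentence interval (start, end) to the character's position in that sentence. That layout, the function names and the sample data are assumptions for illustration; the sketch also assumes at most one occurrence of a character per sentence.

```python
def common_sentences(x_intervals, y_intervals):
    """Intersection: sentences containing both characters, with both positions."""
    return {iv: (x_intervals[iv], y_intervals[iv])
            for iv in x_intervals if iv in y_intervals}

def only_in(first, second):
    """Difference set: sentences that appear in `first` but not in `second`."""
    return {iv: pos for iv, pos in first.items() if iv not in second}

# Hypothetical interval sets: (sentence start, sentence end) -> character position.
x = {(0, 10): 3, (10, 25): 14, (25, 40): 30}
y = {(10, 25): 20, (40, 55): 47}

both = common_sentences(x, y)   # x ∩ y
x_without_y = only_in(x, y)     # difference set 1
y_without_x = only_in(y, x)     # difference set 2
```

Two sentences count as the same one only when both boundaries agree, which mirrors the condition $a_i = c_i$, $b_i = d_i$ above.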

4) Extract the logical relations contained in the input text string:

From steps 2) and 3):

$x \land y$: x and y both occur in the same sentence;

$x \land \lnot y$: x occurs without y in the same sentence;

$\lnot x \land y$: y occurs without x in the same sentence;

$y_i - x_i = p$: the position difference between y and x in the same sentence equals a constant p;

$y_i - x_i > p$: the position difference between y and x in the same sentence is greater than a constant p;

$y_i - x_i < p$: the position difference between y and x in the same sentence is less than a constant p;

$b_i - a_i = Q$: the sentence length equals a constant Q;

$b_i - a_i > Q$: the sentence length is greater than a constant Q;

$b_i - a_i < Q$: the sentence length is less than a constant Q;
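
The comparison rules of step 4) can be expressed as small predicates. The sketch below is hypothetical: the function names, the symbol table and the sample positions are assumptions, and the source does not specify how the input string encodes these comparisons.

```python
import operator

# Comparison symbols used by the rules in step 4).
COMPARATORS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def position_difference_matches(x_pos, y_pos, symbol, p):
    """True when (y_pos - x_pos) <symbol> p holds."""
    return COMPARATORS[symbol](y_pos - x_pos, p)

def sentence_length_matches(a, b, symbol, q):
    """True when (b - a) <symbol> q holds for a sentence bounded by a and b."""
    return COMPARATORS[symbol](b - a, q)

# Hypothetical sentence bounded by positions 100 and 130, with x at 104 and y at 112.
same_gap = position_difference_matches(104, 112, "=", 8)
long_sentence = sentence_length_matches(100, 130, ">", 20)
short_sentence = sentence_length_matches(100, 130, "<", 20)
```

Keeping the comparison symbol as data lets the nine combinations of step 5) reuse the same two predicates.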

5) Combine the rules:

$x \land y \land \lnot z$: x and y both occur, without z, in the same sentence;

$(y_i - x_i) = p \land (b_i - a_i) = Q$: in the same sentence, the position difference between y and x is p and the sentence length is Q;

$(y_i - x_i) = p \land (b_i - a_i) > Q$: in the same sentence, the position difference between y and x is p and the sentence length is greater than Q;

$(y_i - x_i) = p \land (b_i - a_i) < Q$: in the same sentence, the position difference between y and x is p and the sentence length is less than Q;

$(y_i - x_i) > p \land (b_i - a_i) = Q$: in the same sentence, the position difference between y and x is greater than p and the sentence length is Q;

$(y_i - x_i) > p \land (b_i - a_i) > Q$: in the same sentence, the position difference between y and x is greater than p and the sentence length is greater than Q;

$(y_i - x_i) > p \land (b_i - a_i) < Q$: in the same sentence, the position difference between y and x is greater than p and the sentence length is less than Q;

$(y_i - x_i) < p \land (b_i - a_i) = Q$: in the same sentence, the position difference between y and x is less than p and the sentence length is Q;

$(y_i - x_i) < p \land (b_i - a_i) > Q$: in the same sentence, the position difference between y and x is less than p and the sentence length is greater than Q;

$(y_i - x_i) < p \land (b_i - a_i) < Q$: in the same sentence, the position difference between y and x is less than p and the sentence length is less than Q;
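
Combining rules can be sketched with set operations on sentence intervals: conjunction becomes set intersection and negation becomes set difference. The index layout (character -> {interval: position}), the helper name `sentences_with` and the sample data are assumptions made for illustration.

```python
def sentences_with(index, ch):
    """Set of sentence intervals that contain character ch."""
    return set(index.get(ch, {}))

# Hypothetical index: character -> {(sentence start, sentence end): position}.
index = {
    "x": {(0, 10): 3, (10, 25): 14, (25, 40): 30},
    "y": {(10, 25): 20, (25, 40): 33},
    "z": {(25, 40): 38},
}

x_and_y = sentences_with(index, "x") & sentences_with(index, "y")

# x AND y AND NOT z: both x and y occur, z does not.
x_y_not_z = x_and_y - sentences_with(index, "z")

# (y - x) = p AND (b - a) > Q, evaluated sentence by sentence.
p, Q = 6, 10
hits = sorted(
    (a, b) for (a, b) in x_and_y
    if index["y"][(a, b)] - index["x"][(a, b)] == p and b - a > Q
)
```

The other eight combinations differ only in the two comparison operators used inside the filter.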

6) Display the results:

Extract the logical relations present according to step 4), combine them according to step 5), query the index tables built in step 1), and display the query results.

The beneficial effects of the invention are as follows. The invention addresses the retrieval problem for ancient documents and proposes an efficient and reasonable computational method. It not only provides retrieval for classical Chinese texts but can also compile certain statistics automatically. By defining rules for logical relations and combining those rules, it lays a foundation for knowledge discovery.

Brief description of the drawings

Fig. 1 is a flow chart of building the index system in the present invention;

Fig. 2 is the overall flow chart of the present invention.

Embodiment

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

Embodiment 1: As shown in Figs. 1 and 2, a unified logical retrieval method for ancient documents based on index relations comprises the following steps:

1) Build the index system:

Read the text;

2) Count the number of occurrences of sentences of a fixed length:

Read the third index table;

Because a full stop, question mark or exclamation mark marks the pause at the end of a sentence, reading the third index table yields the index information of the full stops, question marks and exclamation marks in each document, denoted A, B and C respectively: $A = [a_1, a_2, a_3, \ldots, a_n]$, $B = [b_1, b_2, b_3, \ldots, b_n]$, $C = [c_1, c_2, c_3, \ldots, c_n]$, with $a_1 < a_2 < a_3 < \cdots < a_n$, $b_1 < b_2 < b_3 < \cdots < b_n$, $c_1 < c_2 < c_3 < \cdots < c_n$, and the values $a_1, \ldots, a_n$, $b_1, \ldots, b_n$, $c_1, \ldots, c_n$ all mutually distinct (that is, no two of the values these letters represent are equal). A, B and C stand for the punctuation marks full stop, question mark and exclamation mark; $a_1$ to $a_n$ are the positions at which the full stop occurs in the third index table, $b_1$ to $b_n$ the positions at which the question mark occurs, and $c_1$ to $c_n$ the positions at which the exclamation mark occurs;

Merge the sorted A, B and C, defining the sets D and E;

Let $E = [e_1, e_2, e_3, \ldots, e_n]$ with $e_1 < e_2 < e_3 < \cdots < e_n$, and define the set F as $F = [e_2 - e_1,\ e_3 - e_2,\ e_4 - e_3,\ \ldots,\ e_n - e_{n-1}]$;

Count the number of times each identical value occurs in F;

3) Establish corresponding rules for the logical relations:

The intersection, the intersection of the intersection, difference set 1 and difference set 2 are established for the interval sets of character x and character y, as defined in step 3) above;

4) Extract the logical relations contained in the input text string:

From steps 2) and 3):

$x \land y$: x and y both occur in the same sentence;

$x \land \lnot y$: x occurs without y in the same sentence;

$\lnot x \land y$: y occurs without x in the same sentence;

$b_i - a_i = Q$: the sentence length equals a constant Q;

$b_i - a_i > Q$: the sentence length is greater than a constant Q;

$b_i - a_i < Q$: the sentence length is less than a constant Q;

5) Combine the rules:

$x \land y \land \lnot z$: x and y both occur, without z, in the same sentence;

6) Display the results:

Illustration: As shown in Fig. 1, taking the Four Great Classical Novels as an example, the index system is built: the text is read and the first index table is established, containing each document number and the corresponding document title. The first index table is shown in Table 1:

Table 1：

Document number	Document title
DocID_0	The Romance of the Three Kingdoms.txt
DocID_1	Water Margin.txt
DocID_2	A Dream of Red Mansions.txt
DocID_3	Journey to the West.txt
……	……

Based on the first index table above, the second index table is established using the method of Embodiment 1. It contains every distinct character in all the documents and the documents in which that character appears, and it can also record how many times the character occurs. Taking a portion of the characters, the second index table is shown in Table 2:

Table 2:

Based on the second index table above, the third index table is established using the method of Embodiment 1. For the document A Dream of Red Mansions, it contains every distinct character and the positions of that character. The third index table is shown in Table 3:

Table 3:

The number of occurrences of sentences of a fixed length is counted using the method of Embodiment 1. Reading the third index table yields the index information of the full stops, question marks and exclamation marks, as shown in Table 4:

Table 4:

The positions of the punctuation marks are sorted in ascending order using the method of Embodiment 1. There are 34,390 positions in total; sorting them yields Table 5:

Table 5:

The sentence lengths are computed using the method of Embodiment 1, as shown in Table 6:

Table 6:

The number of occurrences of each fixed sentence length is counted using the method of Embodiment 1, as shown in Table 7:

Table 7:

Corresponding rules are established for the logical relations using the method of Embodiment 1: the intersection, the intersection of the intersection, difference set 1 and difference set 2. The interval set of the character "because" is shown in Table 8, the interval set of the character "so" is shown in Table 9, and the interval set of sentences containing both "because" and "so" is shown in Table 10:

Table 8：

Interval	Position of "because"
[122314,122338]	[122317]
[123276,123335]	[123307]
[253308,253339]	[253331]
[255769,255802]	[255784]
……	……
[91142,91171]	[91158]

Table 9:

Interval	Position of "so"
[101437,101480]	[101471]
[108878,108926]	[108918]
[111389,111416]	[111372]
[255769,255802]	[255794]
……	……
[99754,99836]	[99829]

Table 10:

Intersection interval	Position of "because"	Position of "so"
[255769,255802]	[255784]	[255794]
[398953,399003]	[398956]	[398996]
[459956,460018]	[459964]	[459988]
[515751,515794]	[515755]	[515780]
[66039,66085]	[66001]	[66028]
[749675,749746]	[749685]	[749697]
[91142,91171]	[91158]	[91166]
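
As a worked check, the rows printed in Tables 8 and 9 can be intersected directly. Only the visible rows are used here, since the elided rows are not available, so just the first row of Table 10 is reproduced. The dictionary layout and variable names are assumptions.

```python
# Visible rows of Table 8: interval -> position of "because".
because = {
    (122314, 122338): 122317,
    (123276, 123335): 123307,
    (253308, 253339): 253331,
    (255769, 255802): 255784,
    (91142, 91171): 91158,
}
# Visible rows of Table 9: interval -> position of "so".
so = {
    (101437, 101480): 101471,
    (108878, 108926): 108918,
    (111389, 111416): 111372,
    (255769, 255802): 255794,
    (99754, 99836): 99829,
}

# Sentences containing both characters, with both positions.
shared = {iv: (because[iv], so[iv]) for iv in because if iv in so}

# Position difference (y - x) and sentence length (b - a) for each shared sentence.
stats = {iv: (y - x, iv[1] - iv[0]) for iv, (x, y) in shared.items()}
```

For the shared interval [255769,255802] the position difference is 10 and the sentence length is 33, the quantities that the rules $y_i - x_i$ and $b_i - a_i$ compare against p and Q.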

Using the method of Embodiment 1, the logical relations contained in the input text string are extracted, as shown in Table 11:

Table 11:

Using the method of Embodiment 1, the rules are combined, as shown in Table 12:

Table 12:

Using the method of Embodiment 1, the query results are displayed; the results that satisfy the query are displayed according to Table 13.

Table 13:

With reference to Table 13, the input character string is split into legal substrings, and the results that satisfy the conditions are finally displayed.

The specific embodiments of the present invention have been explained in detail above with reference to the accompanying drawings. The present invention is not, however, limited to these embodiments; within the knowledge possessed by a person of ordinary skill in the art, various changes may be made without departing from the concept of the present invention.

Claims

1. A unified logical retrieval method for ancient documents based on index relations, characterized in that it comprises the following steps:

1) Build the index system:

Read the text;

Establish the first index table, which contains each document number and the corresponding document title;

Establish the second index table, which contains every distinct character in all the documents and the documents in which that character appears;

Establish the third index table, which contains every distinct character and the positions of that character in each document;

Write the first, second and third index tables to an index file and save it;

2) Count the number of occurrences of sentences of a fixed length:

Read the third index table;

Because a full stop, question mark or exclamation mark marks the pause at the end of a sentence, reading the third index table yields the index information of the full stops, question marks and exclamation marks in each document, denoted A, B and C respectively: $A = [a_1, a_2, a_3, \ldots, a_n]$, $B = [b_1, b_2, b_3, \ldots, b_n]$, $C = [c_1, c_2, c_3, \ldots, c_n]$, with $a_1 < a_2 < a_3 < \cdots < a_n$, $b_1 < b_2 < b_3 < \cdots < b_n$, $c_1 < c_2 < c_3 < \cdots < c_n$, and the values $a_1, \ldots, a_n$, $b_1, \ldots, b_n$, $c_1, \ldots, c_n$ all mutually distinct. A, B and C stand for the punctuation marks full stop, question mark and exclamation mark; $a_1$ to $a_n$ are the positions at which the full stop occurs in the third index table, $b_1$ to $b_n$ the positions at which the question mark occurs, and $c_1$ to $c_n$ the positions at which the exclamation mark occurs;

Merge the sorted A, B and C, defining the sets D and E:

First merge A and B. Each sequence maintains one position pointer, and the two pointers move backward through their respective lists. Compare the heads $a_1$ and $b_1$ of the two sequences; if $a_1 < b_1$, D becomes $[a_1, b_1]$ and both pointers advance by one. Then compare $a_2$ with $b_2$; if $b_2 < a_2$, D becomes $[a_1, b_1, b_2]$, and the pointer of the sequence holding the smaller value advances by one, so that $b_3$ is next compared with $a_2$. The values are placed in ascending order in this way until every number in A and B has been taken. The numbers in sequence C are then compared with those in sequence D by the same principle and stored in set E, so that A, B and C are merged into a single set E arranged in ascending order;

Let $E = [e_1, e_2, e_3, \ldots, e_n]$ with $e_1 < e_2 < e_3 < \cdots < e_n$, and define the set F as $F = [e_2 - e_1,\ e_3 - e_2,\ e_4 - e_3,\ \ldots,\ e_n - e_{n-1}]$;

Count the number of times each identical value occurs in F;

3) Establish corresponding rules for the logical relations:

Intersection: for character x and character y, the interval set of x is

$$\{x_1 \in \{a_1 < x_1 < b_1\},\ x_2 \in \{a_2 < x_2 < b_2\},\ x_3 \in \{a_3 < x_3 < b_3\},\ \ldots,\ x_n \in \{a_n < x_n < b_n\}\}$$

and the interval set of y is

$$\{y_1 \in \{c_1 < y_1 < d_1\},\ y_2 \in \{c_2 < y_2 < d_2\},\ y_3 \in \{c_3 < y_3 < d_3\},\ \ldots,\ y_n \in \{c_n < y_n < d_n\}\}$$

Suppose $a_2 = c_2$, $b_2 = d_2$; $a_3 = c_3$, $b_3 = d_3$; $a_5 = c_5$, $b_5 = d_5$.

Then $x \cap y = \{\{a_2 < x < b_2\},\ \{a_3 < x < b_3\},\ \{a_5 < x < b_5\}\}$

or, equivalently, $x \cap y = \{\{c_2 < y < d_2\},\ \{c_3 < y < d_3\},\ \{c_5 < y < d_5\}\}$;

Intersection of the intersection: given the intersection established above, let

$z \in \{y_2 - x_2,\ y_3 - x_3,\ y_5 - x_5\}$ with $y_2 - x_2 = y_5 - x_5 = c$, where z denotes the set of position differences between the characters within the same interval. Then

$x \cap y = \{\{a_2 < x < b_2\},\ \{a_3 < x < b_3\},\ \{a_5 < x < b_5\}\} \cap \{z \in \{y_2 - x_2,\ y_5 - x_5\}\}$, that is, $x \in \{a < x < b\} \cap y \in \{c < y < d\} \cap \{b - a = c\}$;

Difference set 1: given the intersection established above, with $x \cap y = \{\{a_2 < x_2 < b_2\},\ \{a_3 < x_3 < b_3\},\ \{a_5 < x_5 < b_5\}\}$,

$$\{x_1 \in \{a_1 < x_1 < b_1\},\ x_2 \in \{a_2 < x_2 < b_2\},\ x_3 \in \{a_3 < x_3 < b_3\},\ \ldots,\ x_n \in \{a_n < x_n < b_n\}\} - x \cap y = \{x_1 \in \{a_1 < x_1 < b_1\},\ x_4 \in \{a_4 < x_4 < b_4\},\ x_6 \in \{a_6 < x_6 < b_6\},\ \ldots,\ x_n \in \{a_n < x_n < b_n\}\}$$

Difference set 2: given the intersection established above, with $x \cap y = \{\{c_2 < y_2 < d_2\},\ \{c_3 < y_3 < d_3\},\ \{c_5 < y_5 < d_5\}\}$,

$$\{y_1 \in \{c_1 < y_1 < d_1\},\ y_2 \in \{c_2 < y_2 < d_2\},\ y_3 \in \{c_3 < y_3 < d_3\},\ \ldots,\ y_n \in \{c_n < y_n < d_n\}\} - x \cap y = \{y_1 \in \{c_1 < y_1 < d_1\},\ y_4 \in \{c_4 < y_4 < d_4\},\ y_6 \in \{c_6 < y_6 < d_6\},\ \ldots,\ y_n \in \{c_n < y_n < d_n\}\}$$

4) Extract the logical relations contained in the input text string:

From steps 2) and 3):

$x \land y$: x and y both occur in the same sentence;

$x \land \lnot y$: x occurs without y in the same sentence;

$\lnot x \land y$: y occurs without x in the same sentence;

$y_i - x_i = p$: the position difference between y and x in the same sentence equals a constant p;

$y_i - x_i > p$: the position difference between y and x in the same sentence is greater than a constant p;

$y_i - x_i < p$: the position difference between y and x in the same sentence is less than a constant p;

$b_i - a_i = Q$: the sentence length equals a constant Q;

$b_i - a_i > Q$: the sentence length is greater than a constant Q;

$b_i - a_i < Q$: the sentence length is less than a constant Q;

5) Combine the rules:

$x \land y \land \lnot z$: x and y both occur, without z, in the same sentence;

$(y_i - x_i) = p \land (b_i - a_i) = Q$: in the same sentence, the position difference between y and x is p and the sentence length is Q;

$(y_i - x_i) = p \land (b_i - a_i) > Q$: in the same sentence, the position difference between y and x is p and the sentence length is greater than Q;

$(y_i - x_i) = p \land (b_i - a_i) < Q$: in the same sentence, the position difference between y and x is p and the sentence length is less than Q;

$(y_i - x_i) > p \land (b_i - a_i) = Q$: in the same sentence, the position difference between y and x is greater than p and the sentence length is Q;

$(y_i - x_i) > p \land (b_i - a_i) > Q$: in the same sentence, the position difference between y and x is greater than p and the sentence length is greater than Q;

$(y_i - x_i) > p \land (b_i - a_i) < Q$: in the same sentence, the position difference between y and x is greater than p and the sentence length is less than Q;

$(y_i - x_i) < p \land (b_i - a_i) = Q$: in the same sentence, the position difference between y and x is less than p and the sentence length is Q;

$(y_i - x_i) < p \land (b_i - a_i) > Q$: in the same sentence, the position difference between y and x is less than p and the sentence length is greater than Q;

$(y_i - x_i) < p \land (b_i - a_i) < Q$: in the same sentence, the position difference between y and x is less than p and the sentence length is less than Q;

6) Display the results:

Extract the logical relations present according to step 4), combine them according to step 5), query the index tables built in step 1), and display the query results.