KR100955189B1 - Method and system for creating signature data set for searching document - Google Patents

Info

Publication number
KR100955189B1
Authority
KR
South Korea
Prior art keywords
signature
signature data
data
document
frequency value
Prior art date
Application number
KR1020080078483A
Other languages
Korean (ko)
Other versions
KR20100019767A (en)
Inventor
심규철
Original Assignee
엔에이치엔(주)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 엔에이치엔(주)
Priority to KR1020080078483A
Publication of KR20100019767A
Application granted
Publication of KR100955189B1

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Library & Information Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)

Abstract

A method and system for generating a signature data set for document retrieval are disclosed. The signature data set generation method may include extracting at least one piece of identification data from each of a plurality of documents, generating signature data for each of the documents using the identification data, determining a document frequency value for each signature data, and generating a signature data set based on the document frequency value.

Identification data, hash value, signature data, signature string, document frequency value

Description

METHOD AND SYSTEM FOR CREATING SIGNATURE DATA SET FOR SEARCHING DOCUMENT

The present invention relates to a method and system for generating a signature data set for document retrieval, and more particularly, to a method and system for generating a signature data set using signature data derived from identification data corresponding to a document component.

With the recent widespread use of the Internet, the number of digital documents (for example, newspaper articles written on the Internet, blog posts, and Internet homepage documents) is growing almost without limit. Accordingly, a large amount of document processing time is required to search for similar documents in a large document set or to search for documents related to a set of search terms input by a user, which degrades document retrieval quality.

As a result, there is a demand for a method of searching large volumes of digital documents more quickly. In particular, there is a demand for a method that can improve document processing speed by using the features of the documents.

The present invention provides a method and system for generating a signature data set capable of extracting key features of a document by generating signature data using identification data extracted from the document.

The present invention provides a method and system for generating a signature data set capable of extracting document frequency values more efficiently by creating a signature string list representing an efficient data structure for the generated signature data.

The present invention provides a method and system for generating a signature data set that can improve document retrieval speed by selecting signature data whose document frequency value satisfies a preset criterion.

In accordance with an embodiment of the present invention, a method of generating a signature data set may include extracting at least one piece of identification data from each of a plurality of documents, generating signature data for each of the documents using the identification data, determining a document frequency value for each signature data, and generating a signature data set based on the document frequency value.

According to an aspect of the present invention, determining the document frequency value for each signature data may include retrieving the signature data generated for each of the documents from a pre-registered signature string list, increasing the document frequency value of the signature data if the signature data is present in the signature string list, and registering the signature data in the signature string list if the signature data is not present in the signature string list.

A system for generating a signature data set according to an embodiment of the present invention may include an identification data extraction unit for extracting at least one piece of identification data from each of a plurality of documents, a signature data generation unit for generating signature data for each of the documents using the identification data, a document frequency value determination unit for determining a document frequency value for each signature data, and a signature data set generation unit for generating a signature data set based on the document frequency value.

According to an aspect of the present invention, the document frequency value determination unit may include a signature data retrieval unit for retrieving the signature data generated for each document from a pre-registered signature string list, a document frequency value counting unit for increasing the document frequency value of the signature data when the signature data is present in the signature string list, and a signature list registration unit for registering the signature data in the signature string list when the signature data is not present in the signature string list.

According to the present invention, there is provided a signature data set generation method and system capable of extracting key features of a document by generating signature data using identification data extracted from the document.

According to the present invention, there is provided a method and system for generating a signature data set that can extract a document frequency value more efficiently by creating a signature string list representing an efficient data structure for the generated signature data.

According to the present invention, there is provided a signature data set generation method and system that can improve document retrieval speed by selecting signature data whose document frequency value satisfies a preset criterion.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. However, the present invention is not limited to these embodiments. Like reference numerals in the drawings denote like elements. The signature data set generation method according to an embodiment of the present invention may be performed by a signature data set generation system.

FIG. 1 is a diagram illustrating a process of searching a document using a signature data set generated according to an embodiment of the present invention.

Referring to FIG. 1, N document retrieval systems 102-1 through 102-N are shown. Each of the document retrieval systems 102-1 to 102-N may receive a document retrieval request from the document collection system 101. Each document retrieval system 102-1 through 102-N may have a document set 103-1 through 103-N. In this case, the documents included in the document sets 103-1 to 103-N may include web documents such as newspaper articles written on the Internet, articles posted on blogs or cafes, and Internet homepage documents.

In one example, document collection system 101 may send a document search request to search for documents associated with the collected document set or search keyword set. The search keyword or document may include at least one word or sentence.

For example, the search keyword may be "travel to Europe" or "to cheaply travel to Europe." The search target information is not limited to the above-mentioned example. When the search target information is a document, each of the document retrieval systems 102-1 to 102-N may determine whether documents included in its document set are duplicates or copies.

In one example, document retrieval systems 102-1 through 102-N may include signature data set generation systems. At this time, the signature data set generation system may generate signature data from each of a plurality of documents included in the document set. At this time, the signature data set generation system may generate a signature data set by extracting signature data having a document frequency value of a predetermined reference value or more for document retrieval.

Then, each of the document retrieval systems 102-1 to 102-N may obtain a document retrieval result for the signature data included in the signature data set. In one example, each of the document retrieval systems 102-1 to 102-N may obtain a document retrieval result using index data for identification data stored by itself or by another document retrieval system.

Here, the index data may store information about the identification data, such as a document frequency (how many documents contain the identification data) or a document list (the list of documents that include the identification data). Since the document sets processed by the document retrieval systems 102-1 to 102-N differ, the index data may also differ for each document retrieval system. The document search result may refer to a search result (document list) for signature data generated from a search keyword or document requested by the document collection system 101.
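The index data described above can be pictured as a mapping from identification data to a document list, with the document frequency derived from the length of that list. The following is only a minimal in-memory sketch under that assumption; the names `IndexEntry` and `build_index` are hypothetical and do not appear in the patent.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    # Documents that contain the identification data, in first-seen order.
    document_list: list = field(default_factory=list)

    @property
    def document_frequency(self) -> int:
        # Number of documents that contain the identification data.
        return len(self.document_list)


def build_index(documents: dict) -> dict:
    """Map each piece of identification data to its IndexEntry.

    `documents` maps a document id to the identification data extracted
    from that document (an assumed input shape).
    """
    index = defaultdict(IndexEntry)
    for doc_id, identification_data in documents.items():
        for ident in set(identification_data):  # count each document once
            index[ident].document_list.append(doc_id)
    return dict(index)


index = build_index({"d1": ["aaaaa", "bbbbb"], "d2": ["aaaaa"]})
print(index["aaaaa"].document_frequency, index["aaaaa"].document_list)
```

Because each retrieval system holds a different document set, each would build its own instance of such an index.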

Then, a document search result for each signature data included in the signature data set may be collected. The document retrieval systems 102-1 to 102-N can process the document retrieval quickly by recycling the previously retrieved document retrieval result for the same signature data.
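Reusing a previously retrieved result for the same signature data amounts to caching. The sketch below illustrates the idea with a plain dictionary; the class `CachedRetriever`, its `search_fn` callback, and the `fake_index` data are all hypothetical stand-ins, since the patent does not specify how the reuse is implemented.

```python
class CachedRetriever:
    """Reuses the document list already retrieved for a signature."""

    def __init__(self, search_fn):
        self._search_fn = search_fn  # hypothetical backend lookup
        self._cache = {}
        self.backend_calls = 0       # how often the backend was actually queried

    def retrieve(self, signature: str) -> list:
        if signature not in self._cache:
            self.backend_calls += 1
            self._cache[signature] = self._search_fn(signature)
        return self._cache[signature]


# Assumed stand-in for a retrieval system's index data.
fake_index = {"aaaaa": ["d1", "d2"]}
retriever = CachedRetriever(lambda sig: fake_index.get(sig, []))

first = retriever.retrieve("aaaaa")
second = retriever.retrieve("aaaaa")  # served from the cache
print(first, retriever.backend_calls)
```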

Hereinafter, a description will be given of a signature data set generation system and method for quickly processing a document search.

FIG. 2 is a flowchart illustrating a signature data set generation method according to an embodiment of the present invention.

In operation S201, the signature data set generation system may extract at least one piece of identification data from each of the plurality of documents. In this case, the plurality of documents may be documents stored in the document sets 103-1 to 103-N shown in FIG. 1.

In one example, the signature data set generation system may extract identification data that is a hash value corresponding to a document component of each of a plurality of documents using a hash function. In this case, the document component may mean a phrase or sentence included in the main content of the document. As a result, at least one identification data may be generated for each document. Here, the identification data may be generated from the document component through hash generation functions such as MD5 and SHA-1.
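As a rough illustration of operation S201, the sketch below hashes each document component with Python's standard `hashlib`. The function name `extract_identification_data`, the UTF-8 encoding, and the assumption that the document has already been split into components are illustrative choices, not details from the patent.

```python
import hashlib


def extract_identification_data(components, algorithm="md5"):
    """Return one hex digest per document component (phrase or sentence).

    `algorithm` may be "md5" or "sha1", the two functions named in the text.
    """
    digests = []
    for component in components:
        h = hashlib.new(algorithm)
        h.update(component.encode("utf-8"))  # encoding is an assumption
        digests.append(h.hexdigest())
    return digests


ids = extract_identification_data(["travel to Europe", "cheap flights"])
print(len(ids), len(ids[0]))
```

The same component always yields the same digest, which is what lets identical content in different documents map to identical identification data.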

In step S202, the signature data set generation system may generate signature data for each of the plurality of documents using the identification data.

For example, when there is one piece of identification data extracted from the document, the signature data set generation system may generate the same signature data as the extracted identification data. When the identification data extracted from the document is two or more, the signature data set generation system may generate the signature data by connecting the identification data. Alternatively, the signature data set generation system may generate new signature data from the identification data.

As a result, the signature data set generation system can generate one signature data per document. A detailed process of generating signature data is described in FIG. 3.
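The three cases in operation S202 can be sketched as follows. The function `generate_signature` and its `combine` parameter are hypothetical; in particular, the text does not say how "new signature data" is derived from several pieces of identification data, so the `"rehash"` branch (SHA-1 over the concatenation) is only one assumed possibility.

```python
import hashlib


def generate_signature(identification_data, combine="concat"):
    """Produce one signature per document from its identification data."""
    if not identification_data:
        raise ValueError("at least one piece of identification data is required")
    if len(identification_data) == 1:
        # One piece of identification data: the signature is identical to it.
        return identification_data[0]
    joined = "".join(identification_data)
    if combine == "concat":
        # Two or more pieces: connect them in order.
        return joined
    if combine == "rehash":
        # Assumed way of deriving new signature data from the pieces.
        return hashlib.sha1(joined.encode("utf-8")).hexdigest()
    raise ValueError(f"unknown combine mode: {combine}")


print(generate_signature(["aaaaa"]))
print(generate_signature(["aaaaa", "bbbbb"]))
```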

In operation S203, the signature data set generation system may determine a document frequency value for each signature data. Here, the document frequency value may mean the number of documents corresponding to each signature data with respect to the signature data extracted for each document. For example, when the signature data extracted for each document includes aaaaa and bbbbb, the document frequency value may mean the number of documents representing aaaaa and the number of documents representing bbbbb.

For example, the process of determining the document frequency value by the signature data set generation system may be as follows. The process of determining the document frequency value described below is merely an example, and the present invention is not limited to the example.

(1) The signature data set generation system can retrieve the signature data generated for each document from a pre-registered signature string list. In more detail, the signature data set generation system may search for the signature data by determining whether the signature data generated for each document exists in the signature string list. The retrieval process can be performed for all of the plurality of documents.

(2) If the signature data exists in the signature string list, the signature data set generation system can increase the document frequency value of the signature data.

(3) If the signature data does not exist in the signature string list, the signature data set generation system may register the signature data in the signature string list.

Through this process, a signature string list indicating a document frequency value for each signature data may be generated. The document frequency value of each signature string can be extracted from the signature string list.
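Steps (1) to (3) form a search-then-increment-or-register loop. The sketch below uses a plain dictionary as a stand-in for the signature string list, purely to show the control flow; the function name `count_document_frequencies` is hypothetical, and the tree and array structures of FIG. 4 are not modeled here.

```python
def count_document_frequencies(signatures):
    """Return a mapping from signature data to its document frequency value.

    `signatures` holds one signature per document.
    """
    signature_list = {}  # stand-in for the signature string list
    for signature in signatures:
        if signature in signature_list:      # (1) search, (2) found: increase
            signature_list[signature] += 1
        else:                                # (3) not found: register with 1
            signature_list[signature] = 1
    return signature_list


df = count_document_frequencies(["aaaaa", "bbbbb", "aaaaa"])
print(df)
```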

A process of determining a document frequency value for each signature data is described in detail with reference to FIG. 4.

In step S204, the signature data set generation system may generate the signature data set based on the document frequency value.

In one example, the signature data set generation system may generate a signature data set by extracting signature data whose document frequency value exceeds a predetermined reference value. For example, the signature data set generation system may extract signature data whose document frequency value is higher than N among the signature data generated for each document. Alternatively, the signature data set generation system may extract signature data having a document frequency value equal to or greater than a preset threshold (e.g., M). The signature data set can be used for efficient document retrieval.
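Operation S204 reduces to filtering signatures by their document frequency value. The sketch below covers both variants mentioned in the text (strictly greater than N, or at least M); the function `build_signature_data_set` and its `inclusive` flag are hypothetical names used for illustration.

```python
def build_signature_data_set(document_frequencies, reference_value, inclusive=False):
    """Select signatures whose document frequency passes the reference value.

    inclusive=False keeps values strictly greater than the reference value;
    inclusive=True keeps values equal to or greater than it.
    """
    if inclusive:
        return {s for s, df in document_frequencies.items() if df >= reference_value}
    return {s for s, df in document_frequencies.items() if df > reference_value}


frequencies = {"aaaaa": 10001, "bbbbb": 3, "bbbcc": 1}
print(sorted(build_signature_data_set(frequencies, 3)))
print(sorted(build_signature_data_set(frequencies, 3, inclusive=True)))
```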

FIG. 3 is a diagram for describing a process of generating signature data from a document according to an embodiment of the present invention.

FIG. 3 illustrates the case where one piece of identification data is extracted from the document (left) and the case where two pieces are extracted (right).

As can be seen on the left side of FIG. 3, there may be one document component A included in the document. The document component may be a word or sentence constituting the document. In addition, the document component referred to in the present invention may include a word or sentence representing the document.

The signature data set generation system can then extract the identification data aaaaa corresponding to the document component A. In one example, the identification data may be data obtained by converting a document component according to a hash function.

The signature data set generation system may generate signature data using the identification data. At this time, since there is only one identification data, the signature data set generation system can generate the same signature data aaaaa as the identification data.

As can be seen on the right side of FIG. 3, there may be two document components A and B included in the document. The signature data set generation system can then extract the identification data (aaaaa, bbbbb) corresponding to the document components A and B, respectively.

The signature data set generation system may generate the signature data (aaaaabbbbb) by concatenating the extracted identification data. Alternatively, the signature data set generation system may generate completely new signature data ccccc from the extracted identification data.

The signature data generation process is not limited to the example described with reference to FIG. 3, and various methods may be applied.

FIG. 4 is a diagram illustrating a process of determining a document frequency value of signature data according to an embodiment of the present invention.

The signature data set generation system may determine a document frequency value for each signature data 401. As can be seen in FIG. 4, signature data 401 generated for each document is provided. The signature data set generation system can retrieve the signature data 401 generated for each document from the pre-registered signature string lists 402 and 403.

If the signature data is present in the signature string list, the signature data set generation system may increase the document frequency value of the signature data 401.

If the signature data does not exist in the signature string list, the signature data 401 may be registered in the signature string list.

In one example, the signature data set generation system may search for the signature string of the signature data 401 generated for each of the plurality of documents by comparing it sequentially against the signature string list 402 from the root node to a leaf node. In this case, the signature string list 402 may arrange each signature string in tree form from the root node to a leaf node, and the leaf node may have a document frequency value.

As shown in FIG. 4, the signature data 401 generated for each document may be aaaaa, baaaa, aaabb, aaacc, bbbbb, bbbzz, and zzzzz. The signature data set generation system may then compare the signature string of each signature data with the pre-registered signature string list 402 to search for it. For example, the signature data aaaaa may be searched for in the tree-type signature string list 402 in the order a -> a -> a -> a -> a. In the case of aaabb, the signature data set generation system follows a -> a -> a and then moves to the b branch. Through a similar process, the signature data set generation system may retrieve the signature data 401 generated for each of a plurality of documents from the signature string list 402. If the signature data aaaaa is retrieved, the document frequency value associated with the leaf node, which is the lowest node, is increased by 1 to 10001.

If the signature data is bbbcc, it cannot be found in the signature string list 402. The signature data set generation system may then register the signature string in the signature string list 402 in the order b -> b -> b -> c -> c from the root node to the leaf node. At this time, the document frequency value of bbbcc may be set to 1, which is the default value.
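The tree-form signature string list 402 behaves like a trie keyed on the characters of the signature string. The following sketch is one possible reading of that structure, not the patented implementation; the class names `TrieNode` and `SignatureTrie` and the method `observe` are hypothetical. It assumes, as in the FIG. 4 example, that all signatures have the same length, so a signature always ends at a leaf node.

```python
class TrieNode:
    def __init__(self):
        self.children = {}           # next character -> child node
        self.document_frequency = 0  # meaningful only where a signature ends


class SignatureTrie:
    def __init__(self):
        self.root = TrieNode()

    def observe(self, signature):
        """Walk root -> leaf; increase the count, or register the path with 1."""
        node = self.root
        for symbol in signature:
            child = node.children.get(symbol)
            if child is None:        # branch missing: register it
                child = TrieNode()
                node.children[symbol] = child
            node = child
        node.document_frequency += 1
        return node.document_frequency

    def frequency(self, signature):
        """Return the stored document frequency value, or 0 if not registered."""
        node = self.root
        for symbol in signature:
            node = node.children.get(symbol)
            if node is None:
                return 0
        return node.document_frequency


trie = SignatureTrie()
for sig in ["aaaaa", "baaaa", "aaabb", "aaaaa", "bbbcc"]:
    trie.observe(sig)
print(trie.frequency("aaaaa"), trie.frequency("bbbcc"), trie.frequency("zzzzz"))
```

Signatures that share a prefix, such as aaaaa and aaabb, share the nodes for a -> a -> a, which is why the search for aaabb only branches at the fourth character.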

In one example, the signature data set generation system may retrieve the signature string of the signature data 401 generated for each of the plurality of documents by repeating the sorting and searching in the signature string list 403. In this case, the signature string list may be a structure in which the signature strings are arranged and displayed according to the array type, and each signature string has a document frequency value.

The signature data set generation system may sort the signature strings included in the signature string list 403 and retrieve the signature data 401 based on the sorted signature strings. For example, when the signature data is aaaaa, the signature string list 403 may be sorted and searched in the order of the signature string of the signature data (a -> a -> a -> a -> a). If the signature data is aaaaa, it is retrieved from the signature string list 403, so the signature data set generation system can increase its document frequency value. However, if the signature data is ddddd, it is not retrieved from the signature string list 403, so the signature data set generation system can register the signature string ddddd in the signature string list 403. At this time, the document frequency value of ddddd may be set to 1, which is the default value.
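For the array-form signature string list 403, one way to combine sorting and searching is to keep the array sorted at all times and locate each signature by binary search. That is an assumption about how "repeating the sorting and searching" might be realized; the sketch uses Python's standard `bisect` module, and the class name `SortedSignatureList` is hypothetical.

```python
import bisect


class SortedSignatureList:
    def __init__(self):
        self._signatures = []   # signature strings, kept in sorted order
        self._frequencies = []  # document frequency values, parallel array

    def observe(self, signature):
        """Binary-search for `signature`; increase its count or insert it with 1."""
        i = bisect.bisect_left(self._signatures, signature)
        if i < len(self._signatures) and self._signatures[i] == signature:
            self._frequencies[i] += 1            # found: increase
        else:
            self._signatures.insert(i, signature)  # not found: register
            self._frequencies.insert(i, 1)         # default value 1
        return self._frequencies[i]

    def items(self):
        return list(zip(self._signatures, self._frequencies))


array_list = SortedSignatureList()
for sig in ["aaaaa", "ddddd", "aaaaa", "bbbbb"]:
    array_list.observe(sig)
print(array_list.items())
```

Inserting at the position returned by the search keeps the array sorted, so no separate sorting pass is needed in this variant.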

As a result, the signature data set generation system may increase the document frequency value when the signature data generated for each of the plurality of documents is retrieved from the signature string list 402 or 403, and may register the signature data in the signature string list when the signature data is not retrieved. Through this process, a document frequency value for each signature data may be determined for a plurality of documents.

FIG. 5 is a block diagram illustrating a signature data set generation system according to an embodiment of the present invention.

Referring to FIG. 5, the signature data set generation system may include an identification data extractor 501, a signature data generator 502, a document frequency value determiner 503, and a signature data set generator 504.

The identification data extractor 501 may extract at least one identification data from each of the plurality of documents. For example, the identification data extractor 501 may extract identification data that is a hash value corresponding to a document component of each of the plurality of documents by using a hash function.

In this case, the document component may mean a phrase or sentence included in the main content of the document. The identification data may be data converted from a document component through hash generation functions such as MD5 and SHA-1.

The signature data generator 502 may generate signature data for each of the documents using the identification data.

For example, when there is one piece of identification data extracted from the document, the signature data generation unit 502 may generate the same signature data as the identification data. When two or more pieces of identification data are extracted from the document, the signature data generation unit 502 may connect the identification data to generate signature data or generate new signature data from the identification data.

In one example, one signature data may be generated for each document. That is, the signature data represents the inherent characteristics of the document, and may be used later when the document retrieval system 102-1 searches for the document. Various methods may be applied to generate the signature data from the identification data.

The document frequency value determiner 503 may determine a document frequency value for each signature data. As shown in FIG. 5, the document frequency value determining unit 503 may include a signature data retrieval unit 505, a document frequency value counting unit 506, and a signature list registration unit 507.

The signature data retrieval unit 505 may retrieve signature data generated for each of the documents from a pre-registered signature string list. For example, the signature data retrieval unit 505 may sequentially compare and retrieve signature strings of signature data generated for each of the plurality of documents from the root node to the leaf node of the signature string list. In this case, the signature string list may have a structure in which each signature string is arranged sequentially in tree form from the root node to a leaf node, and the leaf node has a document frequency value.

For example, the signature data retrieval unit 505 may search the signature string of the signature data generated for each of the plurality of documents by repeating the sorting and searching in the signature string list. In this case, the signature string list may be a structure in which the signature strings are arranged and displayed according to the array type, and each signature string has a document frequency value.

When the signature data exists in the signature string list, the document frequency value counting unit 506 may increase the document frequency value of the signature data. At this time, the document frequency value may be set to 1 as a default value, and thereafter, the document frequency value may be increased by the number of documents for signature data generated for each of a plurality of documents.

If signature data does not exist in the signature string list, the signature list registration unit 507 may register the signature data in the signature string list. The document frequency value of the registered signature data may be set to 1, which is the default value.

As a result, the document frequency value for each signature data may be determined by repeatedly searching and registering the signature data generated for each of the plurality of documents.

The signature data set generator 504 may generate the signature data set based on the document frequency value. For example, the signature data set generation unit 504 may generate a signature data set by extracting signature data whose document frequency value exceeds a preset reference value.

The signature data set generation unit 504 can generate a signature data set by extracting, from the signature data generated for each document, the signature data whose document frequency value is higher than N or the signature data whose document frequency value is equal to or greater than a preset threshold (for example, M). As a result, the signature data set generation unit 504 may generate a signature data set by selecting signature data whose document frequency value is equal to or greater than a predetermined reference value from among the signature data generated for all documents included in the document set.

As already mentioned above, when the signature data set is generated through the signature data set generation system 500, each of the document retrieval systems 102-1 through 102-N can obtain a search result (document list) for each signature data included in the signature data set. Each of the document retrieval systems 102-1 to 102-N can perform the document retrieval process more quickly by reusing the already obtained retrieval result for the same signature data.

Parts not described in FIG. 5 may refer to descriptions of FIGS. 1 to 4.

In addition, the method for generating a signature data set according to an embodiment of the present invention may be recorded on a computer-readable medium including program instructions for performing various computer-implemented operations. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The medium or program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code generated by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

As described above, the present invention has been described by way of limited embodiments and drawings, but the present invention is not limited to the above-described embodiments, and those skilled in the art to which the present invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should be understood only by the claims set forth below, and all equal or equivalent modifications thereof will belong to the scope of the present invention.

FIG. 1 is a diagram illustrating a process of searching a document using a signature data set generated according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating a signature data set generation method according to an embodiment of the present invention.

FIG. 3 is a diagram for describing a process of generating signature data from a document according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating a process of determining a document frequency value of signature data according to an embodiment of the present invention.

FIG. 5 is a block diagram illustrating a signature data set generation system according to an embodiment of the present invention.

<Explanation of symbols for the main parts of the drawings>

500: signature dataset generation system

501: identification data extraction unit

502: signature data generation unit

503: document frequency value determination unit

504: signature data set generation unit

Claims (19)

Extracting at least one identification data from each of the plurality of documents; Generating signature data for each of the documents using the identification data; Determining a document frequency value for each signature data; And Generating a signature data set based on the document frequency value Method for generating a signature data set comprising a. The method of claim 1, Extracting the at least one identification data, And extracting identification data which is a hash value corresponding to a document component of each of the plurality of documents using a hash function. The method of claim 1, Generating signature data for each of the documents, If the extracted identification data is one, generate the same signature data as the identification data, And when the extracted identification data is two or more, generating signature data by connecting the identification data or generating new signature data from the identification data. The method of claim 1, The determining of the document frequency value for each signature data may include: Retrieving signature data generated for each of the documents from a pre-registered signature string list; If the signature data is present in the signature string list, increasing a document frequency value of the signature data; And If the signature data does not exist in the signature string list, registering the signature data in the signature string list. Method for generating a signature data set comprising a. The method of claim 4, wherein The step of retrieving the signature data generated for each of the documents, And a signature string of signature data generated for each of the plurality of documents is sequentially searched from a root node to a leaf node of a signature string list. The method of claim 5, The signature string list is And each signature string is sequentially displayed in a tree form from a root node to a leaf node, and the leaf node has a structure having a document frequency value. 
The method of claim 4, wherein The step of retrieving the signature data generated for each of the documents, And searching for a signature string of signature data generated for each of the plurality of documents by repeating sorting and searching in a signature string list. The method of claim 7, wherein The signature string list is And the signature strings are arranged and displayed according to an array form, and each of the signature strings has a document frequency value. The method of claim 1, Generating the signature data set, And a signature data set is generated by extracting signature data whose document frequency value exceeds a predetermined reference value. A computer-readable recording medium in which a program for executing the method of any one of claims 1 to 9 is recorded. An identification data extraction unit for extracting at least one identification data from each of the plurality of documents; A signature data generation unit for generating signature data for each of the documents using the identification data; A document frequency value determining unit which determines a document frequency value for each signature data; And Signature data set generation unit for generating a signature data set based on the document frequency value Signature data set generation system comprising a. The method of claim 11, The identification data extraction unit, And extracting identification data which is a hash value corresponding to a document component of each of the plurality of documents using a hash function. The method of claim 11, The signature data generation unit, If the extracted identification data is one, generate the same signature data as the identification data, And when the extracted identification data is two or more, generating signature data by connecting the identification data or generating new signature data from the identification data. 
14. The system of claim 11, wherein the document frequency value determining unit comprises: a signature data search unit that searches a pre-registered signature string list for the signature data generated for each of the documents; a document frequency value counting unit that increases the document frequency value of the signature data when the signature data is present in the signature string list; and a signature list registration unit that registers the signature data in the signature string list when the signature data is not present in the signature string list.
15. The system of claim 14, wherein the signature data search unit searches for the signature string of the signature data generated for each of the plurality of documents by sequentially comparing it from the root node to a leaf node of the signature string list.
16. The system of claim 15, wherein the signature string list has a structure in which each signature string is represented sequentially in tree form from a root node to a leaf node, and the leaf node holds a document frequency value.
17. The system of claim 14, wherein the signature data search unit searches for the signature string of the signature data generated for each of the plurality of documents by repeatedly sorting and searching the signature string list.
18. The system of claim 17, wherein the signature string list has a structure in which the signature strings are arranged in array form and each signature string has a document frequency value.
19. The system of claim 11, wherein the signature data set generation unit generates the signature data set by extracting signature data whose document frequency value exceeds a predetermined reference value.
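The claimed steps (hash-based identification data, signature generation, document frequency counting in a tree-shaped signature string list, and threshold-based selection) can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: a "document component" is taken to be a non-empty line, SHA-1 truncated to 8 hex characters stands in for the unspecified hash function, a newly registered signature starts with a document frequency of 1, and all function and class names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def extract_identification_data(document: str, max_items: int = 2) -> List[str]:
    """Hash document components into identification data.

    Assumption: a component is a non-empty line, and truncated SHA-1
    stands in for the hash function, which the claims leave unspecified.
    """
    components = [line.strip() for line in document.splitlines() if line.strip()]
    return [
        hashlib.sha1(c.encode("utf-8")).hexdigest()[:8]
        for c in components[:max_items]
    ]


def make_signature(ids: List[str]) -> str:
    """One identifier is used as-is; two or more are concatenated."""
    return ids[0] if len(ids) == 1 else "".join(ids)


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.df = 0  # document frequency; non-zero only where a signature ends


class SignatureTrie:
    """Tree-shaped signature string list with a count at each signature's end."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, signature: str) -> None:
        # Walk from the root one character at a time, creating missing nodes.
        # A signature seen before ends on an existing node and its count is
        # increased; a new one is registered with a count of 1 (assumption).
        node = self.root
        for ch in signature:
            node = node.children.setdefault(ch, TrieNode())
        node.df += 1

    def items(self) -> Iterator[Tuple[str, int]]:
        # Depth-first traversal yielding (signature, document frequency).
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.df:
                yield prefix, node.df
            for ch, child in node.children.items():
                stack.append((prefix + ch, child))


def build_signature_set(documents: Iterable[str], threshold: int) -> Set[str]:
    """Keep signatures whose document frequency exceeds the threshold."""
    trie = SignatureTrie()
    for doc in documents:
        ids = extract_identification_data(doc)
        if ids:  # skip documents with no extractable component
            trie.add(make_signature(ids))
    return {sig for sig, df in trie.items() if df > threshold}
```

For example, given three documents of which two are identical, `build_signature_set(docs, threshold=1)` keeps only the signature shared by the two identical documents. The sorted-array variant of claims 7, 8, 17 and 18 could replace `SignatureTrie` with a sorted list of (signature, count) pairs searched by binary search; that variant is not shown here.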
KR1020080078483A 2008-08-11 2008-08-11 Method and system for creating signature data set for searching document KR100955189B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020080078483A KR100955189B1 (en) 2008-08-11 2008-08-11 Method and system for creating signature data set for searching document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020080078483A KR100955189B1 (en) 2008-08-11 2008-08-11 Method and system for creating signature data set for searching document

Publications (2)

Publication Number Publication Date
KR20100019767A KR20100019767A (en) 2010-02-19
KR100955189B1 KR100955189B1 (en) 2010-04-29

Family

ID=42089986

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020080078483A KR100955189B1 (en) 2008-08-11 2008-08-11 Method and system for creating signature data set for searching document

Country Status (1)

Country Link
KR (1) KR100955189B1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11212980A (en) 1998-01-23 1999-08-06 Fuji Xerox Co Ltd Production of index and retrieval method
KR20080027660A (en) * 2006-09-25 2008-03-28 주식회사 비티웍스 Apparatus and method for management of electronic filing document

Also Published As

Publication number Publication date
KR20100019767A (en) 2010-02-19

Similar Documents

Publication Publication Date Title
US10110658B2 (en) Automatic genre classification determination of web content to which the web content belongs together with a corresponding genre probability
KR100706389B1 (en) Image search method and apparatus considering a similarity among the images
KR101099908B1 (en) System and method for calculating similarity between documents
WO2010047286A1 (en) Search system, search method, and program
Singh et al. OCR++: a robust framework for information extraction from scholarly articles
US7895515B1 (en) Detecting indicators of misleading content in markup language coded documents using the formatting of the document
US20110264997A1 (en) Scalable Incremental Semantic Entity and Relatedness Extraction from Unstructured Text
CN105589894B (en) Document index establishing method and device and document retrieval method and device
CN103324886B (en) A kind of extracting method of fingerprint database in network intrusion detection and system
JP2010262638A (en) Device and method for ranking retrieval result using reliability of representative
CN114372267A (en) Malicious webpage identification and detection method based on static domain, computer and storage medium
CN111160445B (en) Bid file similarity calculation method and device
JP5869948B2 (en) Passage dividing method, apparatus, and program
KR100955189B1 (en) Method and system for creating signature data set for searching document
KR20100105080A (en) Query processing method and apparatus based on n-gram
CN105426490A (en) Tree structure based indexing method
JP2010272006A (en) Relation extraction apparatus, relation extraction method and program
CN112487427A (en) Method, system and server for determining system white list
JP5464082B2 (en) Document processing apparatus, document processing method, document processing program, and computer-readable recording medium recording the document processing program
KR100960488B1 (en) System and method for searching document using signature cache of document
CN111984807B (en) Content screening and storing method and system
JP6044422B2 (en) Abbreviation generation method and abbreviation generation apparatus
JP5906810B2 (en) Full-text search device, program and recording medium
KR102199704B1 (en) An apparatus for selecting a representative token from the detection names of multiple vaccines, a method therefor, and a computer recordable medium storing program to perform the method
Vijayarani et al. An efficient string matching technique for desktop search to detect duplicate files

Legal Events

Date Code Title Description
A201 Request for examination
E701 Decision to grant or registration of patent right
GRNT Written decision to grant
FPAY Annual fee payment (Payment date: 20130329; Year of fee payment: 4)
FPAY Annual fee payment (Payment date: 20160329; Year of fee payment: 7)
FPAY Annual fee payment (Payment date: 20170328; Year of fee payment: 8)
FPAY Annual fee payment (Payment date: 20190401; Year of fee payment: 10)