WO2017096777A1

WO2017096777A1 - Document normalization method, document searching method, corresponding apparatuses, device, and storage medium

Info

Publication number: WO2017096777A1
Application number: PCT/CN2016/087058
Authority: WO
Inventors: 黄岳; 马晋; 张显; 张晓婧; 曹冰; 徐学睿; 李玉鹏; 杰艺
Original assignee: 百度在线网络技术（北京）有限公司
Priority date: 2015-12-07
Filing date: 2016-06-24
Publication date: 2017-06-15
Also published as: CN105447169B; CN105447169A

Abstract

Disclosed are a document normalization method, a document searching method, corresponding apparatuses, a device, and a storage medium. The document normalization method comprises: acquiring documents from one or more website sources; standardizing the acquired documents; according to the degree of similarity among titles of the standardized documents, clustering documents with similar titles to obtain a plurality of document sets; calculating the degree of similarity among the documents in each document set, and according to the calculated degree of similarity among the documents, screening out a document set which meets a requirement; and clustering the same documents in the document set which is screened out and meets the requirement, and gathering publishing sources of the same documents together. The document searching method comprises: receiving a keyword inputted by a user; according to the keyword, searching out documents associated with the keyword; and in the search result, displaying the same documents by means of aggregation, and displaying the publishing source of each document. Compared with the prior art, normalization of the same documents is implemented in the present invention, and a basis for improving the efficiency of document searching is provided.

Description

Document normalization method, document search method, corresponding apparatuses, device, and storage medium

The present application claims priority to Chinese Patent Application No. 201510888584.5, entitled "Document Normalization Method, Document Search Method, and Corresponding Device".

Technical field

The present invention relates to the field of computer application technologies, and in particular, to a document normalization method, a document search method, corresponding apparatuses, a device, and a storage medium.

Background art

When conducting scientific research, researchers need to find research literature to study. A researcher typically needs to locate a specific article precisely and to find as many of its electronic source channels as possible. In practice, however, retrieval is inconvenient in several ways.

Because there are many researchers and a large volume of published research literature, some documents share the same authors and the same title. Users must work out which of these are the same document and which are not before determining which one they actually need. This process is cumbersome and increases the user's search cost.

As shown in Figure 1, when a user searches for a document, that document may be available from multiple electronic source channels, and the data quality of each channel differs. The user cannot obtain all the electronic sources of the same document at once and can only check them one source at a time, which makes it hard to pick out high-quality, licensed resources and degrades the user experience.

Summary of the invention

The invention provides a document normalization method, a document search method, and corresponding apparatuses, so as to normalize identical documents and provide a basis for improving the effectiveness of document search.

The specific technical solutions are as follows:

A document normalization method, including:

Obtain documents from more than one website source;

Standardize the documents obtained;

According to the similarity of the titles of the standardized documents, the documents of similar titles are clustered to obtain a plurality of document collections;

Calculating the similarity of the documents in each document collection, and filtering out the qualified document collection according to the similarity of the calculated documents;

For the selected set of qualified documents, cluster the same documents, and summarize the publication sources of the same documents.

According to a preferred embodiment of the invention, the similarity of the titles of the documents is determined in at least one of the following ways:

Calculating the signature for the title of the document and calculating the similarity between the title signatures of the document;

The Hamming distance between the titles of the documents is calculated, and the similarity between the titles of the documents is determined according to the Hamming distance.

According to a preferred embodiment of the present invention, before the calculating the similarity of the document in each document collection, the method further comprises:

According to the similarity of at least one of the author, the publication source, and the publication year of the standardized documents, clustering similar documents to obtain a plurality of document collections.

According to a preferred embodiment of the present invention, the similarity of at least one of the author, the publication source, and the publication year of the standardized documents is determined in at least one of the following ways:

Combining the author, the publication source, and the publication year of the standardized document into a string, calculating the signature of the merged string, and calculating the similarity between the signatures of the merged strings of the documents;

Combining the author, the publication source, and the publication year of the standardized document into a string, calculating the Hamming distance between the merged strings, and determining the similarity of the author, the publication source, and the publication year of the documents according to the Hamming distance.

According to a preferred embodiment of the present invention, after obtaining a plurality of document collections and calculating the similarity of the documents in each document collection, the method further comprises:

Based on the Hamming distance between the documents in each document collection, screening out the document collections whose Hamming distance is less than or equal to a preset threshold.

According to a preferred embodiment of the present invention, the screening of the qualified document collection according to the similarity of the calculated documents comprises:

In each document collection, calculating the similarity between the documents in that collection according to the weight set in advance for each document attribute, and determining a document collection in which the similarity between the documents is greater than a preset total score to be a qualified document collection.

According to a preferred embodiment of the present invention, clustering the same documents in the screened-out qualified document collections comprises:

Performing a key-value pair forming process separately on each screened-out qualified document collection, the key-value pair forming process comprising: using each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs;

Clustering the key-value pairs that have the same key into one set, based on all the key-value pairs obtained;

Performing the key-value pair forming process separately on each obtained set, until a preset number of iterations is reached.

According to a preferred embodiment of the invention, the standardization comprises:

Performing word segmentation on the full name of the first author of the document, extracting the first letter of each word, and using the combination of the extracted initials as the author of the standardized document; or,

Extracting the longest sentence in the body part of the document abstract and calculating the signature of the longest sentence; or,

Unifying the format of the publication source of the document; or,

Unifying the format of the publication time of the document, or extracting only the publication year of the document.

According to a preferred embodiment of the invention, the calculating a signature for the title of the document comprises:

Dividing the title of the document into a plurality of subtitles, calculating the length of each subtitle, and extracting the subtitles whose length is greater than a preset length;

Determining the n-gram features of the extracted subtitles, wherein n takes each positive integer value from 1 to N, and N is a preset positive integer;

Calculating the signature of the title of the document based on the determined n-gram features.

A document search method, the method comprising:

Receiving keywords input by the user;

Searching for documents associated with the keyword based on the keyword;

In the search results, the same documents are aggregated and displayed, and the publication sources of each document are displayed;

Wherein the same documents are normalized using the document normalization method described above.

A document normalization device comprising:

An acquisition unit for obtaining documents from more than one website source;

a standardization unit for standardizing the acquired documents;

a first clustering unit, configured to cluster the documents with similar titles according to the similarity of the titles of the standardized documents, to obtain a plurality of document collections;

a first screening unit for calculating the similarity of the documents in each document collection and, according to the calculated similarity of the documents, screening out the qualified document collections;

a second clustering unit, configured to cluster the same documents in the screened-out qualified document collections, and to summarize the publication sources of the same documents.

According to a preferred embodiment of the present invention, the first clustering unit determines the similarity of the titles of the documents in at least one of the following manners:

According to a preferred embodiment of the present invention, the first clustering unit is further configured to: before the similarity of the documents in each document collection is calculated, cluster similar documents according to the similarity of at least one of the author, the publication source, and the publication year of the standardized documents, to obtain a plurality of document collections.

According to a preferred embodiment of the present invention, the first clustering unit determines the similarity of the at least one attribute in at least one of the following manners:

According to a preferred embodiment of the present invention, the device further includes:

a second screening unit, configured to, after the plurality of document collections are obtained and before the similarity of the documents in each document collection is calculated, screen out, based on the Hamming distance between the documents in each document collection, the document collections whose Hamming distance is less than or equal to a preset threshold.

According to a preferred embodiment of the present invention, the first screening unit is specifically configured to: in each document collection, calculate the similarity between the documents in that collection according to the weight set in advance for each document attribute, and determine a document collection in which the similarity between the documents is greater than the preset total score to be a qualified document collection.

According to a preferred embodiment of the present invention, when the second clustering unit performs clustering of the same document on the selected set of qualified documents, the second clustering unit performs:

According to a preferred embodiment of the present invention, the standardization unit is specifically configured to:

Performing word segmentation on the full name of the first author of the document, extracting the first letter of each word, and using the extracted initial combination as the author of the standardized document; or

Unifying the format of the publication source of the document; or,

According to a preferred embodiment of the present invention, when the first clustering unit calculates a signature for the title of the document, the specific execution is:

Determining an n-gram feature of the extracted subtitle, the value of n being a positive integer from 1 to N, the N being a preset positive integer

A document search device, the device comprising:

a receiving unit, configured to receive a keyword input by a user;

a matching unit, configured to search for a document associated with the keyword according to the keyword;

a presentation unit for aggregating and displaying the same documents in the search results and presenting the publication sources of the respective documents, wherein the same documents are normalized using the document normalization device described above.

It can be seen from the above technical solutions that the present invention can accurately aggregate the same documents and clearly indicate their sources. When a user searches for documents, the different publication sources of the same document are brought together and presented to the user, which improves the user experience.

DRAWINGS

FIG. 1 is a schematic diagram of document search in the prior art.

FIG. 2 is a flow chart of a method for normalizing documents according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of standardization of an author in an embodiment of the present invention.

FIG. 4 is a schematic diagram of clustering the same document according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of a search result presentation provided by an embodiment of the present invention.

FIG. 6 is a schematic diagram of signature processing of two titles in the reduce phase in the embodiment of the present invention.

FIG. 7 is a flow chart of another method for normalizing documents according to an embodiment of the present invention.

FIG. 8 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.

FIG. 9 is a schematic structural diagram of an embodiment of the first clustering unit of FIG. 8.

FIG. 10 is a schematic structural diagram of an embodiment of the signature calculation unit of FIG. 8.

Figure 11 is a block diagram showing the structure of an apparatus for searching using the document normalization method.

Detailed description

The present invention will be described in detail below with reference to the drawings and specific embodiments.

Figure 2 is a flow chart showing the first embodiment of the document normalization method of the present invention. As shown in Figure 2, the document normalization method includes:

S10, obtain documents from all website sources.

Specifically, documents are obtained from all websites by way of web crawling.

S11, standardizing the acquired documents.

In an embodiment of the invention, standardization means standardizing the attributes of the document, which include the title, author, abstract, publication source, publication time, and the like.

Specifically, standardization of the title includes word segmentation of the title, unifying full-width and half-width characters, removing punctuation from the title, and the like. For example, a document titled "re:Coagulation and -Flocculation" becomes "re Coagulation and--Flocculation" after title standardization.
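The title standardization steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `standardize_title` is hypothetical, Unicode NFKC normalization is assumed as the way to unify full-width and half-width characters, and replacing each punctuation mark with a space is an assumption about how punctuation is removed.

```python
import unicodedata


def standardize_title(title: str) -> str:
    """Fold full-width characters to half-width and strip punctuation."""
    # NFKC maps full-width letters, digits, spaces and punctuation
    # to their ordinary half-width equivalents.
    folded = unicodedata.normalize("NFKC", title)
    # Assumption: every Unicode punctuation character becomes a space.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in folded
    )
    # Collapse runs of whitespace left behind by the replacement.
    return " ".join(cleaned.split())
```

Case is deliberately left untouched here, because a later step splits titles at uppercase letters.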

Because different websites may abbreviate author names differently, the authors of the documents need to be standardized. The principle of author standardization is to extract the full name of the first author of the document, divide that full name into multiple words, extract the first letter of each word, and finally sort all the extracted initials to form the author corresponding to the document. When the full name is divided into words, a run of uppercase letters abbreviated together is split so that each uppercase letter counts as one word.

FIG. 3 is a schematic diagram of author standardization in the present invention. In this example, the names of one author obtained from the network are Carlos N.Slia, Carlos Nascimento.Slia, and SN Carlos. Carlos N.Slia is divided into three words, Carlos, N, and Slia, whose first letters are C, N, and S. Carlos Nascimento.Slia is divided into Carlos, Nascimento, and Slia, whose first letters are C, N, and S. SN Carlos is divided into S, N, and Carlos, whose first letters are S, N, and C. Sorted alphabetically, all three variants yield CNS.
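The author standardization rule can be sketched as follows. The function name `standardize_author` and the choice of whitespace, periods, and commas as word separators are assumptions made for illustration; the source only states that the name is divided into words and that abbreviated uppercase runs are split letter by letter.

```python
import re


def standardize_author(full_name: str) -> str:
    """Reduce a first author's name to its sorted initials."""
    # Assumption: words are separated by whitespace, periods or commas.
    words = [w for w in re.split(r"[\s.,]+", full_name) if w]
    initials = []
    for word in words:
        if len(word) > 1 and word.isupper():
            # A run of uppercase letters (e.g. "SN") is an abbreviation:
            # each letter counts as its own word.
            initials.extend(word)
        else:
            initials.append(word[0].upper())
    # Sorting makes the result independent of name order.
    return "".join(sorted(initials))
```

Applied to the three variants in FIG. 3, `standardize_author` returns `"CNS"` for each of them.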

The principle of standardization of the abstract is to extract the main body of the abstract, calculate the length of each sentence in the main body, find the longest sentence, and calculate a signature from it as the signature of the document's abstract. In other embodiments, sentences of other lengths may also be used. The signature of the abstract can be calculated using the Message Digest Algorithm (MD5).
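A minimal sketch of the abstract signature follows. The sentence-splitting rule (splitting on Western and Chinese sentence-ending punctuation) and the function name `abstract_signature` are assumptions; the source does not say how sentences are delimited.

```python
import hashlib
import re


def abstract_signature(abstract: str) -> str:
    """Return the MD5 hex digest of the longest sentence in the abstract."""
    # Assumption: sentences end at ., !, ? or their Chinese counterparts.
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", abstract)]
    sentences = [s for s in sentences if s]
    if not sentences:
        return ""
    longest = max(sentences, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()
```

Two copies of an abstract that differ only in their shorter sentences produce the same signature, which is what makes the longest sentence a tolerant fingerprint.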

Publication sources include journals, conferences, and collections. Standardization of the publication source mainly unifies its format, including unifying uppercase and lowercase, deleting symbols, and unifying full-width and half-width characters.

Standardization of publication time includes extracting the year from the publication time. On the web, the publication time of a document appears in a variety of formats, and standardization involves extracting the year from these different formats. For example, if the publication time is given as 1990, 1990-11-11, or 1990/11/11, standardization yields 1990 in each case. Of course, instead of extracting only the year, the formats can be unified into a single expression; for example, 1990-11-11, 1990/11/11, November 11, 1990, and 1990.11.11 are all unified into 1990-11-11.
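Year extraction can be sketched with a regular expression. The function name `extract_year` and the accepted range of years (1500 to 2099) are assumptions chosen for illustration.

```python
import re

# Assumption: a plausible publication year is a standalone
# four-digit number between 1500 and 2099.
_YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def extract_year(publication_time: str):
    """Return the first four-digit year found, or None if there is none."""
    match = _YEAR.search(publication_time)
    return match.group(1) if match else None
```

Each of the formats mentioned above (`1990`, `1990-11-11`, `1990/11/11`, `November 11, 1990`, `1990.11.11`) yields `"1990"`.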

S12. Clustering the documents of similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of first sets. The first set includes at least two documents.

S13: Calculate the similarity of the document in each first set, and select a plurality of first sets that meet the condition according to the similarity of the calculated documents.

Specifically, the weight corresponding to each document attribute is set in advance; the document attributes may be features such as the author, abstract, publication source, and publication time. In each first set, the similarity of the documents in that set is calculated according to the preset weights of the document attributes, and a first set in which the similarity of the documents is greater than the preset total score is determined to be an eligible first set.

For example, suppose a first set contains two documents, the author weight is 4, the abstract weight is 2, the journal weight is 2, the publication time weight is 2, and the preset total score is 5. Document a has the following features: title: a General Stability Result for Viscoelastic Equations with Singular Kernels; author: MM Cavalcanti; journal: missing; publication time: 1999-02-11; abstract signature: b47b61cad59b93c5ad99e8820b71f4db. Document b has the following features: title: a General Stabilities Result for Viscoelastic Equations With Singular Kernels; author: MC Murphy; journal: Journal of Applied & Computational Mathematics; publication time: 1999; abstract signature: b47b61cad59b93c5ad99e8820b71f4db. After standardization, the author of document a is the same as the author of document b (both reduce to the initials CMM), so the author contributes 1*4. The publication source of document a differs from that of document b, so the publication source contributes 0*2. By the same reasoning, the publication year and the abstract signature each match and contribute 1*2. The similarity of the two documents is therefore 1*4+0*2+1*2+1*2=8>5, so document a is considered the same as document b. If document b is in turn the same as document c, then documents a, b, and c are all the same. In this way the same documents are clustered together.
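The weighted comparison in this example can be sketched as follows. The dictionary layout, the names `WEIGHTS`, `TOTAL_SCORE`, and `weighted_similarity`, and the rule that a missing attribute never matches are assumptions; the weights and threshold are the ones given in the example, and the attribute values are shown after standardization.

```python
# Weights and threshold taken from the worked example.
WEIGHTS = {"author": 4, "source": 2, "year": 2, "abstract_sig": 2}
TOTAL_SCORE = 5


def weighted_similarity(doc_a: dict, doc_b: dict, weights=WEIGHTS) -> int:
    """Sum the weights of the attributes on which two documents agree."""
    score = 0
    for attribute, weight in weights.items():
        a, b = doc_a.get(attribute), doc_b.get(attribute)
        # Assumption: a missing value (None) counts as a mismatch.
        if a is not None and a == b:
            score += weight
    return score


doc_a = {"author": "CMM", "source": None, "year": "1999",
         "abstract_sig": "b47b61cad59b93c5ad99e8820b71f4db"}
doc_b = {"author": "CMM",
         "source": "journal of applied computational mathematics",
         "year": "1999",
         "abstract_sig": "b47b61cad59b93c5ad99e8820b71f4db"}

same_document = weighted_similarity(doc_a, doc_b) > TOTAL_SCORE
```

Here the score is 4 + 0 + 2 + 2 = 8, which exceeds the preset total score of 5, so `same_document` is `True`.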

S14: Clustering the same documents in the plurality of eligible first sets that are selected, and summarizing the publication sources of the same documents. Links to the publication sources of the same literature can be summarized.

Specifically, a key-value pair forming process is performed separately on each screened-out eligible first set. The key-value pair forming process in an eligible first set includes: using each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs; clustering the key-value pairs with the same key into one set, based on all the key-value pairs obtained; and performing the key-value pair forming process again on each obtained set, until a preset number of iterations is reached. The preset number of iterations is an empirical value.

The MapReduce model may be used to cluster the same documents in the screened-out eligible first sets. Specifically, each screened-out eligible first set is used as input to the map stage, which outputs the key-value pairs corresponding to that first set. All key-value pairs corresponding to the screened-out eligible first sets are sorted, and the sorted key-value pairs are used as the input data of the reduce stage. In the reduce stage, key-value pairs with the same key are clustered into one set, so the reduce stage outputs multiple sets; the documents in each set again form multiple key-value pairs, which serve as the input of the reduce stage. This is iterated until the preset number of iterations is reached, at which point the same documents in the screened-out eligible first sets are aggregated into one class, and the class includes all the publication sources of the document.

For example, as shown in FIG. 4, suppose each screened-out eligible first set includes two documents and the screened-out eligible first sets are (a, b), (b, c), and (d, f). The key-value pairs output for these sets in the map stage are ab, ba, bc, cb, df, and fd. After the key-value pairs are sorted by key (ab, ba, bc, cb, df, fd), the reduce stage outputs [a b], [a b c], [c b], [d f], and [f d]. Following the method above, the documents in [a b] are then combined pairwise into key-value pairs, namely ab and ba; likewise, the documents in [a b c], [c b], [d f], and [f d] are combined pairwise into key-value pairs, which serve as input to the map stage. After several such iterations, (a, b, c) is obtained as one class and (d, f) as another.
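The iterative key-value clustering of this example can be simulated in plain Python, without a MapReduce framework. The function names `map_stage`, `reduce_stage`, and `cluster_same_documents` are hypothetical, and running both stages in memory is a simplification of the distributed process described above.

```python
from collections import defaultdict
from itertools import permutations


def map_stage(groups):
    """Emit every ordered (key, value) pair of documents within each group."""
    pairs = set()
    for group in groups:
        pairs.update(permutations(group, 2))
    return sorted(pairs)  # sorted by key, as before the reduce stage


def reduce_stage(pairs):
    """Gather each key together with all of its values into one set."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].update((key, value))
    return [frozenset(members) for members in grouped.values()]


def cluster_same_documents(groups, iterations=3):
    """Repeat map and reduce for a preset number of iterations."""
    current = [frozenset(g) for g in groups]
    for _ in range(iterations):
        current = reduce_stage(map_stage(current))
    # Deduplicate: after convergence every key yields the same full class.
    return sorted({tuple(sorted(g)) for g in current})
```

For the input `[("a", "b"), ("b", "c"), ("d", "f")]`, the first iteration produces the sets [a b], [a b c], [b c], [d f], [d f], and later iterations merge them into the two classes (a, b, c) and (d, f). The number of iterations needed grows with the length of the longest chain of pairwise matches, which is presumably why the source treats it as an empirical value.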

Further, in other embodiments, a search method using the document normalization method of this embodiment includes: receiving a keyword input by a user; matching, according to the keyword, all documents associated with the keyword; and sending all the associated documents, together with the publication sources of each associated document, to the user. Specifically, a link to each publication source of each associated document is displayed to the user. In this way, the different publication source links of the same article are brought together for the user, which improves the user experience. FIG. 5 is a schematic diagram in which the publication sources of the same document are gathered together. Compared with the content shown in FIG. 1, the identical copies of the document "simulation study on angle measurement accuracy of star sensor" are aggregated in FIG. 5, and links to all sources of the document are presented to the user. As framed by the box in FIG. 5, the sources of this document include ResearchGate, SPIE, and reviews.spiedigita; the identical documents from these sources are displayed in aggregate, and the sources are shown for the user to choose from.
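The aggregated presentation of search results can be sketched as below. Everything about the data layout is an assumption made for illustration: each record is taken to carry a hypothetical `cluster_id` assigned by the normalization step, and keyword matching is reduced to a case-insensitive substring test on the title.

```python
def search(keyword: str, records: list) -> list:
    """Return one result per normalized document, listing all its sources."""
    results = {}
    needle = keyword.lower()
    for record in records:
        if needle not in record["title"].lower():
            continue
        # Records sharing a cluster_id were judged to be the same document.
        entry = results.setdefault(
            record["cluster_id"],
            {"title": record["title"], "sources": []},
        )
        if record["source"] not in entry["sources"]:
            entry["sources"].append(record["source"])
    return list(results.values())
```

Each returned entry shows a document once, with every publication source gathered beneath it, rather than repeating the document once per source.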

Preferably, as an implementation of S12, the similarity of the titles of the documents is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents, and S12 includes:

S120. Calculate a signature of the title of the document according to the title of the standardized document.

S121, according to the signature of the title of each document, the documents with similar titles are clustered to obtain a plurality of first clusters. The first cluster includes at least two documents.

Specifically, a key-value pair forming process is performed on the signature of each title: the signature of the title is first divided into T blocks, where T is a preset value; each block of the signature is used as a key and the full signature of the title is used as the value, so that the title corresponds to T key-value pairs. In this way, every title corresponds to T key-value pairs. When at least one key is the same among the T key-value pairs of two titles, the documents corresponding to the two titles are clustered into one first cluster and output.

Documents with similar titles can be clustered using the MapReduce model to obtain the plurality of first clusters. The MapReduce model includes a map phase and a reduce phase: input data is processed by map and then by reduce to produce the output data, and the output of the map phase takes the form of key-value pairs. The T blocks of each title signature are used as input to the map stage, which outputs the T key-value pairs corresponding to each title. In the reduce phase, when at least one key is the same among the T key-value pairs of two titles, the documents corresponding to the two titles are clustered into one first cluster and output.

For example, as shown in FIG. 6, the signature of the title of document a is 1111111000100100, which is divided into four blocks: 1111, 1110, 0010, and 0100. The signature of the title of document b is 1101111000000000, which is divided into four blocks: 1101, 1110, 0000, and 0000. As can be seen in FIG. 6, the second block of the signature of document a's title is identical to the second block of the signature of document b's title, so document a and document b are clustered into one first cluster.
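The block-based grouping can be sketched as follows, using the 16-bit signatures formed by concatenating the four blocks listed for each document in FIG. 6. The function names are hypothetical. One point is an assumption: the key here is the pair (block position, block value), so that only blocks at the same position can match, as in the FIG. 6 example where the second blocks coincide; the source does not state whether position is part of the key.

```python
from collections import defaultdict
from itertools import combinations


def split_blocks(signature: str, t: int) -> list:
    """Divide a bit-string signature into t equal-length blocks."""
    size = len(signature) // t
    return [signature[i * size:(i + 1) * size] for i in range(t)]


def candidate_pairs(signatures: dict, t: int = 4) -> set:
    """Pair up documents whose signatures share at least one block."""
    buckets = defaultdict(set)
    for doc_id, signature in signatures.items():
        for position, block in enumerate(split_blocks(signature, t)):
            # Assumption: the key includes the block position.
            buckets[(position, block)].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs
```

With `{"a": "1111111000100100", "b": "1101111000000000"}`, documents a and b share the second block `1110` and are returned as a candidate pair. Bucketing by block avoids comparing every signature against every other signature before the Hamming distance is computed.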

S122. Calculate a Hamming distance between documents in each first cluster according to the signature of the title of the document in each first cluster.

If two title signatures differ at one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2; and so on. For example, the signature of the title of document a is 1111111000100100 and the signature of the title of document b is 1101111000000000. The two signatures differ at the 3rd, 11th, and 14th bits, so the Hamming distance between document a and document b is 3.

S123: Screen out the first clusters whose Hamming distance is less than or equal to a preset threshold; the screened-out eligible first clusters serve as the plurality of first sets obtained by clustering documents with similar titles.
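The Hamming distance computation is short enough to state directly. The sketch below treats signatures as equal-length bit strings; the function names are hypothetical, and document a's signature is taken as the 16-bit concatenation of its four blocks from FIG. 6.

```python
def differing_positions(sig_a: str, sig_b: str) -> list:
    """Return the 1-based positions at which two signatures differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return [i for i, (x, y) in enumerate(zip(sig_a, sig_b), start=1) if x != y]


def hamming_distance(sig_a: str, sig_b: str) -> int:
    """Count the positions at which two signatures differ."""
    return len(differing_positions(sig_a, sig_b))
```

For `1111111000100100` and `1101111000000000`, the differing positions are 3, 11, and 14, giving a distance of 3.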

In other embodiments, the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.

Preferably, as an implementation of S120, S120 includes:

(1) Divide the title of the document into multiple subtitles, for example by splitting at uppercase letters. Calculate the length of each subtitle, and extract the subtitles whose length is greater than the preset length.

For example, the preset length is 10 characters, and the standardized title is R Genre classification via an lz78-based string kernel, and then divided into R and Genre classification via an lz78-based string kernel. R is 1 character and its length is less than 10 characters, so R is excluded.
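The subtitle extraction in this example can be sketched as follows. Splitting immediately before each uppercase letter is one reading of "dividing at uppercase letters", and the function name `long_subtitles` is hypothetical.

```python
import re


def long_subtitles(title: str, preset_length: int = 10) -> list:
    """Split a title before each uppercase letter and keep the long parts."""
    # Zero-width split: each uppercase letter starts a new subtitle
    # (splitting on empty matches requires Python 3.7 or later).
    parts = re.split(r"(?=[A-Z])", title)
    subtitles = [p.strip() for p in parts if p.strip()]
    return [s for s in subtitles if len(s) > preset_length]
```

For the title `R Genre classification via an lz78-based string kernel`, the one-character subtitle `R` is dropped and only `Genre classification via an lz78-based string kernel` is kept.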

(2) determining an n-gram feature of the extracted subtitle, the value of n being from 1 to N, and the value of N is set according to the length of the extracted subtitle.

For example, if the title is "A B C" and N is 3, the n-gram features of the title are [A, B, C, AB, BC, ABC].

(3) Calculate the signature of the title of the document based on the determined n-gram features. The signature can be calculated using the simhash algorithm, which yields a fixed-length signature consisting of 0s and 1s, for example a 64-bit or a 16-bit signature.
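The n-gram feature construction in this example can be sketched as follows. Treating the n-grams as word-level and joining the words without a separator are assumptions chosen to reproduce the notation of the example; the function name `ngram_features` is hypothetical.

```python
def ngram_features(words: list, max_n: int) -> list:
    """List all contiguous word n-grams for n = 1 .. max_n."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            # Join without a separator to match the example's notation.
            features.append("".join(words[start:start + n]))
    return features
```

Calling `ngram_features("A B C".split(), 3)` returns `["A", "B", "C", "AB", "BC", "ABC"]`.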
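A simhash signature over the n-gram features can be sketched as below. The source names simhash but gives no details, so the specifics here are assumptions: MD5 is used as the per-feature hash, every feature has weight 1, and the signature is returned as a bit string of at most 128 bits.

```python
import hashlib


def simhash(features: list, bits: int = 64) -> str:
    """Compute a simhash bit string from a list of string features."""
    if not 1 <= bits <= 128:
        raise ValueError("bits must be between 1 and 128 for an MD5-based hash")
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            counts[i] += 1 if (value >> i) & 1 else -1
    # A position becomes 1 when the positive votes win; the string is
    # written with the highest bit position first.
    return "".join("1" if counts[i] > 0 else "0" for i in reversed(range(bits)))
```

Because each bit is decided by a vote over all features, titles that share most of their n-grams tend to differ in only a few bits, which is what makes the Hamming distance between signatures a usable similarity measure.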

The first embodiment of the document normalization method described above still has a deficiency: because of differences in how websites collect, edit, and encode documents, the titles of the same document at its various publication sources may differ considerably. To overcome this problem, in addition to the title of the document, the first author, the publication source, and the publication time of the document are used in parallel for clustering, to compensate for the shortcomings of clustering by title alone.

FIG. 7 is a schematic flowchart of Embodiment 2 of the document normalization method of the present invention. As shown in FIG. 7, the document normalization method includes:

S20, obtain documents from all website sources.

Specifically, documents are obtained from all websites by way of web crawling.

S21, standardizing the acquired documents.

Specifically, the standardization of the title includes the segmentation of the title, the unification of the full-width half-width, the removal of the punctuation of the title, and the like.

Because different websites may abbreviate author names differently, the authors of the documents need to be standardized. The principle of author standardization is to extract the full name of the first author of the document, divide that full name into multiple words, extract the first letter of each word, and finally sort all the extracted initials to form the author corresponding to the document. When the full name is divided into words, a run of uppercase letters abbreviated together is split so that each uppercase letter counts as one word.

Standardization of publication time includes extracting the year from the publication time. On the web, the publication time of a document appears in a variety of time formats, and standardization of publication time involves extracting the year from these different formats. Of course, instead of extracting only the year, the various formats may also be unified into a single expression.

S22. According to the similarity of the titles of the standardized documents, cluster documents with similar titles to obtain a plurality of first sets; and according to the similarity of the first author, the publication source, and the publication year of the standardized documents, cluster similar documents to obtain a plurality of second sets.

S23. Calculate the similarity of the documents in each first set and, according to the calculated similarity, screen out a plurality of eligible first sets; likewise, calculate the similarity of the documents in each second set and, according to the calculated similarity, screen out a plurality of eligible second sets.

Specifically, a weight corresponding to each document attribute is set in advance; the document attributes may be features such as the author, the abstract, the publication source, and the publication time. In each first set and each second set, the similarity between the documents is calculated according to the preset weights of the document attributes, and a first set or second set in which the similarity between the documents is greater than a preset total score is determined to be an eligible first set or second set.
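
The weighted screening can be illustrated with the sketch below. The text does not specify how the similarity of a single attribute is measured, so this example assumes an exact match on the standardized attribute value and requires every pair of documents in a set to exceed the total score. The weights, attribute names, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical integer weights per standardized attribute.
WEIGHTS = {"author": 3, "abstract_sig": 4, "source": 2, "year": 1}


def pair_score(a: dict, b: dict, weights: dict = WEIGHTS) -> int:
    """Sum the weights of the attributes on which two documents agree."""
    return sum(
        w for attr, w in weights.items()
        if a.get(attr) is not None and a.get(attr) == b.get(attr)
    )


def is_eligible(doc_set: list, total_score: int, weights: dict = WEIGHTS) -> bool:
    """A set is kept when every pair of documents scores above total_score."""
    return all(
        pair_score(a, b, weights) > total_score
        for a, b in combinations(doc_set, 2)
    )


d1 = {"author": "CMM", "abstract_sig": "x1", "source": "JACM", "year": 1999}
d2 = {"author": "CMM", "abstract_sig": "x1", "source": "ARXIV", "year": 1999}
d3 = {"author": "AB", "abstract_sig": "zz", "source": "JACM", "year": 2001}
print(is_eligible([d1, d2], total_score=6), is_eligible([d1, d3], total_score=6))
```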

S24: Cluster the same documents in the plurality of eligible first sets and the plurality of eligible second sets that were selected, and aggregate the publication sources of the same documents. The links to the publication sources of the same document can be aggregated.

Specifically, a key-value pair forming process is performed separately for each selected eligible first set and second set. The key-value pair forming process in an eligible first set or second set includes: taking each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs; and, over all the key-value pairs obtained, clustering the key-value pairs having the same key into one set. The key-value pair forming process is then performed again on each set so obtained, until a preset number of iterations is reached, whereby the same documents in the selected eligible first sets and second sets are aggregated into one class. The preset number of iterations is an empirical value.
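
One way to read the iterative key-value process is as repeated merging of sets that share a document. The sketch below follows that reading under stated assumptions: documents are represented by hypothetical string identifiers, the key itself is kept inside its merged set for convenience, and only maximal sets are returned at the end. It is an interpretation for illustration, not the applicant's implementation.

```python
from collections import defaultdict


def merge_same_documents(doc_sets, iterations=3):
    """Merge sets that share a document, repeating a fixed number of times."""
    sets = [frozenset(s) for s in doc_sets]
    for _ in range(iterations):
        grouped = defaultdict(set)
        for s in sets:
            for key in s:
                # Each document acts as a key; pairs with the same key are
                # clustered together (the key is kept in its own set here).
                grouped[key] |= s
        sets = list({frozenset(g) for g in grouped.values()})
    # Keep only sets that are not strictly contained in another set.
    maximal = [s for s in sets if not any(s < t for t in sets)]
    return sorted(sorted(s) for s in maximal)


first_sets = [{"a", "b"}, {"d", "e"}]   # e.g. clustered by title
second_sets = [{"b", "c"}]              # e.g. clustered by author/source/year
print(merge_same_documents(first_sets + second_sets))
```

A document that appears in both a first set and a second set ("b" above) bridges the two, so the three versions end up in one class.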

The mapreduce model may be used to cluster the same documents in the selected eligible first sets and second sets. Specifically, each selected eligible first set and second set is used as input of the map stage, and the map stage outputs the key-value pairs corresponding to each eligible first set and second set. All the key-value pairs are sorted, and the sorted key-value pairs are used as input data of the reduce stage, in which the key-value pairs having the same key are clustered into one set, so that the reduce stage outputs multiple sets. The documents in each output set again form key-value pairs, which are used as input of the reduce stage. This is iterated multiple times until the preset number of iterations is reached, whereby the same documents in the selected eligible first sets and second sets are aggregated into one class, and the class includes all publication sources of the document.

Further, in other embodiments, a search method using the document normalization method of this embodiment includes: receiving a keyword input by a user; matching, according to the keyword, all the documents associated with the keyword; and sending all the associated documents and the publication sources of each associated document to the user. Specifically, a link to each publication source of each associated document is displayed to the user. In this way, the different publication source links of the same document are presented to the user together, which improves the user experience.
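
The aggregated presentation of search results could look roughly like the sketch below. It assumes that normalization has already assigned a `cluster_id` to every record and that keyword matching is a simple substring test on the title; the field names, the function name `search`, and the example URLs are hypothetical.

```python
from collections import defaultdict


def search(keyword: str, documents: list) -> list:
    """Group matching records by cluster and collect their source links."""
    grouped = defaultdict(lambda: {"title": None, "sources": []})
    for doc in documents:
        if keyword.lower() in doc["title"].lower():
            entry = grouped[doc["cluster_id"]]
            # Show the first title seen for the cluster.
            entry["title"] = entry["title"] or doc["title"]
            entry["sources"].append(doc["source_url"])
    return list(grouped.values())


docs = [
    {"cluster_id": 1, "title": "Simhash for Near-Duplicate Detection",
     "source_url": "https://a.example.org/1"},
    {"cluster_id": 1, "title": "Simhash for near-duplicate detection",
     "source_url": "https://b.example.org/9"},
    {"cluster_id": 2, "title": "Graph Clustering",
     "source_url": "https://a.example.org/2"},
]
print(search("simhash", docs))
```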

Preferably, as an embodiment of S22, the similarity of the titles of the documents is determined according to the similarity between the signatures of the titles and the Hamming distance between the titles. The similarity of the author, the publication source, and the publication year is determined by merging the author, the publication source, and the publication year of the standardized document into a character string, calculating the signature of the merged character string, and then using the similarity between the signatures of the merged character strings and the Hamming distance between the merged character strings. In this case, S22 includes:

S220: Calculate the signature of the title of the document according to the title of the standardized document, and combine the first author of the standardized document, the publication source and the publication year into a character string, and calculate the signature of the merged character string.

For example, if the first author of a document is MC Murphy, the publication source is the Journal of Applied Computational Mathematics, and the publication year is 1999, the merged character string is MC Murphy/Journal of Applied Computational Mathematics/1999.

S221, according to the signature of the title of each document, clustering the documents with similar titles to obtain a plurality of first clusters, and, according to the signature of the merged character string of each document, clustering the documents with similar merged character strings to obtain a plurality of second clusters. Each first cluster or second cluster includes at least two documents.

Specifically, a key-value pair forming process is performed on the signature of each title: the signature of the title is first divided into T blocks, where T is a preset value; each block is used as a key and the signature of the title is used as the value, so that each title corresponds to T key-value pairs. When at least one key is the same among the T key-value pairs corresponding to two titles, the documents corresponding to the two titles are clustered and output as a first cluster. Similarly, the above method is performed on the signature of the merged character string of each document: when at least one key is the same among the T key-value pairs corresponding to two merged character strings, the documents corresponding to the two merged character strings are clustered and output as a second cluster.
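
The block-based grouping can be sketched as follows, assuming 64-bit integer signatures and T = 4. The text uses each block as the key; this sketch additionally includes the block position in the key, so that equal bit patterns at different positions do not collide, which is an assumption of the example. Function names are hypothetical. Splitting into T blocks guarantees that two signatures differing in fewer than T bits share at least one block.

```python
from collections import defaultdict
from itertools import combinations


def block_keys(signature: int, bits: int = 64, t: int = 4) -> list:
    """Split a signature into t equal blocks and return (index, block) keys."""
    width = bits // t
    mask = (1 << width) - 1
    return [(i, (signature >> (i * width)) & mask) for i in range(t)]


def candidate_pairs(signatures: dict, bits: int = 64, t: int = 4) -> set:
    """Return pairs of document ids that share at least one block key."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for key in block_keys(sig, bits, t):
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


sigs = {
    "a": 0x1111222233334444,
    "b": 0x1111222233334445,  # differs from "a" in one bit
    "c": 0xFFFFEEEEDDDDCCCC,
}
print(candidate_pairs(sigs))
```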

The documents with similar titles can be clustered using the mapreduce model to obtain the plurality of first clusters. The mapreduce model includes a map stage and a reduce stage: the input data is processed by map and then by reduce to obtain the output data, and the output of the map stage is in the form of key-value pairs. The T blocks of each title signature are used as input of the map stage, which outputs the T key-value pairs corresponding to each title. In the reduce stage, when at least one key is the same among the T key-value pairs corresponding to two titles, the documents corresponding to the two titles are clustered and output as a first cluster. Similarly, the mapreduce model can be used to cluster documents with similar merged character strings to obtain the plurality of second clusters.
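
The map, sort, and reduce stages can be imitated in a single process to show the data flow. The sketch below does not use any MapReduce framework; `map_phase`, `reduce_phase`, and `run` are hypothetical names, signatures are assumed to be 64-bit integers split into T = 4 blocks, and the block position is included in the key as an assumption of the example.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(doc_id, signature, bits=64, t=4):
    """Emit one (block key, (doc id, signature)) pair per block."""
    width = bits // t
    mask = (1 << width) - 1
    for i in range(t):
        yield (i, (signature >> (i * width)) & mask), (doc_id, signature)


def reduce_phase(key, values):
    """Output a cluster when at least two documents share the key."""
    docs = sorted(values)
    if len(docs) >= 2:
        yield key, docs


def run(signatures):
    emitted = [kv for doc_id, sig in signatures.items()
               for kv in map_phase(doc_id, sig)]
    emitted.sort(key=itemgetter(0))  # stands in for the shuffle/sort step
    clusters = []
    for key, group in groupby(emitted, key=itemgetter(0)):
        clusters.extend(reduce_phase(key, [value for _, value in group]))
    return clusters


for key, docs in run({"a": 0x10002, "b": 0x10003}):
    print(key, [doc_id for doc_id, _ in docs])
```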

S222. Calculate the Hamming distance between the documents in each first cluster according to the signatures of the titles of the documents in that cluster, and calculate the Hamming distance between the documents in each second cluster according to the signatures of the merged character strings of the documents in that cluster.

If two title signatures differ at exactly one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2; and so on. The same applies to the signatures of two merged character strings.
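
For signatures stored as integers, the Hamming distance is the number of set bits in their bitwise XOR. A minimal sketch, with a hypothetical function name:

```python
def hamming_distance(sig_a: int, sig_b: int) -> int:
    """Count the bit positions at which two signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


print(hamming_distance(0b1011, 0b1001))  # one differing position
print(hamming_distance(0b1011, 0b0001))  # two differing positions
```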

S223: Select the first clusters whose Hamming distance is less than or equal to a preset threshold; the eligible first clusters so selected are the plurality of first sets obtained by clustering documents with similar titles. Likewise, select the second clusters whose Hamming distance is less than or equal to the preset threshold; the eligible second clusters so selected are the plurality of second sets obtained by clustering documents with similar merged character strings.
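
The threshold screening might be sketched as below. It assumes that a cluster is a tuple of document identifiers, that signatures are integers, and that a cluster is kept only when every pair of its documents is within the threshold; the threshold value 3 and the name `filter_clusters` are hypothetical.

```python
from itertools import combinations


def filter_clusters(clusters, signatures, threshold=3):
    """Keep clusters whose pairwise Hamming distances are all <= threshold."""
    kept = []
    for cluster in clusters:
        if all(
            bin(signatures[a] ^ signatures[b]).count("1") <= threshold
            for a, b in combinations(cluster, 2)
        ):
            kept.append(cluster)
    return kept


sigs = {"a": 0b1111, "b": 0b1110, "c": 0b0000}
print(filter_clusters([("a", "b"), ("a", "c")], sigs, threshold=1))
```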

In other embodiments, the similarity of the titles of the documents may be determined based on either the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents. Likewise, the similarity of the author, the publication source, and the publication year may be determined based on either the similarity between the signatures of the merged character strings of the documents or the Hamming distance between the merged character strings of the documents.

Preferably, as an implementation of S220, S220 further includes:

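(1) Dividing the title of the document into multiple subtitles, calculating the length of each subtitle, and extracting the subtitles whose length is greater than the preset length.

(2) Dividing the merged string into multiple substrings, calculating the length of each substring, and extracting the substrings whose length is greater than the preset length.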

(3) Determine the n-gram features of the extracted subtitles, where n ranges from 1 to N and N is set according to the length of the extracted subtitle.

(4) Determine the n-gram feature of the extracted substring.

(5) Calculate the signature of the title of the document based on the determined n-gram features. The signature of the title can be calculated using the simhash algorithm, and the calculated signature is an n-bit signature consisting of 0s and 1s.

(6) Calculate the signature of the merged character string of the document based on the n-gram features of the extracted substrings. The simhash algorithm can be used to calculate the signature of the merged string, and the calculated signature is an n-bit signature consisting of 0s and 1s.
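
The n-gram and simhash steps listed above can be illustrated with the following sketch. The text does not say whether n-grams are taken over words or characters, nor which hash feeds simhash; this example assumes word-level n-grams, equal feature weights, and the first 8 bytes of an MD5 digest as the per-feature hash. `ngrams` and `simhash` are hypothetical names, and `bits` must be a multiple of 8 no larger than 128 here.

```python
import hashlib


def ngrams(tokens: list, max_n: int) -> list:
    """All word n-grams for n = 1 .. max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def simhash(features: list, bits: int = 64) -> int:
    """Bitwise majority vote over the hashes of all features."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the signature is 1 when more features voted 1 than 0.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


feats = ngrams(["document", "normalization", "method"], max_n=2)
print(feats)
print(format(simhash(feats), "064b"))
```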

Embodiment 3

Figure 8 is a schematic block diagram showing the construction of an embodiment of the apparatus of the present invention. Referring to FIG. 8, the apparatus includes: an obtaining unit 100, a normalization unit 101, a first clustering unit 102, a first screening unit 103, and a second clustering unit 104.

The obtaining unit 100 is configured to obtain documents of all website sources.

Specifically, documents are obtained from all websites by way of web crawling.

The normalization unit 101 is for standardizing the acquired documents.

In an embodiment of the present invention, the standardization is to standardize attributes of a document, and the attributes of the document include a title, an author, an abstract, a publication source, a publication time, and the like.

Specifically, the normalization of the title by the normalization unit 101 includes segmentation of the title, unification of full-width and half-width characters, removal of the punctuation of the title, and the like.

The principle by which the normalization unit 101 standardizes the author is to extract the full name of the first author of the document, divide the full name into a plurality of words, extract the initial of each word, and finally sort all the extracted initials and use the sorted initials as the author of the document. When the full name of the first author is divided into words, multiple uppercase letters abbreviated together are divided so that each uppercase letter counts as one word.

The principle by which the normalization unit 101 standardizes the abstract is to extract the main part of the abstract, calculate the length of each sentence in the main part, find the longest sentence, and calculate the signature of the abstract of the document from that sentence. In other embodiments, sentences of other lengths may also be used. The signature of the abstract of the document can be calculated using the Message-Digest Algorithm 5 (MD5).
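
A hedged sketch of the abstract signature follows. It treats the whole abstract as its main part and splits sentences on common Western and Chinese terminators, both of which are simplifying assumptions; the function name `abstract_signature` is hypothetical. MD5 is used here only as a fingerprint, as in the text, not for any security purpose.

```python
import hashlib
import re


def abstract_signature(abstract: str) -> str:
    """MD5 hex digest of the longest sentence of the abstract (sketch)."""
    sentences = [
        s.strip() for s in re.split(r"[.!?。！？]+", abstract) if s.strip()
    ]
    if not sentences:
        return ""
    longest = max(sentences, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()


print(abstract_signature("Short one. This is the longest sentence here. Tiny."))
```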

Publication sources include journals, conferences, and collections. The normalization unit 101 standardizes the publication source mainly by unifying its format, including unifying capitalization, deleting symbols, and unifying full-width and half-width characters.
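
Format unification of a publication source could be sketched as below. Uppercasing as the unified capitalization, NFKC folding for full-width and half-width unification, and collapsing repeated spaces are assumptions of this example; `normalize_source` is a hypothetical name.

```python
import unicodedata


def normalize_source(name: str) -> str:
    """Unify width and case, and drop punctuation and symbols (sketch)."""
    folded = unicodedata.normalize("NFKC", name)
    kept = "".join(
        ch for ch in folded
        if unicodedata.category(ch)[0] not in ("P", "S")
    )
    # Uppercase and collapse runs of whitespace left by removed symbols.
    return " ".join(kept.upper().split())


print(normalize_source("Journal of Applied  Computational Mathematics."))
print(normalize_source("ＩＥＥＥ Trans. on Pattern Analysis & Machine Intelligence"))
```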

The normalization of the publication time by the normalization unit 101 includes extracting the year from the publication time. On the network, the publication time of a document appears in various time formats, and the normalization unit 101 can extract the year from these different formats. Of course, instead of extracting only the year, the publication times may also be unified into the same format.

The first clustering unit 102 is configured to cluster the documents of similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of first sets. The first set includes at least two documents.

The first screening unit 103 is configured to calculate the similarity of the documents in each first set and, according to the calculated similarity, screen out a plurality of eligible first sets.

Specifically, the first screening unit 103 is configured to: preset a weight corresponding to each document attribute, where the document attributes may be the author, the abstract, the publication source, the publication time, and the like; and, in each first set, calculate the similarity between the documents according to the preset weights of the document attributes, and determine a first set in which the similarity between the documents is greater than the preset total score to be an eligible first set.

The second clustering unit 104 is configured to cluster the same documents in the selected plurality of eligible first sets, and aggregate the publication sources of the same documents. The links to the publication sources of the same document can be aggregated.

Specifically, the second clustering unit 104 is configured to perform a key-value pair forming process for each selected eligible first set. The key-value pair forming process in an eligible first set includes: taking each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs; and, over all the key-value pairs obtained, clustering the key-value pairs having the same key into one set. The key-value pair forming process is then performed again on each set so obtained, until the preset number of iterations is reached, whereby the same documents in the selected plurality of eligible first sets are aggregated into one class.

The mapreduce model may be used to cluster the same documents in the selected plurality of eligible first sets. Specifically, each selected eligible first set is used as input of the map stage, and the map stage outputs the key-value pairs corresponding to each eligible first set. All the key-value pairs corresponding to the selected first sets are sorted, and the sorted key-value pairs are used as input data of the reduce stage, in which the key-value pairs having the same key are clustered into one set, so that the reduce stage outputs multiple sets. The documents in each output set again form a plurality of key-value pairs, which are used as input of the reduce stage. This is iterated multiple times until the preset number of iterations is reached, whereby the same documents in the selected plurality of eligible first sets are aggregated into one class, and the class includes all publication sources of the document.

FIG. 9 is a schematic structural diagram of an embodiment of the first clustering unit 102 of the present invention. The first clustering unit 102 includes a signature calculation unit 1020, a signature clustering unit 1021, a distance calculation unit 1022, and a second screening unit 1023.

The signature calculation unit 1020 is configured to calculate a signature of the title of the document based on the title of the standardized document.

The signature clustering unit 1021 is configured to cluster two documents with similar titles according to the signature of the title of each document to obtain a plurality of first clusters. The first cluster includes at least two documents.

Specifically, the signature clustering unit 1021 is configured to perform a key-value pair forming process on the signature of each title: the signature of the title is first divided into T blocks, where T is a preset value; each block is used as a key and the signature of the title is used as the value, so that each title corresponds to T key-value pairs. When at least one key is the same among the T key-value pairs corresponding to two titles, the documents corresponding to the two titles are clustered and output as a first cluster.

The distance calculation unit 1022 is configured to calculate a Hamming distance between documents in each of the first clusters based on the signature of the title of the document in each of the first clusters.

The second screening unit 1023 is configured to select the first clusters whose Hamming distance is less than or equal to a preset threshold; the eligible first clusters so selected are the plurality of first sets obtained by clustering documents with similar titles.

In the above embodiment, the similarity of the titles of the documents is determined based on the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents. In other embodiments, the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.

FIG. 10 is a schematic structural diagram of an embodiment of the signature calculation unit of the present invention. The signature calculation unit 1020 includes an extracting unit 10201, a determining unit 10202, and a calculating unit 10203.

The extracting unit 10201 is configured to divide the title of the document into a plurality of subtitles (for example, the title may be divided at uppercase letters), calculate the length of each subtitle, and extract the subtitles whose length is greater than the preset length.
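
One possible reading of the subtitle extraction is sketched below: a new subtitle starts at each uppercase ASCII letter, and only subtitles longer than a preset length are kept. The split rule, the preset length of 3, and the name `extract_subtitles` are assumptions of this example. Splitting on a zero-width match requires Python 3.7 or later.

```python
import re


def extract_subtitles(title: str, min_length: int = 3) -> list:
    """Split a title before each uppercase letter and keep the longer parts."""
    parts = re.split(r"(?=[A-Z])", title)
    subtitles = [p.strip() for p in parts if p.strip()]
    return [s for s in subtitles if len(s) > min_length]


print(extract_subtitles("Document Normalization for Search"))
print(extract_subtitles("A Survey of Hashing"))
```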

The determining unit 10202 is configured to determine the n-gram features of the extracted subtitles, where n ranges from 1 to N and N is set according to the length of the extracted subtitle.

The calculating unit 10203 is configured to calculate the signature of the title of the document according to the determined n-gram features. The signature of the title can be calculated using the simhash algorithm, and the calculated signature is an n-bit signature consisting of 0s and 1s, for example a 64-bit signature or a 16-bit signature.

Embodiment 4

Referring to FIG. 8, the obtaining unit 100, the normalization unit 101, the first clustering unit 102, the first screening unit 103, and the second clustering unit 104 of the apparatus are also used in the fourth embodiment. The details are as follows:

Specifically, documents are obtained from all websites by way of web crawling.

The normalization unit 101 is for standardizing the acquired documents.

Specifically, the normalization unit 101 is used for normalization of the title, including segmentation of the title, unification of the full-width half-width, removal of the punctuation of the title, and the like.

The normalization of the publication time by the normalization unit 101 includes extracting the year from the publication time. On the network, the publication time of a document appears in various time formats, and the normalization unit 101 can extract the year from these different formats. Of course, instead of extracting only the year, the publication times may also be unified into the same format.

The first clustering unit 102 is configured to cluster the documents with similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of first sets, and, in parallel, to cluster similar documents according to the similarity of the first author, the publication source, and the publication year of the standardized documents to obtain a plurality of second sets.

The first screening unit 103 is configured to calculate the similarity of the documents in each first set and select a plurality of eligible first sets according to the calculated similarity, and to calculate the similarity of the documents in each second set and select a plurality of eligible second sets according to the calculated similarity.

Specifically, the first screening unit 103 is configured to: in each first set and each second set, calculate the similarity between the documents according to the preset weights of the document attributes, and determine a first set or second set in which the similarity between the documents is greater than the preset total score to be an eligible first set or second set.

The second clustering unit 104 is configured to cluster the same documents in the plurality of eligible first sets and the plurality of eligible second sets that were selected, and aggregate the publication sources of the same documents. The links to the publication sources of the same document can be aggregated.

Specifically, the second clustering unit 104 is configured to perform a key-value pair forming process separately for each selected eligible first set and second set. The key-value pair forming process in an eligible first set or second set includes: taking each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs; and, over all the key-value pairs obtained, clustering the key-value pairs having the same key into one set. The key-value pair forming process is then performed again on each set so obtained, until the preset number of iterations is reached, whereby the same documents in the selected eligible first sets and second sets are aggregated into one class.

Preferably, the similarity of the titles of the documents is determined according to the similarity between the signatures of the titles and the Hamming distance between the titles. The similarity of the author, the publication source, and the publication year is determined by merging the author, the publication source, and the publication year of the standardized document into a character string, calculating the signature of the merged character string, and then using the similarity between the signatures of the merged character strings and the Hamming distance between the merged character strings. As shown in FIG. 9, the signature calculation unit 1020, the signature clustering unit 1021, the distance calculation unit 1022, and the second screening unit 1023 in the first clustering unit 102 are also used in this embodiment. The details are as follows:

The signature calculation unit 1020 is configured to calculate the signature of the title of the document according to the title of the standardized document, to merge the first author, the publication source, and the publication year of the standardized document into a character string, and to calculate the signature of the merged character string.

The signature clustering unit 1021 is configured to cluster the documents with similar titles according to the signature of the title of each document to obtain a plurality of first clusters, and to cluster the documents with similar merged character strings according to the signature of the merged character string of each document to obtain a plurality of second clusters. Each first cluster or second cluster includes at least two documents.

Specifically, the signature clustering unit 1021 is configured to perform a key-value pair forming process on the signature of each title: the signature of the title is first divided into T blocks, where T is a preset value; each block is used as a key and the signature of the title is used as the value, so that each title corresponds to T key-value pairs. When at least one key is the same among the T key-value pairs corresponding to two titles, the documents corresponding to the two titles are clustered and output as a first cluster. Similarly, the above method is performed on the signature of the merged character string of each document: when at least one key is the same among the T key-value pairs corresponding to two merged character strings, the documents corresponding to the two merged character strings are clustered and output as a second cluster.

The documents with similar titles can be clustered using the mapreduce model to obtain the plurality of first clusters. The mapreduce model includes a map stage and a reduce stage: the input data is processed by map and then by reduce to obtain the output data, and the output of the map stage is in the form of key-value pairs. The T blocks of each title signature are used as input of the map stage, which outputs the T key-value pairs corresponding to each title. In the reduce stage, when at least one key is the same among the T key-value pairs corresponding to two titles, the documents corresponding to the two titles are clustered and output as a first cluster. Similarly, the mapreduce model can be used to cluster documents with similar merged character strings to obtain the plurality of second clusters.

The distance calculation unit 1022 is configured to calculate the Hamming distance between the documents in each first cluster according to the signatures of the titles of the documents in that cluster, and to calculate the Hamming distance between the documents in each second cluster according to the signatures of the merged character strings of the documents in that cluster.

The second screening unit 1023 is configured to select the first clusters whose Hamming distance is less than or equal to a preset threshold, the eligible first clusters so selected being the plurality of first sets obtained by clustering documents with similar titles, and to select the second clusters whose Hamming distance is less than or equal to the preset threshold, the eligible second clusters so selected being the plurality of second sets obtained by clustering documents with similar merged character strings.

As shown in FIG. 10, the extracting unit 10201, the determining unit 10202, and the calculating unit 10203 in the signature calculation unit 1020 are also used in this embodiment. The details are as follows:

The extracting unit 10201 is further configured to divide the merged character string into a plurality of substrings, calculate a length of each substring, and extract a substring whose length of the substring is greater than a preset length.

The determining unit 10202 is further configured to determine an n-gram feature of the extracted substring.

The calculating unit 10203 is configured to calculate a signature of a title of the document according to the determined n-gram feature. The signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the calculated document is an n-bit signature consisting of 0 and 1.

The calculating unit 10203 is further configured to determine an n-gram feature of the extracted substring, and calculate a signature of the merged character string of the document. The simhash algorithm can be used to calculate the signature of the merged string. The calculated signature of the merged string is an n-bit signature consisting of 0 and 1.

In the above four embodiments, the terms first set and second set differ only in name, and serve to distinguish the document sets obtained in the two different ways.

In other embodiments, the apparatus for searching using the document normalization method in the first embodiment or the second embodiment, as shown in FIG. 11, includes: a receiving unit 200, a matching unit 201, and a presentation unit 202.

The receiving unit 200 is configured to receive a keyword input by a user.

The matching unit 201 is configured to match all the documents associated with the keyword according to the keyword.

The presentation unit 202 is configured to send all the associated documents and the publication sources of each associated document to the user.

In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative; the division of the units is merely a logical functional division, and other divisions are possible in an actual implementation.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.

The above-described integrated unit implemented in the form of a software functional unit can be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods of the various embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above are only the preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

A document normalization method, comprising:

Obtaining documents from one or more website sources;

Standardizing the obtained documents;

Clustering documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of document collections;

Calculating the similarity of the documents in each document collection, and screening out the eligible document collections according to the calculated similarity;

Clustering the same documents in the eligible document collections that were screened out, and aggregating the publication sources of the same documents.
The method of claim 1 wherein the similarity of the titles of the documents is determined in at least one of the following manners:

Calculating the signature for the title of the document and calculating the similarity between the title signatures of the document;

The Hamming distance between the titles of the documents is calculated, and the similarity between the titles of the documents is determined according to the Hamming distance.
The method according to claim 1, wherein before calculating the similarity of the documents in each document collection, the method further comprises:

clustering similar documents, according to the similarity of at least one attribute among the author, the publication source, and the publication year of the standardized documents, to obtain a plurality of document collections.
The method according to claim 3, wherein the similarity of at least one attribute among the author, the publication source, and the publication year of the standardized documents is determined in at least one of the following manners:

merging the author, the publication source, and the publication year of each standardized document into a string, calculating a signature of the merged string, and calculating the similarity between the signatures of the merged strings of the documents;

merging the author, the publication source, and the publication year of each standardized document into a string, calculating the Hamming distance between the merged strings, and determining the similarity of the author, the publication source, and the publication year of the documents according to the Hamming distance.
The method according to claim 1, wherein after obtaining the plurality of document collections and before calculating the similarity of the documents in each document collection, the method further comprises:

screening out, based on the Hamming distance between the documents in each document collection, the document collections whose Hamming distance is less than or equal to a preset threshold.
The method according to claim 1, wherein the screening out of the document collections that meet the requirement according to the calculated similarity comprises:

in each document collection, calculating the similarity between the documents in the collection according to a preset weight corresponding to each document attribute, and determining a document collection in which the similarity between the documents is greater than a preset total score to be a document collection that meets the requirement.
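
The weighted screening described above might be sketched as follows. This is only an illustration under stated assumptions: the attribute names, the weights (points out of 100), the threshold of 70, and the rule that every pair in a collection must exceed the threshold are all hypothetical, since the claim does not fix them.

```python
from itertools import combinations

# Hypothetical attribute weights (points out of 100); the claims only say
# that a weight is preset for each document attribute.
WEIGHTS = {"title": 50, "author": 20, "source": 20, "year": 10}


def weighted_similarity(doc_a, doc_b, weights=WEIGHTS):
    """Sum the weights of the attributes on which the two documents agree."""
    return sum(
        weight
        for attr, weight in weights.items()
        if doc_a.get(attr) is not None and doc_a.get(attr) == doc_b.get(attr)
    )


def collection_qualifies(docs, min_total=70, weights=WEIGHTS):
    """Assumed reading: every pair in the collection must score above min_total."""
    return all(
        weighted_similarity(a, b, weights) > min_total
        for a, b in combinations(docs, 2)
    )
```

Exact attribute equality is used here for brevity; a per-attribute similarity score could be substituted without changing the overall shape.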
The method according to claim 1, wherein the clustering of the same documents for the screened-out document collections that meet the requirement comprises:

performing a key-value pair forming process on each screened-out document collection, the key-value pair forming process comprising: taking each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs;

clustering, based on all the key-value pairs obtained, the key-value pairs having the same key into one set; and

performing the key-value pair forming process separately on each set obtained, until a preset number of iterations is reached.
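
One way to picture the iterative key-value clustering above is the sketch below. It is an assumption-laden illustration rather than the claimed implementation: documents are represented by hypothetical string identifiers, each key's value set also contains the key itself for simplicity, and the final step that discards sets contained in larger sets is an added convenience.

```python
from collections import defaultdict


def cluster_same_documents(doc_sets, iterations=3):
    """Merge overlapping document sets by repeated key-value grouping."""
    sets = [frozenset(s) for s in doc_sets]
    for _ in range(iterations):
        grouped = defaultdict(set)
        # Key-value pair formation: each document is a key, and the
        # documents that share a set with it are its values.
        for members in sets:
            for key in members:
                grouped[key].update(members)
        # Pairs with the same key have been gathered into one set per key;
        # identical sets are deduplicated before the next round.
        sets = list({frozenset(values) for values in grouped.values()})
    # Keep only maximal sets (drop any set strictly contained in another).
    return [set(s) for s in sets if not any(s < other for other in sets)]
```

Each round lets membership spread one step further along chains of overlapping sets, so the preset iteration count bounds how long a chain can be fully merged.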
The method of claim 1, wherein the standardizing comprises:

performing word segmentation on the full name of the first author of the document, extracting the first letter of each word, and using the combination of the extracted initials as the author of the standardized document; or,

extracting the longest sentence in the body of the abstract of the document and calculating a signature of the longest sentence; or,

unifying the format of the publication source of the document; or,

unifying the format of the publication time of the document, or retaining only the year of publication of the document.
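
Three of the standardization options above can be sketched briefly. The helper names are hypothetical, and the sketch assumes author names written in Latin script with words separated by whitespace, sentences delimited by common Western or Chinese end punctuation, and four-digit years beginning with 19 or 20.

```python
import re


def author_initials(full_name):
    """Combine the first letter of each word of the first author's name."""
    return "".join(word[0].upper() for word in full_name.split())


def longest_sentence(abstract):
    """Return the longest sentence of the abstract body ('' if none)."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", abstract) if s.strip()]
    return max(sentences, key=len) if sentences else ""


def publication_year(raw_date):
    """Keep only the four-digit year from a free-form publication date."""
    match = re.search(r"(19|20)\d{2}", raw_date)
    return match.group(0) if match else None
```

Reducing each field to a canonical short form makes records of the same document from different sites directly comparable.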
The method of claim 2, wherein the calculating of a signature for the title of the document comprises:

dividing the title of the document into a plurality of subtitles, calculating the length of each subtitle, and extracting the subtitles whose length is greater than a preset length;

determining n-gram features of the extracted subtitles, wherein n takes each positive integer value from 1 to N, and N is a preset positive integer;

calculating the signature of the title of the document based on the determined n-gram features.
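
The title-signature steps above could look like the following sketch. Several choices here are assumptions, not statements of the claimed method: subtitles are split on punctuation, n-grams are taken over characters, and the signature is a 64-bit SimHash-style fingerprint built from MD5 hashes. The names `min_len` and `max_n` stand in for the preset length and the preset N.

```python
import hashlib
import re


def subtitle_ngrams(title, min_len=4, max_n=3):
    """Split a title into subtitles and collect character n-grams, n = 1..max_n."""
    parts = [p.strip() for p in re.split(r"[:;,.?!：；，。？！]+", title.lower())]
    kept = [p for p in parts if len(p) > min_len]  # drop short subtitles
    features = []
    for part in kept:
        for n in range(1, max_n + 1):
            features.extend(part[i:i + n] for i in range(len(part) - n + 1))
    return features


def simhash(features, bits=64):
    """SimHash-style signature: per-bit majority vote over feature hashes."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (value >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def title_signature(title, min_len=4, max_n=3):
    return simhash(subtitle_ngrams(title, min_len, max_n))
```

With a signature of this kind, titles that share most of their n-grams tend to differ in only a few bits, which is what makes a Hamming-distance comparison meaningful.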
A document search method, characterized in that the method comprises:

receiving a keyword input by a user;

searching for documents associated with the keyword according to the keyword;

in the search results, aggregating and displaying the same documents, and displaying the publication sources of each document;

wherein the same documents are normalized using the method of any one of claims 1 to 9.
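
The aggregated display described above can be sketched as a simple grouping step. The sketch assumes that normalization has already tagged every search hit with a cluster identifier; the field names `cluster_id`, `title`, and `source` are hypothetical.

```python
def aggregate_results(hits):
    """Group search hits for the same document and collect their sources."""
    grouped = {}  # dicts keep insertion order, so ranking order is preserved
    for hit in hits:
        entry = grouped.setdefault(
            hit["cluster_id"], {"title": hit["title"], "sources": []}
        )
        if hit["source"] not in entry["sources"]:
            entry["sources"].append(hit["source"])
    return list(grouped.values())
```

Each returned entry corresponds to one displayed result, with the publication sources listed alongside it.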
A document normalization apparatus, comprising:

an acquisition unit, configured to obtain documents from one or more website sources;

a standardization unit, configured to standardize the obtained documents;

a first clustering unit, configured to cluster documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of document collections;

a first screening unit, configured to calculate the similarity among the documents in each document collection, and to screen out the document collections that meet a requirement according to the calculated similarity; and

a second clustering unit, configured to cluster the same documents in the screened-out document collections that meet the requirement, and to aggregate the publication sources of the same documents.
The apparatus according to claim 11, wherein the first clustering unit determines the similarity of the titles of the documents in at least one of the following manners:

calculating a signature for the title of each document and calculating the similarity between the title signatures of the documents;

calculating the Hamming distance between the titles of the documents and determining the similarity between the titles of the documents according to the Hamming distance.
The apparatus according to claim 11, wherein the first clustering unit is further configured to, before the similarity of the documents in each document collection is calculated, cluster similar documents according to the similarity of at least one attribute among the author, the publication source, and the publication year of the standardized documents, to obtain a plurality of document collections.
The apparatus according to claim 13, wherein the first clustering unit determines the similarity of the at least one attribute in at least one of the following manners:

merging the author, the publication source, and the publication year of each standardized document into a string, calculating a signature of the merged string, and calculating the similarity between the signatures of the merged strings of the documents;

merging the author, the publication source, and the publication year of each standardized document into a string, calculating the Hamming distance between the merged strings, and determining the similarity of the author, the publication source, and the publication year of the documents according to the Hamming distance.
The apparatus according to claim 11, further comprising:

a second screening unit, configured to, after the plurality of document collections are obtained and before the similarity of the documents in each document collection is calculated, screen out, based on the Hamming distance between the documents in each document collection, the document collections whose Hamming distance is less than or equal to a preset threshold.
The apparatus according to claim 11, wherein the first screening unit is configured to calculate, in each document collection, the similarity between the documents in the collection according to a preset weight corresponding to each document attribute, and to determine a document collection in which the similarity between the documents is greater than a preset total score to be a document collection that meets the requirement.
The apparatus according to claim 11, wherein the second clustering unit, when clustering the same documents in the screened-out document collections that meet the requirement, specifically performs:

a key-value pair forming process on each screened-out document collection, the key-value pair forming process comprising: taking each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs;

clustering, based on all the key-value pairs obtained, the key-value pairs having the same key into one set; and

the key-value pair forming process separately on each set obtained, until a preset number of iterations is reached.
The apparatus according to claim 11, wherein the standardization unit is specifically configured to perform:

word segmentation on the full name of the first author of the document, extracting the first letter of each word, and using the combination of the extracted initials as the author of the standardized document; or,

extracting the longest sentence in the body of the abstract of the document and calculating a signature of the longest sentence; or,

unifying the format of the publication source of the document; or,

unifying the format of the publication time of the document, or retaining only the year of publication of the document.
The apparatus according to claim 12, wherein the first clustering unit, when calculating a signature for the title of the document, specifically performs:

dividing the title of the document into a plurality of subtitles, calculating the length of each subtitle, and extracting the subtitles whose length is greater than a preset length;

determining n-gram features of the extracted subtitles, wherein n takes each positive integer value from 1 to N, and N is a preset positive integer;

calculating the signature of the title of the document based on the determined n-gram features.
A document search apparatus, characterized in that the apparatus comprises:

a receiving unit, configured to receive a keyword input by a user;

a matching unit, configured to search for documents associated with the keyword according to the keyword;

a presentation unit, configured to aggregate and display the same documents in the search results and to display the publication sources of each document, wherein the same documents are normalized using the apparatus according to any one of claims 11 to 19.
A device, comprising:

one or more processors;

a memory; and

one or more programs stored in the memory, which, when executed by the one or more processors, perform the following operations:

obtaining documents from one or more website sources;

standardizing the obtained documents;

clustering documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of document collections;

calculating the similarity among the documents in each document collection, and screening out the document collections that meet a requirement according to the calculated similarity; and

for the screened-out document collections that meet the requirement, clustering the same documents and aggregating the publication sources of the same documents.
A device, comprising:

one or more processors;

a memory; and

one or more programs stored in the memory, which, when executed by the one or more processors, perform the following operations:

receiving a keyword input by a user;

searching for documents associated with the keyword according to the keyword;

in the search results, aggregating and displaying the same documents, and displaying the publication sources of each document;

wherein the same documents are normalized using the method of any one of claims 1 to 9.
A computer storage medium encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the following operations:

obtaining documents from one or more website sources;

standardizing the obtained documents;

clustering documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of document collections;

calculating the similarity among the documents in each document collection, and screening out the document collections that meet a requirement according to the calculated similarity; and

for the screened-out document collections that meet the requirement, clustering the same documents and aggregating the publication sources of the same documents.
A computer storage medium encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the following operations:

receiving a keyword input by a user;

searching for documents associated with the keyword according to the keyword;

in the search results, aggregating and displaying the same documents, and displaying the publication sources of each document;

wherein the same documents are normalized using the method of any one of claims 1 to 9.