WO2017096777A1 - Document normalization method, document searching method, corresponding apparatuses, device, and storage medium - Google Patents

Document normalization method, document searching method, corresponding apparatuses, device, and storage medium

Info

Publication number
WO2017096777A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
documents
similarity
publication
key
Prior art date
Application number
PCT/CN2016/087058
Other languages
French (fr)
Chinese (zh)
Inventor
黄岳
马晋
张显
张晓婧
曹冰
徐学睿
李玉鹏
杰艺
Original Assignee
百度在线网络技术(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百度在线网络技术(北京)有限公司
Publication of WO2017096777A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Definitions

  • the present invention relates to the field of computer application technologies, and in particular, to a document normalization method, a document search method, and corresponding apparatuses, devices, and storage media.
  • a given document may be available from multiple electronic sources, and the data quality of each electronic source channel is different.
  • the user cannot obtain all the electronic sources of the same document at once and can only search and view them one source at a time, which is not conducive to filtering for high-quality and licensed resources and degrades the user experience.
  • the invention provides a document normalization method, a literature search method and a corresponding device, so as to achieve the normalization of the same document, and provide a basis for improving the effect of the literature search.
  • a document normalization method including:
  • the documents of similar titles are clustered to obtain a plurality of document collections;
  • the similarity of the titles of the documents is determined in at least one of the following ways:
  • the Hamming distance between the titles of the documents is calculated, and the similarity between the titles of the documents is determined according to the Hamming distance.
  • before the calculating of the similarity of the documents in each document collection, the method further comprises:
  • according to the similarity of at least one attribute among the author, the publication source, and the publication year of the standardized documents, similar documents are clustered to obtain a plurality of document collections.
  • the similarity of the at least one attribute among the author, the publication source, and the publication year is determined by at least one of the following methods:
  • the author, the publication source, and the publication year of each standardized document are combined into a string, the Hamming distance between the merged strings is calculated, and the similarity of the author, the publication source, and the publication year of the documents is determined according to the Hamming distance.
  • the method further comprises:
  • a collection of documents whose Hamming distance is less than or equal to a preset threshold is screened.
  • the screening of the qualified document collection according to the similarity of the calculated documents comprises:
  • in each document collection, the similarity between the documents is calculated according to the preset weight corresponding to each document attribute, and a document collection in which the similarity between the documents is greater than the preset total score is determined to be a qualified document collection.
  • performing clustering of the same documents on the screened qualified document collections comprises:
  • the key-value pair forming process includes: using each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs;
  • the key-value pair forming process is performed separately for each obtained set until the preset number of iterations is reached.
  • the standardization comprises:
  • Extract the longest sentence in the body part of the document abstract and calculate the signature of the longest sentence
  • the format of the publication time of the document is unified, or only the year of publication of the document is extracted.
  • the calculating a signature for the title of the document comprises:
  • n-gram features of the title are extracted, wherein n takes positive integer values from 1 to N, and N is a preset positive integer;
  • the signature of the title of the document is calculated based on the determined n-gram characteristics.
  • a document search method comprising:
  • a document normalization device comprising:
  • Standardization unit for standardizing the acquired documents
  • a first clustering unit configured to cluster the documents of similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of document collections
  • a first screening unit for calculating the similarity of the documents in each document collection and screening out the qualified document collections according to the calculated similarity of the documents
  • the second clustering unit is configured to perform clustering of the same documents on the selected qualified document collections, and summarize the publication sources of the same documents.
  • the first clustering unit determines the similarity of the titles of the documents in at least one of the following manners:
  • the Hamming distance between the titles of the documents is calculated, and the similarity between the titles of the documents is determined according to the Hamming distance.
  • the first clustering unit is further configured to: before the similarity of the documents in each document collection is calculated, cluster similar documents according to the similarity of at least one attribute among the author, the publication source, and the publication year of the standardized documents, to obtain a plurality of document collections.
  • the first clustering unit determines the similarity of the at least one attribute in at least one of the following manners:
  • the author, the publication source, and the publication year of each standardized document are combined into a string, the Hamming distance between the merged strings is calculated, and the similarity of the author, the publication source, and the publication year of the documents is determined according to the Hamming distance.
  • the method further includes:
  • a second screening unit configured to, after the plurality of document collections are obtained and before the similarity of the documents in each document collection is calculated, calculate the Hamming distance between the documents in each document collection and screen out the document collections whose Hamming distance is less than or equal to a preset threshold.
  • the first screening unit is specifically configured to: in each document collection, calculate the similarity between the documents according to the preset weight corresponding to each document attribute, and determine a document collection in which the similarity between the documents is greater than the preset total score to be a qualified document collection.
  • when performing clustering of the same documents on the screened qualified document collections, the second clustering unit performs:
  • the key-value pair forming process includes: using each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs;
  • the key-value pair forming process is performed separately for each obtained set until the preset number of iterations is reached.
  • the standardization unit is specifically configured to:
  • Extract the longest sentence in the body part of the document abstract and calculate the signature of the longest sentence
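  • the format of the publication time of the document is unified, or only the year of publication of the document is extracted.
  • when calculating the signature of the title of a document, the standardization unit specifically performs:
  • extracting n-gram features of the title, where n is a positive integer from 1 to N and N is a preset positive integer;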
  • the signature of the title of the document is calculated based on the determined n-gram characteristics.
  • a document search device comprising:
  • a receiving unit configured to receive a keyword input by a user
  • a matching unit configured to search for a document associated with the keyword according to the keyword
  • a presentation unit for synthesizing the same documents in the search results and presenting the publication sources of the respective documents, wherein the same documents are normalized using the device of the document normalization.
  • the present invention can accurately aggregate the same documents together and clearly provide the source of the literature.
  • the different publication sources of the same document can be brought together and presented to the user. Improved user experience.
  • Figure 1 is a schematic diagram of a search document in the prior art.
  • FIG. 2 is a flow chart of a method for normalizing documents according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of standardization of an author in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of clustering the same document according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a search result presentation provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of signature processing of two titles in the reduce phase in the embodiment of the present invention.
  • FIG. 7 is a flow chart of another method for normalizing documents according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of an embodiment of the first clustering unit of FIG. 8.
  • FIG. 10 is a schematic structural diagram of an embodiment of the signature calculation unit of FIG. 8.
  • Figure 11 is a block diagram showing the structure of an apparatus for searching using the document normalization method.
  • FIG. 2 is a flow chart showing the first embodiment of the document normalization method of the present invention. As shown in Figure 2, the document normalization method includes:
  • documents are obtained from all websites by way of web crawling.
  • the normalization is to normalize the attributes of the document, the attributes of the document including title, author, abstract, publication source, publication time, and the like.
  • the standardization of the title includes the segmentation of the title, the unification of the full-width half-width, the removal of the punctuation of the title, and the like.
  • For example, the title of a document is re:Coagulation and -Flocculation, which becomes re Coagulation and--Flocculation after standardization of the title.
  • The principle of author standardization is to extract the full name of the first author of the document, divide the full name into multiple words, extract the first letter of each word, and finally sort all the extracted initials and use the result as the author of the document. When the full name is divided into words, multiple uppercase letters abbreviated together are each treated as a separate word.
  • FIG. 3 is a schematic diagram of the standardization of the author in the present invention.
  • the names of an author obtained from the network are: Carlos N.Slia, Carlos Nascimento.Slia and SN Carlos.
  • Carlos N.Slia is divided into three words: Carlos, N and Slia.
  • the first letters of these three words are C, N, and S.
  • Carlos Nascimento.Slia was split into Carlos, Nascimento and Slia, taking the first letters of the three words C, N, S. SN Carlos is split into S, N, and Carlos.
  • the first letters of these three words are S, N, and C.
  • Sorting the initials alphabetically gives CNS for all three variants of the name.
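  • The author standardization described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `normalize_author` and the choice to split names on whitespace and periods are assumptions made for the example.

```python
import re


def normalize_author(full_name):
    """Reduce a first author's full name to its sorted initials."""
    words = []
    # Assumption: words are separated by whitespace or periods.
    for token in re.split(r"[\s.]+", full_name):
        if not token:
            continue
        if len(token) > 1 and token.isupper():
            # Several uppercase letters written together (e.g. "SN")
            # are treated as one word per letter.
            words.extend(token)
        else:
            words.append(token)
    initials = sorted(word[0].upper() for word in words)
    return "".join(initials)


for name in ("Carlos N.Slia", "Carlos Nascimento.Slia", "SN Carlos"):
    print(name, "->", normalize_author(name))
```

  • Under these assumptions, the three name variants from the example reduce to the same author key, so they can be compared directly.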
  • the principle of standardization of the abstract is to extract the main part of the abstract, calculate the length of the sentence in the main part, find the sentence with the longest length, and calculate the signature of the abstract of the document. In other embodiments, sentences of other lengths may also be used.
  • the signature of the abstract of the document can be calculated using the Message Digest Algorithm (MD5).
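  • A minimal sketch of the abstract signature, assuming the main part of the abstract has already been extracted and that sentences end with ".", "!" or "?". The function name `abstract_signature` and the sentence-splitting rule are illustrative assumptions; only the use of MD5 on the longest sentence comes from the description.

```python
import hashlib
import re


def abstract_signature(abstract_body):
    """Return the MD5 hex digest of the longest sentence, or None if empty."""
    # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract_body)]
    sentences = [s for s in sentences if s]
    if not sentences:
        return None
    longest = max(sentences, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()


a = "We study X. This is the longest sentence of the abstract body by far."
b = "Short one. This is the longest sentence of the abstract body by far. Tiny."
print(abstract_signature(a) == abstract_signature(b))
```

  • Two abstracts that share the same longest sentence receive the same signature even if the surrounding text differs, which is what makes the signature usable as a comparison attribute.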
  • Publication sources include journals, conferences, and collections.
  • the standardization of the publication source mainly unifies its format, including unifying uppercase and lowercase, deleting symbols, and unifying full-width and half-width characters.
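  • The publication source standardization can be sketched as below. The function name `normalize_source` is hypothetical, and using Unicode NFKC normalization to fold full-width characters into half-width ones is an assumption of this example rather than something the description specifies.

```python
import re
import unicodedata


def normalize_source(source):
    """Unify case and character width, and delete symbols."""
    # NFKC maps full-width letters and digits to their half-width forms.
    text = unicodedata.normalize("NFKC", source).lower()
    # Delete symbols: anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse the whitespace left behind by deleted symbols.
    return " ".join(text.split())


print(normalize_source("Journal of Applied & Computational Mathematics"))
```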
  • Standardization of publication time includes extracting year data from publication time.
  • For example, if the publication time is given as 1990, 1990-11-11, or 1990/11/11, standardization of the publication time yields 1990.
  • Alternatively, the publication time can be unified into a single format; for example, the expressions 1990-11-11, 1990/11/11, November 11, 1990, and 1990.11.11 are all unified into 1990-11-11.
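  • A minimal sketch of extracting the year from differently formatted publication times. The function name `extract_year` and the rule "the first standalone four-digit number is the year" are assumptions made for illustration.

```python
import re


def extract_year(publication_time):
    """Return the first standalone four-digit number as the year, or None."""
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", publication_time)
    return match.group(1) if match else None


for raw in ("1990", "1990-11-11", "1990/11/11", "November 11, 1990", "1990.11.11"):
    print(raw, "->", extract_year(raw))
```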
  • the first set includes at least two documents.
  • S13 Calculate the similarity of the document in each first set, and select a plurality of first sets that meet the condition according to the similarity of the calculated documents.
  • The weight corresponding to each document attribute is set in advance, and the document attribute may be a feature such as the author, the abstract, the publication source, or the publication time.
  • The similarity of the documents in each first set is calculated according to the preset weight corresponding to each document attribute, and a first set in which the similarity of the documents is greater than the preset total score is determined to be a first set that meets the condition.
  • For example, suppose a first set contains two documents, document a and document b, and that the author weight is 4, the abstract weight is 2, the journal weight is 2, the publication time weight is 2, and the preset total score is 5.
  • The features of document a are: title: a General Stability Result for Viscoelastic Equations with Singular Kernels; author: MM Cavalcanti; journal: missing; published: 1999-02-11; abstract signature: b47b61cad59b93c5ad99e8820b71f4db. The features of document b are: title: a General Stabilities Result for Viscoelastic Equations With Singular Kernels; author: MC Murphy; journal: Journal of Applied & Computational Mathematics; published: 1999; abstract signature: b47b61cad59b93c5ad99e8820b71f4db. After standardization, the author of document a is the same as the author of document b, so the author attribute contributes a value of 1 * 4. By the same reasoning, the publication source of document a is different from the publication source of document b.
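  • The weighted comparison in this example can be sketched as follows. The scoring rule used here (an attribute contributes its full weight when the standardized values match and 0 otherwise, and a missing value never matches) is an assumption made for illustration, as are the names `WEIGHTS`, `TOTAL_SCORE`, and `similarity_score`. The attribute values are written in their standardized form (sorted author initials, publication year only).

```python
# Weights and threshold taken from the example in the text.
WEIGHTS = {"author": 4, "abstract": 2, "journal": 2, "year": 2}
TOTAL_SCORE = 5


def similarity_score(doc_a, doc_b, weights=WEIGHTS):
    """Sum the weights of the attributes on which two documents agree."""
    score = 0
    for attribute, weight in weights.items():
        value_a = doc_a.get(attribute)
        value_b = doc_b.get(attribute)
        # Assumption: a missing attribute (None) counts as a mismatch.
        if value_a is not None and value_a == value_b:
            score += weight
    return score


doc_a = {"author": "CMM", "journal": None, "year": "1999",
         "abstract": "b47b61cad59b93c5ad99e8820b71f4db"}
doc_b = {"author": "CMM",
         "journal": "journal of applied computational mathematics",
         "year": "1999",
         "abstract": "b47b61cad59b93c5ad99e8820b71f4db"}

score = similarity_score(doc_a, doc_b)
print(score, score > TOTAL_SCORE)
```

  • Under this assumed rule the pair scores 4 (author) + 2 (abstract) + 0 (journal) + 2 (year) = 8, which exceeds the preset total score of 5, so the set would be kept.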
  • The key-value pair forming process is performed separately for each screened first set that meets the condition. For such a first set, the process includes: using each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs; clustering, among all the key-value pairs obtained, the key-value pairs with the same key into one set; and performing the key-value pair forming process again on each obtained set until a preset number of iterations is reached, the preset number of iterations being an empirical value.
  • The mapreduce model may be used to cluster the same documents in the plurality of screened first sets that meet the condition. Specifically, each screened first set is used as input to the map stage, and the map stage outputs the key-value pairs corresponding to each such first set. All of these key-value pairs are sorted, and the sorted key-value pairs are used as the input data of the reduce stage. In the reduce stage, the key-value pairs with the same key are clustered into one set, so the reduce stage outputs multiple sets, and the documents in each set again form multiple key-value pairs as the input of the next iteration.
  • The above method is iterated until the preset number of iterations is reached, so that the same documents in the screened first sets are aggregated into one class, and the class includes all the publication sources of the document.
  • For example, each screened first set that meets the condition includes two documents, and the screened first sets are (a, b), (b, c), and (d, f).
  • The key-value pairs output by the map stage for these first sets are ab, ba, bc, cb, df, and fd.
  • The documents in each set output by the reduce stage again form multiple key-value pairs as input to the map stage. After multiple iterations, (a, b, c) is obtained as one class and (d, f) as another.
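  • The iteration in this example can be sketched in plain Python, without an actual MapReduce framework. The function names `map_stage`, `reduce_stage`, and `cluster_same_documents` are hypothetical, and including the key itself in each reduced set is an assumption that lets the sets grow from one iteration to the next.

```python
from collections import defaultdict


def map_stage(doc_sets):
    """Emit (key, value) pairs: each document is a key, the others are values."""
    pairs = []
    for docs in doc_sets:
        for key in docs:
            for value in docs:
                if value != key:
                    pairs.append((key, value))
    return pairs


def reduce_stage(pairs):
    """Cluster the pairs that share a key into one set per key."""
    grouped = defaultdict(set)
    for key, value in sorted(pairs):
        grouped[key].add(key)
        grouped[key].add(value)
    # Identical sets produced by different keys collapse into one.
    return {frozenset(docs) for docs in grouped.values()}


def cluster_same_documents(doc_sets, iterations):
    """Repeat map and reduce for a preset number of iterations."""
    current = {frozenset(docs) for docs in doc_sets}
    for _ in range(iterations):
        current = reduce_stage(map_stage(current))
    return current


classes = cluster_same_documents([("a", "b"), ("b", "c"), ("d", "f")], iterations=2)
print(sorted(sorted(c) for c in classes))
```

  • With the three input sets from the example, the first iteration yields {a, b}, {a, b, c}, {b, c}, and {d, f}; the second iteration merges these into the two classes (a, b, c) and (d, f). Larger chains of overlapping sets need more iterations, which is consistent with the iteration count being an empirical value.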
  • A search method using the document normalization method of this embodiment includes: receiving a keyword input by a user; matching, according to the keyword, all the documents associated with the keyword; and sending all the associated documents and the publication sources of each associated document to the user. Specifically, a link to each publication source of each associated document is displayed to the user. In this way, the different publication source links of the same article are brought together for the user, which improves the user experience.
  • the present invention is a schematic diagram in which the publication sources of the same document are gathered together. Compared with the content shown in FIG. 1, the same document as "simulation study on angle measurement accuracy of star sensor" in FIG.
  • The sources of the same document include ResearchGate, SPIE, and reviews.spiedigita; the same documents from these sources are aggregated and presented together, and the sources are shown for the user to select.
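  • The aggregated presentation can be sketched as grouping search hits by the class assigned during normalization. The record layout (`cluster_id`, `title`, `source`) and the function name `group_results` are hypothetical and are used only to illustrate the idea.

```python
from collections import OrderedDict


def group_results(results):
    """Merge hits that belong to the same document class and list their sources."""
    grouped = OrderedDict()
    for hit in results:
        entry = grouped.setdefault(
            hit["cluster_id"], {"title": hit["title"], "sources": []}
        )
        if hit["source"] not in entry["sources"]:
            entry["sources"].append(hit["source"])
    return list(grouped.values())


hits = [
    {"cluster_id": 1, "source": "ResearchGate",
     "title": "simulation study on angle measurement accuracy of star sensor"},
    {"cluster_id": 1, "source": "SPIE",
     "title": "simulation study on angle measurement accuracy of star sensor"},
]
print(group_results(hits))
```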
  • the similarity of the titles of the documents is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents, and S12 includes:
  • the documents with similar titles are clustered to obtain a plurality of first clusters.
  • the first cluster includes at least two documents.
  • The key-value pair forming process is performed on the signature of each title: the signature of the title is first divided into T blocks, where T is a preset value; each block is used as a key, and the signature of the title is used as the value, so that each title corresponds to T key-value pairs.
  • the documents with similar titles can be clustered by using the mapreduce model to obtain a plurality of first clusters, and the mapreduce model includes a map phase and a reduce phase.
  • the input data is processed by the map, and then subjected to reduce processing to finally obtain the output data.
  • the output of the map phase is in the form of a key-value pair.
  • The T blocks of each title are used as input to the map stage, and the map stage outputs the T key-value pairs corresponding to each title.
  • the reduce stage clusters the documents corresponding to the two titles into a first cluster output.
  • For example, the signature of the title of document a is 1111111000100100, which is divided into the four blocks 1111, 1110, 0010, and 0100, and the signature of the title of document b is 1101111000000000, which is divided into the four blocks 1101, 1110, 0000, and 0000. The second block of the signature of the title of document a is identical to the second block of the signature of the title of document b, so document a and document b are clustered into a first cluster.
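  • The block-based candidate clustering can be sketched as follows, using the 16-bit signatures 1111111000100100 (document a) and 1101111000000000 (document b) and T = 4. The function names are hypothetical, the signature length is assumed to be divisible by T, and the sketch keys on the block value alone, as the description states; keying on the block position as well would be a stricter variant.

```python
from collections import defaultdict
from itertools import combinations


def split_signature(signature, t):
    """Divide a signature string into t equally sized blocks."""
    size = len(signature) // t
    return [signature[i * size:(i + 1) * size] for i in range(t)]


def map_blocks(doc_id, signature, t):
    """Map stage: each block is a key, the document and its signature the value."""
    return [(block, (doc_id, signature)) for block in split_signature(signature, t)]


def reduce_candidates(pairs):
    """Reduce stage: documents sharing a block form a two-document first cluster."""
    grouped = defaultdict(set)
    for block, (doc_id, _signature) in pairs:
        grouped[block].add(doc_id)
    clusters = set()
    for docs in grouped.values():
        for first, second in combinations(sorted(docs), 2):
            clusters.add((first, second))
    return clusters


pairs = map_blocks("a", "1111111000100100", 4) + map_blocks("b", "1101111000000000", 4)
print(reduce_candidates(pairs))
```

  • Splitting the signature into blocks means that only documents sharing at least one block are compared further, instead of every document being compared with every other.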
  • If the corresponding values at a certain position of the signatures of two titles are different, the Hamming distance is 1; if the values differ at two positions, the Hamming distance is 2, and so on.
  • For example, the signature of the title of document a is 1111111000100100 and the signature of the title of document b is 1101111000000000. The two signatures differ at the 3rd, 11th, and 14th digits, so the Hamming distance between document a and document b is 3.
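  • The Hamming distance between two equal-length signatures can be computed as in the sketch below; the function name `hamming_distance` is chosen for illustration, and the two inputs are the 16-bit signatures of documents a and b from the example.

```python
def hamming_distance(signature_a, signature_b):
    """Count the positions at which two equal-length signatures differ."""
    if len(signature_a) != len(signature_b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(signature_a, signature_b))


print(hamming_distance("1111111000100100", "1101111000000000"))
```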
  • S123 Screen out the first clusters whose Hamming distance is less than or equal to a preset threshold; the screened first clusters that meet the condition are the plurality of first sets obtained by clustering documents with similar titles.
  • the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
  • the method further includes:
  • For example, the preset length is 10 characters.
  • The standardized title "R Genre classification via an lz78-based string kernel" is divided into "R" and "Genre classification via an lz78-based string kernel".
  • "R" is 1 character long, which is less than 10 characters, so "R" is excluded.
  • For example, if the title is "A B C" and m is 3, the n-gram features of the title are [A, B, C, AB, BC, ABC].
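  • The n-gram feature extraction in this example can be sketched as follows. The function name `title_ngrams` is hypothetical, and the sketch assumes the title has already been segmented into space-separated words and that the words of an n-gram are joined without a separator, as in the example.

```python
def title_ngrams(title, max_n):
    """Return all word n-grams of the title for n from 1 to max_n."""
    words = title.split()
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            features.append("".join(words[start:start + n]))
    return features


print(title_ngrams("A B C", 3))
```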
  • the signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the document is calculated to be an n-bit signature consisting of 0 and 1. For example, a 64-bit signature, a 16-bit signature, and the like.
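  • A minimal simhash sketch over the n-gram features of a title. The description names the simhash algorithm but not its details, so the choices here are assumptions: each feature is hashed with MD5, every feature has weight 1, and the function name `simhash` is chosen for illustration.

```python
import hashlib


def simhash(features, bits=64):
    """Return a simhash signature as a string of '0' and '1' characters."""
    if not 1 <= bits <= 128:
        raise ValueError("bits must be between 1 and 128 when hashing with MD5")
    totals = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        for position in range(bits):
            # A set bit votes +1 for this position, a cleared bit votes -1.
            totals[position] += 1 if (value >> position) & 1 else -1
    return "".join("1" if total > 0 else "0" for total in totals)


signature = simhash(["A", "B", "C", "AB", "BC", "ABC"], bits=16)
print(signature)
```

  • Because each output bit is a majority vote over the feature hashes, titles that share most of their n-gram features tend to produce signatures that differ in only a few bits, which is what makes the Hamming distance between signatures a usable similarity measure.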
  • FIG. 7 is a schematic flowchart of Embodiment 2 of the document normalization method of the present invention. The document normalization method includes:
  • documents are obtained from all websites by way of web crawling.
  • the normalization is to normalize the attributes of the document, the attributes of the document including title, author, abstract, publication source, publication time, and the like.
  • the standardization of the title includes the segmentation of the title, the unification of the full-width half-width, the removal of the punctuation of the title, and the like.
  • The principle of author standardization is to extract the full name of the first author of the document, divide the full name into multiple words, extract the first letter of each word, and finally sort all the extracted initials and use the result as the author of the document.
  • When the first author's full name is divided into words, multiple uppercase letters abbreviated together are each treated as a separate word.
  • the principle of standardization of the abstract is to extract the main part of the abstract, calculate the length of the sentence in the main part, find the sentence with the longest length, and calculate the signature of the abstract of the document. In other embodiments, sentences of other lengths may also be used.
  • the signature of the digest of the document can be calculated using the Message Digest Algorithm (MD5).
  • Publication sources include journals, conferences, and collections.
  • the standardization of the publication source mainly unifies its format, including unifying uppercase and lowercase, deleting symbols, and unifying full-width and half-width characters.
  • Standardization of publication time includes extracting the year from the publication time.
  • Because the publication time may appear in a variety of time formats, standardization includes extracting the year from each of these formats.
  • the weight corresponding to the document attribute is set in advance, and the document attribute may be a feature such as an author, a summary, a publication source, and a publication time.
  • The similarity of the documents in each first set and each second set is calculated according to the preset weight corresponding to each document attribute, and a first set or second set in which the similarity of the documents is greater than the preset total score is determined to be a first set or second set that meets the condition.
  • S24 Perform clustering of the same document on the plurality of selected first sets that are selected and the plurality of selected second sets that are selected, and summarize the publication sources of the same documents. Links to the publication sources of the same literature can be summarized.
  • The key-value pair forming process is performed separately for each screened first set and second set that meets the condition. For such a first set or second set, the process includes: using each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs; clustering, among all the key-value pairs obtained, the key-value pairs with the same key into one set; and performing the key-value pair forming process again on each obtained set until the preset number of iterations is reached, so that the same documents in the screened first sets and second sets that meet the condition are aggregated into one class.
  • the preset number of iterations is an empirical value.
  • The mapreduce model may be used to cluster the same documents in the plurality of screened first sets and second sets. Specifically, each screened first set and second set is used as input to the map stage, and the map stage outputs the key-value pairs corresponding to each first set and second set that meets the condition. All of these key-value pairs are sorted, and the sorted key-value pairs are used as the input data of the reduce stage. In the reduce stage, the key-value pairs with the same key are clustered into one set, so the reduce stage outputs multiple sets, and the documents in each set again form multiple key-value pairs as the input of the next iteration. The above method is iterated until the preset number of iterations is reached, so that the same documents in the screened first sets and second sets are aggregated into one class, and the class includes all the publication sources of the document.
  • A search method using the document normalization method of this embodiment includes: receiving a keyword input by a user; matching, according to the keyword, all the documents associated with the keyword; and sending all the associated documents and the publication sources of each associated document to the user. Specifically, a link to each publication source of each associated document is displayed to the user. In this way, the different publication source links of the same article are brought together for the user, which improves the user experience.
  • the similarity of the titles of the documents is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents.
  • The similarity of the author, publication source, and publication year of the documents is determined by combining the author, publication source, and publication year of each standardized document into a string and calculating the signature of the merged string; the similarity is then determined based on the similarity between the signatures of the merged strings and the Hamming distance between the merged strings of the documents. Then S22 includes:
  • S220 Calculate the signature of the title of the document according to the title of the standardized document, and combine the first author of the standardized document, the publication source and the publication year into a character string, and calculate the signature of the merged character string.
  • S221 According to the signature of the title of each document, cluster pairs of documents with similar titles to obtain a plurality of first clusters, and according to the signature of the merged string of each document, cluster pairs of documents with similar merged strings to obtain a plurality of second clusters.
  • the first cluster or the second cluster includes at least two documents.
  • The key-value pair forming process is performed on the signature of each title: the signature of the title is first divided into T blocks, where T is a preset value; each block is used as a key, and the signature of the title is used as the value, so that the title corresponds to T key-value pairs.
  • each title will correspond to T key-value pairs.
  • the documents corresponding to the two titles are clustered into a first cluster output.
  • the above method is performed for the signature of the merged character string of each document.
  • the documents corresponding to the two merged strings are clustered into a second cluster output.
  • the documents with similar titles can be clustered by using the mapreduce model to obtain a plurality of first clusters, and the mapreduce model includes a map phase and a reduce phase.
  • The input data is processed by map and then by reduce to finally obtain the output data.
  • the output of the map phase is in the form of a key-value pair.
  • The T blocks of each title are used as input to the map stage, and the map stage outputs the T key-value pairs corresponding to each title.
  • the reduce stage clusters the documents corresponding to the two titles into a first cluster output.
  • The mapreduce model can likewise be used to cluster pairs of documents with similar merged strings to obtain a plurality of second clusters.
  • If the corresponding values at a certain position of the signatures of two titles are different, the Hamming distance is 1; if the values differ at two positions, the Hamming distance is 2, and so on. Likewise, if the corresponding values at a certain position of the signatures of two merged strings are different, the Hamming distance is 1; if the values differ at two positions, the Hamming distance is 2, and so on.
  • S223 Screen out the first clusters whose Hamming distance is less than or equal to a preset threshold; the screened first clusters that meet the condition are the plurality of first sets obtained by clustering documents with similar titles. Also screen out the second clusters whose Hamming distance is less than or equal to the preset threshold; the screened second clusters that meet the condition are the plurality of second sets obtained by clustering the documents corresponding to similar merged strings.
  • the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
  • the similarity of the author, publication source, and publication year of the documents may be determined based on the similarity between the signatures of the merged strings of the documents or the Hamming distance between the merged strings of the documents.
  • the method further includes:
  • the signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the calculated document is an n-bit signature consisting of 0 and 1.
  • the simhash algorithm can be used to calculate the signature of the merged string.
  • the calculated signature of the merged string is an n-bit signature consisting of 0 and 1.
  • Figure 8 is a schematic block diagram showing the construction of an embodiment of the apparatus of the present invention.
  • the apparatus includes: an obtaining unit 100, a normalization unit 101, a first clustering unit 102, a first screening unit 103, and a second clustering unit 104.
  • the obtaining unit 100 is configured to obtain documents of all website sources.
  • documents are obtained from all websites by way of web crawling.
  • the normalization unit 101 is for standardizing the acquired documents.
  • the standardization standardizes the attributes of a document, which include the title, author, abstract, publication source, publication time, and the like.
  • the standardization of the title by the normalization unit 101 includes segmentation of the title, unification of full-width and half-width characters, removal of punctuation from the title, and the like.
  • the normalization unit 101 standardizes the author by extracting the full name of the first author of the document, dividing the full name into a plurality of words, extracting the initial of each word, and finally sorting all the extracted initials to form the standardized author of the document.
  • when the first author's full name is divided into words and several uppercase letters are abbreviated together, each uppercase letter is treated as one word.
  • the normalization unit 101 standardizes the abstract by extracting the main part of the abstract, calculating the length of each sentence in the main part, finding the longest sentence, and calculating the signature of the abstract of the document from that sentence. In other embodiments, sentences of other lengths may also be used.
  • the signature of the abstract of the document can be calculated using the Message Digest Algorithm (MD5).
  • the normalization unit 101 standardizes the publication source mainly by unifying its format, including unifying capitalization, deleting symbols, and unifying full-width and half-width characters.
  • the normalization of the publication time by the normalization unit 101 includes extracting the year from the publication time.
  • the publication time of the document has various time formats, and the normalization unit 101 can extract the year from various different time formats.
  • the first clustering unit 102 is configured to cluster the documents of similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of first sets.
  • the first set includes at least two documents.
  • the first screening unit 103 is configured to calculate the similarity of the documents in each first set and, according to the calculated similarity, screen out a plurality of first sets that meet the condition.
  • the first screening unit 103 is configured to preset a weight for each document attribute, where the document attributes may be the author, abstract, publication source, publication time, and the like.
  • the similarity between the documents in each first set is calculated according to the preset weights of the document attributes, and a first set in which the similarity between the documents is greater than a preset total score is determined to be a first set that meets the condition.
  • the second clustering unit 104 is configured to cluster the same documents in the selected plurality of eligible first sets, and summarize the publication sources of the same documents. Links to the publication sources of the same literature can be summarized.
  • the second clustering unit 104 is configured to: perform a key-value pair forming process for each selected first set that meets the condition, where the key-value pair forming process in such a first set includes: using each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs; clustering, among all the key-value pairs obtained, those having the same key into one set; and performing the key-value pair forming process again on each set obtained, until a preset number of iterations is reached, whereby the same documents in the selected first sets are aggregated into one class.
  • the mapreduce model may be used to cluster the same documents in the selected first sets that meet the condition. Specifically, each selected first set is used as the input of the map stage, and the map stage outputs the key-value pairs corresponding to each such first set. All these key-value pairs are sorted and used as the input data of the reduce stage, in which key-value pairs with the same key are clustered into one set, so the reduce stage outputs multiple sets.
  • the documents in each set then form further key-value pairs, which are used as the input of the reduce stage; this is iterated by the above method until the preset number of iterations is reached, whereby the same documents in the selected first sets are aggregated into one class that includes all publication sources of the document.
  • the first clustering unit 102 includes a signature calculation unit 1020, a signature clustering unit 1021, a distance calculation unit 1022, and a second screening unit 1023.
  • the signature calculation unit 1020 is configured to calculate a signature of the title of the document based on the title of the standardized document.
  • the signature clustering unit 1021 is configured to cluster two documents with similar titles according to the signature of the title of each document to obtain a plurality of first clusters.
  • the first cluster includes at least two documents.
  • the signature clustering unit 1021 is configured to: perform a key-value pair forming process on the signature of each title, in which the signature of the title is first divided into T blocks, where T is a preset value, and each block is used as a key with the signature of the title as the value, so that the title corresponds to T key-value pairs. In this way, each title corresponds to T key-value pairs. When the T key-value pairs corresponding to two titles have at least one key in common, the documents corresponding to the two titles are clustered into one first cluster and output.
  • the documents with similar titles can be clustered by using the mapreduce model to obtain a plurality of first clusters, and the mapreduce model includes a map phase and a reduce phase.
  • the input data is processed by the map, and then subjected to reduce processing to finally obtain the output data.
  • the output of the map phase is in the form of a key-value pair.
  • the T blocks of each title's signature are used as the input of the map stage, and the map stage outputs the T key-value pairs corresponding to each title.
  • the reduce stage clusters the documents corresponding to the two titles into a first cluster output.
  • the distance calculation unit 1022 is configured to calculate a Hamming distance between documents in each of the first clusters based on the signature of the title of the document in each of the first clusters.
  • the second screening unit 1023 is configured to select the first clusters whose Hamming distance is less than or equal to a preset threshold; the selected first clusters are the plurality of first sets obtained by clustering documents with similar titles.
  • the similarity of the titles of the documents is determined based on the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents. In other embodiments, the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
  • FIG. 10 is a schematic structural diagram of an embodiment of a signature calculation unit of the present invention.
  • the signature calculation unit 1020 includes an extraction unit 10201, a determination unit 10202, and a calculation unit 10203.
  • the extracting unit 10201 is configured to divide the title of the document into a plurality of subtitles (for example, by splitting at uppercase letters), calculate the length of each subtitle, and extract the subtitles whose length is greater than a preset length.
  • the determining unit 10202 is configured to determine an n-gram feature of the extracted subtitle, the value of the n is from 1 to N, and the value of the N is set according to the length of the extracted subtitle.
  • the calculating unit 10203 is configured to calculate a signature of a title of the document according to the determined n-gram feature.
  • the signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the document is calculated to be an n-bit signature consisting of 0 and 1. For example, it is a 64-bit signature, a 16-bit signature, and the like.
  • the obtaining unit 100, the normalization unit 101, the first clustering unit 102, the first screening unit 103, and the second clustering unit 104 of the apparatus are also used in the fourth embodiment, as detailed below:
  • the obtaining unit 100 is configured to obtain documents of all website sources.
  • documents are obtained from all websites by way of web crawling.
  • the normalization unit 101 is for standardizing the acquired documents.
  • the normalization is to normalize the attributes of the document, the attributes of the document including title, author, abstract, publication source, publication time, and the like.
  • the normalization unit 101 is used for normalization of the title, including segmentation of the title, unification of the full-width half-width, removal of the punctuation of the title, and the like.
  • the normalization unit 101 standardizes the author by extracting the full name of the first author of the document, dividing the full name into a plurality of words, extracting the initial of each word, and finally sorting all the extracted initials to form the standardized author of the document.
  • when the first author's full name is divided into words and several uppercase letters are abbreviated together, each uppercase letter is treated as one word.
  • the normalization unit 101 standardizes the abstract by extracting the main part of the abstract, calculating the length of each sentence in the main part, finding the longest sentence, and calculating the signature of the abstract of the document from that sentence. In other embodiments, sentences of other lengths may also be used.
  • the signature of the abstract of the document can be calculated using the Message Digest Algorithm (MD5).
  • the normalization unit 101 standardizes the publication source mainly by unifying its format, including unifying capitalization, deleting symbols, and unifying full-width and half-width characters.
  • the normalization of the publication time by the normalization unit 101 includes extracting the year from the publication time.
  • the publication time of a document may be in various time formats, and the normalization unit 101 can extract the year from each of them. Besides extracting only the year, the publication time may instead be unified into a single format.
  • the first clustering unit 102 is configured to cluster documents with similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of first sets and, in parallel, to cluster similar documents according to the similarity of the first author, publication source, and publication year of the standardized documents to obtain a plurality of second sets.
  • the first screening unit 103 is configured to calculate the similarity of the documents in each first set and screen out a plurality of first sets that meet the condition according to the calculated similarity, and to calculate the similarity of the documents in each second set and screen out a plurality of second sets that meet the condition according to the calculated similarity.
  • the first screening unit 103 is configured to: in each first set and each second set, calculate the similarity between the documents according to the preset weight of each document attribute, and determine a first set or second set in which the similarity between the documents is greater than a preset total score to be a first set or second set that meets the condition.
  • the second clustering unit 104 is configured to cluster the same documents in the selected first sets and second sets that meet the condition, and to summarize the publication sources of the same documents. Links to the publication sources of the same document can be summarized.
  • the second clustering unit 104 is configured to: perform a key-value pair forming process for each selected first set and second set that meets the condition, where the key-value pair forming process in such a first set or second set includes: using each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs; clustering, among all the key-value pairs obtained, those having the same key into one set; and performing the key-value pair forming process again on each set obtained, until a preset number of iterations is reached, whereby the same documents in the selected first sets and second sets are aggregated into one class.
  • the mapreduce model may be used to cluster the same documents in the selected first sets and second sets. Specifically, each selected first set and second set is used as the input of the map stage, and the map stage outputs the key-value pairs corresponding to each such set. All these key-value pairs are sorted and used as the input data of the reduce stage, in which key-value pairs with the same key are clustered into one set, so the reduce stage outputs multiple sets. The documents in each set then form further key-value pairs, which are used as the input of the reduce stage; this is iterated by the above method until the preset number of iterations is reached, whereby the same documents in the selected first sets and second sets are aggregated into one class that includes all publication sources of the document.
  • the similarity of the titles of the documents is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents.
  • the similarity of the author, publication source, and publication year of the documents is determined by merging the author, publication source, and publication year of each standardized document into a string, calculating the signature of the merged string, and then using the similarity between the signatures of the merged strings of the documents and the Hamming distance between the merged strings of the documents.
  • the signature calculation unit 1020, the signature clustering unit 1021, the distance calculation unit 1022, and the second screening unit 1023 in the first clustering unit 102 are also used in the following embodiments. details as follows:
  • the signature calculation unit 1020 is configured to calculate a signature of the title of the document according to the title of the standardized document, and to merge the first author, publication source, and publication year of the standardized document into a string and calculate the signature of the merged string.
  • the signature clustering unit 1021 is configured to cluster pairs of documents with similar titles according to the signature of the title of each document to obtain a plurality of first clusters, and to cluster pairs of documents with similar merged strings according to the signature of the merged string of each document to obtain a plurality of second clusters.
  • the first cluster or the second cluster includes at least two documents.
  • the signature clustering unit 1021 is configured to: perform a key-value pair forming process on the signature of each title, in which the signature of the title is first divided into T blocks, where T is a preset value, and each block is used as a key with the signature of the title as the value, so that the title corresponds to T key-value pairs.
  • in this way, each title corresponds to T key-value pairs.
  • when the T key-value pairs corresponding to two titles have at least one key in common, the documents corresponding to the two titles are clustered into one first cluster and output.
  • the above method is performed for the signature of the merged character string of each document.
  • the documents corresponding to the two merged strings are clustered into a second cluster output.
  • the documents with similar titles can be clustered by using the mapreduce model to obtain a plurality of first clusters, and the mapreduce model includes a map phase and a reduce phase.
  • the input data is processed by the map, and then subjected to reduce processing to finally obtain the output data.
  • the output of the map phase is in the form of a key-value pair.
  • the T blocks of each title's signature are used as the input of the map stage, and the map stage outputs the T key-value pairs corresponding to each title.
  • the reduce stage clusters the documents corresponding to the two titles into a first cluster output.
  • the mapreduce model can likewise be used to cluster pairs of documents with similar merged strings to obtain a plurality of second clusters.
  • the distance calculation unit 1022 is configured to calculate a Hamming distance between each document in each first cluster according to a signature of a title of each document in each first cluster, and a combined character according to each document in each second cluster The signature of the string calculates the Hamming distance between the documents in each second cluster.
  • if the signatures of two titles differ at one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2, and so on. Likewise, if the signatures of two merged strings differ at one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2, and so on.
  • the second screening unit 1023 is configured to select the first clusters whose Hamming distance is less than or equal to a preset threshold, the selected first clusters being the plurality of first sets obtained by clustering documents with similar titles, and to select the second clusters whose Hamming distance is less than or equal to the preset threshold, the selected second clusters being the plurality of second sets obtained by clustering documents with similar merged strings.
  • the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
  • the similarity of the author, publication source, and publication year of the documents may be determined based on the similarity between the signatures of the merged strings of the documents or the Hamming distance between the merged strings of the documents.
  • the extracting unit 10201, the determining unit 10202, and the calculating unit 10203 in the signature calculation unit 1020 are also used in the following embodiments. details as follows:
  • the extracting unit 10201 is configured to divide the title of the document into a plurality of subtitles (for example, by splitting at uppercase letters), calculate the length of each subtitle, and extract the subtitles whose length is greater than a preset length.
  • the extracting unit 10201 is further configured to divide the merged character string into a plurality of substrings, calculate a length of each substring, and extract a substring whose length of the substring is greater than a preset length.
  • the determining unit 10202 is configured to determine an n-gram feature of the extracted subtitle, the value of the n is from 1 to N, and the value of the N is set according to the length of the extracted subtitle.
  • the determining unit 10202 is further configured to determine an n-gram feature of the extracted substring.
  • the calculating unit 10203 is configured to calculate a signature of a title of the document according to the determined n-gram feature.
  • the signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the calculated document is an n-bit signature consisting of 0 and 1.
  • the calculating unit 10203 is further configured to calculate a signature of the merged string of the document according to the determined n-gram features of the extracted substrings.
  • the simhash algorithm can be used to calculate the signature of the merged string.
  • the calculated signature of the merged string is an n-bit signature consisting of 0 and 1.
  • the terms "first set" and "second set" are merely labels used to distinguish the sets of documents obtained in the two ways.
  • the apparatus for searching using the document normalization method in the first embodiment or the second embodiment, as shown in FIG. 11, includes: a receiving unit 200, a matching unit 201, and a presentation unit 202.
  • the receiving unit 200 is configured to receive a keyword input by a user.
  • the matching unit 201 is configured to match all the documents associated with the keyword according to the keyword.
  • the presentation unit 202 is configured to present all the associated documents, together with the publication sources of each associated document, to the user.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the above-described integrated unit implemented in the form of a software functional unit can be stored in a computer readable storage medium.
  • the above software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods of the various embodiments of the present invention.
  • the foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
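
The key-value pair forming process described above for the signature clustering unit 1021 (split each signature into T blocks, use each block as a key, and cluster documents that share a key) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, signatures are assumed to be bit strings whose length is divisible by T, and prefixing each key with its block index (so that only blocks at the same position match) is an added assumption.

```python
from collections import defaultdict
from itertools import combinations


def split_signature(signature: str, t: int) -> list[tuple[str, str]]:
    """Form T key-value pairs: each block is a key, the full signature is the value."""
    size = len(signature) // t  # assumes len(signature) is divisible by t
    pairs = []
    for i in range(t):
        block = signature[i * size:(i + 1) * size]
        # Assumption: tag the block with its position so only aligned blocks match.
        pairs.append((f"{i}:{block}", signature))
    return pairs


def candidate_clusters(signatures: dict[str, str], t: int) -> set[frozenset[str]]:
    """Return pairs of document ids whose signatures share at least one key."""
    buckets = defaultdict(set)
    # "map" stage: emit the T keys of every document.
    for doc_id, signature in signatures.items():
        for key, _value in split_signature(signature, t):
            buckets[key].add(doc_id)
    # "reduce" stage: documents grouped under the same key form candidate clusters.
    clusters = set()
    for doc_ids in buckets.values():
        for a, b in combinations(sorted(doc_ids), 2):
            clusters.add(frozenset((a, b)))
    return clusters
```

The candidate clusters produced this way are only candidates; the Hamming-distance screening by the second screening unit 1023 would still be applied to them afterwards.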

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Artificial Intelligence (AREA)

Abstract

Disclosed are a document normalization method, a document searching method, corresponding apparatuses, a device, and a storage medium. The document normalization method comprises: acquiring documents from one or more website sources; standardizing the acquired documents; according to the degree of similarity among titles of the standardized documents, clustering documents with similar titles to obtain a plurality of document sets; calculating the degree of similarity among the documents in each document set, and according to the calculated degree of similarity among the documents, screening out a document set which meets a requirement; and clustering the same documents in the document set which is screened out and meets the requirement, and gathering publishing sources of the same documents together. The document searching method comprises: receiving a keyword inputted by a user; according to the keyword, searching out documents associated with the keyword; and in the search result, displaying the same documents by means of aggregation, and displaying the publishing source of each document. Compared with the prior art, normalization of the same documents is implemented in the present invention, and a basis for improving the efficiency of document searching is provided.

Description

Document normalization method, document searching method, corresponding apparatuses, device, and storage medium
This application claims priority to Chinese Patent Application No. 201510888584.5, filed on December 7, 2015, and entitled "Document Normalization Method, Document Search Method, and Corresponding Apparatus".
Technical Field
The present invention relates to the field of computer application technologies, and in particular to a document normalization method, a document search method, and corresponding apparatuses, devices, and storage media.
Background
When conducting scientific research, researchers need to look up scientific literature. Typically they need to locate a specific article precisely and find as many electronic sources of that article as possible. In practice, however, retrieval involves some inconveniences.
Because there are many researchers and a very large number of published scientific documents, some documents have the same author and the same title. Users must work out which entries are the same document and which are not before they can identify what they actually need. This process is tedious and increases the user's search cost.
As shown in FIG. 1, when a user searches for a document, the document may be available from multiple electronic sources, and the data quality of each source varies. The user cannot obtain all electronic sources of the same document and can only view each source as it happens to appear in the results, which makes it hard to pick out high-quality, accessible resources and degrades the user experience.
Summary of the Invention
The present invention provides a document normalization method, a document search method, and corresponding apparatuses, so as to normalize identical documents and provide a basis for improving document search results.
The specific technical solutions are as follows:
A document normalization method, comprising:
acquiring documents from one or more website sources;
standardizing the acquired documents;
clustering documents with similar titles according to the similarity of the titles of the standardized documents, to obtain a plurality of document sets;
calculating the similarity of the documents in each document set, and screening out the document sets that meet a condition according to the calculated similarity; and
for the document sets screened out, clustering the same documents and gathering together the publication sources of the same documents.
According to a preferred embodiment of the present invention, the similarity of the titles of the documents is determined in at least one of the following ways:
calculating a signature for the title of each document, and calculating the similarity between the title signatures of the documents;
calculating the Hamming distance between the titles of the documents, and determining the similarity between the titles according to the Hamming distance.
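The two ways of determining title similarity can be illustrated with a small Python sketch. It assumes a textbook simhash over character n-grams, with MD5 used only as a convenient per-feature hash; the function names, the 64-bit default, and the choice of character-level n-grams are assumptions made for illustration, not details confirmed by this document.

```python
import hashlib


def ngrams(text: str, n: int) -> list[str]:
    """Character n-grams of a title (assumed feature type)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(features: list[str], bits: int = 64) -> int:
    """Textbook simhash: each feature votes +1/-1 on every bit position."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two signatures differ."""
    return bin(a ^ b).count("1")
```

With this sketch, `hamming_distance(simhash(ngrams(t1, 2)), simhash(ngrams(t2, 2)))` gives a distance between two titles `t1` and `t2`; a smaller distance indicates more similar titles.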
According to a preferred embodiment of the present invention, before calculating the similarity of the documents in each document set, the method further comprises:
clustering similar documents according to the similarity of at least one of the author, publication source, and publication year of the standardized documents, to obtain a plurality of document sets.
According to a preferred embodiment of the present invention, the similarity of at least one of the author, publication source, and publication year of the standardized documents is determined in at least one of the following ways:
merging the author, publication source, and publication year of each standardized document into a string, calculating the signature of the merged string, and calculating the similarity between the signatures of the merged strings of the documents;
merging the author, publication source, and publication year of each standardized document into a string, calculating the Hamming distance between the merged strings, and determining the similarity of the author, publication source, and publication year of the documents according to the Hamming distance.
According to a preferred embodiment of the present invention, after the plurality of document sets are obtained and before the similarity of the documents in each document set is calculated, the method further comprises:
selecting, based on the Hamming distance between the documents in each document set, the document sets whose Hamming distance is less than or equal to a preset threshold.
According to a preferred embodiment of the present invention, selecting the qualifying document sets according to the calculated similarity comprises:
in each document set, calculating the similarity between the documents according to a preset weight for each document attribute, and determining the document sets in which the similarity between the documents is greater than a preset total score to be qualifying document sets.
According to a preferred embodiment of the present invention, clustering identical documents within the selected qualifying document sets comprises:
performing a key-value pair formation process on each selected qualifying document set, the process comprising: taking each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs;
clustering, across all the key-value pairs obtained, the key-value pairs that share the same key into one set;
returning to the key-value pair formation process for each resulting set, until a preset number of iterations is reached.
According to a preferred embodiment of the present invention, the standardization comprises:
segmenting the full name of the first author of the document into words, extracting the initial letter of each word, and using the combination of the extracted initials as the standardized author; or
extracting the longest sentence in the body of the document abstract, and computing a signature of that sentence; or
unifying the format of the publication source; or
unifying the format of the publication time, or extracting only the year from the publication time.
According to a preferred embodiment of the present invention, computing a signature for the title of a document comprises:
splitting the title into a plurality of sub-titles, calculating the length of each sub-title, and extracting the sub-titles whose length is greater than a preset length;
determining the n-gram features of the extracted sub-titles, where n takes each positive integer from 1 to N and N is a preset positive integer;
computing the signature of the title according to the determined n-gram features.
A document search method, comprising:
receiving a keyword input by a user;
searching, according to the keyword, for documents associated with the keyword;
in the search results, presenting identical documents in aggregated form, together with the publication sources of each document;
wherein the identical documents are normalized using the document normalization method described above.
A document normalization apparatus, comprising:
an acquisition unit, configured to acquire documents from one or more website sources;
a standardization unit, configured to standardize the acquired documents;
a first clustering unit, configured to cluster documents with similar titles into a plurality of document sets according to the similarity of the titles of the standardized documents;
a first screening unit, configured to calculate the similarity of the documents in each document set and to select the qualifying document sets according to the calculated similarity;
a second clustering unit, configured to cluster identical documents within the selected qualifying document sets and to aggregate the publication sources of the identical documents.
According to a preferred embodiment of the present invention, the first clustering unit determines the similarity of the document titles in at least one of the following ways:
computing a signature for the title of each document, and computing the similarity between the title signatures;
computing the Hamming distance between the titles of the documents, and determining the similarity between the titles according to the Hamming distance.
According to a preferred embodiment of the present invention, the first clustering unit is further configured to, before the similarity of the documents in each document set is calculated, cluster similar documents into a plurality of document sets according to the similarity of at least one of the author, publication source, and publication year of the standardized documents.
According to a preferred embodiment of the present invention, the first clustering unit determines the similarity of the at least one attribute in at least one of the following ways:
merging the author, publication source, and publication year of each standardized document into a string, computing a signature of the merged string, and computing the similarity between the signatures of the documents' merged strings;
merging the author, publication source, and publication year of each standardized document into a string, computing the Hamming distance between the merged strings, and determining the similarity of the author, publication source, and publication year according to the Hamming distance.
According to a preferred embodiment of the present invention, the apparatus further comprises:
a second screening unit, configured to, after the plurality of document sets are obtained and before the similarity of the documents in each document set is calculated, select, based on the Hamming distance between the documents in each document set, the document sets whose Hamming distance is less than or equal to a preset threshold.
According to a preferred embodiment of the present invention, the first screening unit is specifically configured to, in each document set, calculate the similarity between the documents according to a preset weight for each document attribute, and determine the document sets in which the similarity between the documents is greater than a preset total score to be qualifying document sets.
According to a preferred embodiment of the present invention, when clustering identical documents within the selected qualifying document sets, the second clustering unit specifically performs:
a key-value pair formation process on each selected qualifying document set, the process comprising: taking each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs;
clustering, across all the key-value pairs obtained, the key-value pairs that share the same key into one set;
returning to the key-value pair formation process for each resulting set, until a preset number of iterations is reached.
According to a preferred embodiment of the present invention, the standardization unit is specifically configured to:
segment the full name of the first author of the document into words, extract the initial letter of each word, and use the combination of the extracted initials as the standardized author; or
extract the longest sentence in the body of the document abstract, and compute a signature of that sentence; or
unify the format of the publication source; or
unify the format of the publication time, or extract only the year from the publication time.
According to a preferred embodiment of the present invention, when computing a signature for the title of a document, the first clustering unit specifically performs:
splitting the title into a plurality of sub-titles, calculating the length of each sub-title, and extracting the sub-titles whose length is greater than a preset length;
determining the n-gram features of the extracted sub-titles, where n takes each positive integer from 1 to N and N is a preset positive integer;
computing the signature of the title according to the determined n-gram features.
A document search apparatus, comprising:
a receiving unit, configured to receive a keyword input by a user;
a matching unit, configured to search, according to the keyword, for documents associated with the keyword;
a presentation unit, configured to present, in the search results, identical documents in aggregated form together with the publication sources of each document, wherein the identical documents are normalized using the document normalization apparatus described above.
As can be seen from the above technical solutions, the present invention can accurately aggregate identical documents and clearly present their sources. When a user searches for documents, the different publication sources of the same document are gathered together and presented to the user, which improves the user experience.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of document search in the prior art.
FIG. 2 is a flowchart of the document normalization method provided by an embodiment of the present invention.
FIG. 3 is a schematic diagram of standardizing an author in an embodiment of the present invention.
FIG. 4 is a schematic diagram of clustering identical documents, provided by an embodiment of the present invention.
FIG. 5 is a schematic diagram of a search result presentation provided by an embodiment of the present invention.
FIG. 6 is a schematic diagram of processing the signatures of two titles in the reduce stage in an embodiment of the present invention.
FIG. 7 is a flowchart of another document normalization method provided by an embodiment of the present invention.
FIG. 8 is a schematic structural diagram of the apparatus provided by an embodiment of the present invention.
FIG. 9 is a schematic structural diagram of one embodiment of the first clustering unit in FIG. 8.
FIG. 10 is a schematic structural diagram of one embodiment of the signature calculation unit in FIG. 8.
FIG. 11 is a schematic structural diagram of an apparatus that performs search using the document normalization method.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
FIG. 2 is a flowchart of Embodiment 1 of the document normalization method of the present invention. As shown in FIG. 2, the document normalization method comprises:
S10: acquire documents from all website sources.
Specifically, the documents are acquired from all websites by means of a web crawler.
S11: standardize the acquired documents.
In this embodiment of the present invention, standardization is applied to the attributes of a document. The attributes of a document include the title, author, abstract, publication source, publication time, and so on.
Specifically, standardizing the title includes segmenting the title, unifying half-width and full-width characters, removing punctuation from the title, and so on. For example, a document titled re:Coagulation and——Flocculation becomes re Coagulation and--Flocculation after title standardization.
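The title standardization described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `standardize_title` and the set of punctuation marks that are dropped are assumptions, chosen only so that the sketch reproduces the single example given in the text.

```python
import re
import unicodedata

# Assumption: only these marks are dropped; the source does not list them.
PUNCTUATION_TO_DROP = ":;,!?"


def standardize_title(title: str) -> str:
    # NFKC folds full-width letters, digits and punctuation to half-width.
    text = unicodedata.normalize("NFKC", title)
    # Map the long dash (U+2014) to an ASCII hyphen, as in the source example.
    text = text.replace("\u2014", "-")
    # Replace dropped punctuation with a space, then collapse whitespace.
    text = re.sub("[" + re.escape(PUNCTUATION_TO_DROP) + "]", " ", text)
    return " ".join(text.split())


example = standardize_title("re:Coagulation and\u2014\u2014Flocculation")
```

With the example title from the text, `example` is the string re Coagulation and--Flocculation.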
Because different sites may abbreviate an author's name differently, the authors of the documents need to be standardized. The principle is to extract the full name of the first author of the document, split the full name into words, extract the initial letter of each word, and finally sort all the extracted initials and use the result as the author of the document. When the full name is split into words, if several capital letters are abbreviated together, each capital letter is treated as a separate word.
FIG. 3 is a schematic diagram of standardizing an author in the present invention. In this example, the name of one author appears on the network in three forms: Carlos N.Slia, Carlos Nascimento.Slia, and SN Carlos. Carlos N.Slia is split into the three words Carlos, N, and Slia, whose initials are C, N, and S. Carlos Nascimento.Slia is split into Carlos, Nascimento, and Slia, whose initials are C, N, and S. SN Carlos is split into S, N, and Carlos, whose initials are S, N, and C. Finally, the initials are sorted alphabetically, which gives CNS in all three cases.
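The author standardization illustrated by FIG. 3 can be sketched as below. The function name `standardize_author` is hypothetical, and splitting on whitespace and periods is an assumption inferred from the three example names; the source does not specify the tokenizer.

```python
import re


def standardize_author(full_name: str) -> str:
    # Assumption: words are separated by whitespace or periods.
    tokens = [t for t in re.split(r"[\s.]+", full_name) if t]
    initials = []
    for token in tokens:
        if len(token) > 1 and token.isupper():
            # Several capitals written together: each one is its own word.
            initials.extend(token)
        else:
            initials.append(token[0].upper())
    # Sort alphabetically so that name order does not matter.
    return "".join(sorted(initials))


names = ["Carlos N.Slia", "Carlos Nascimento.Slia", "SN Carlos"]
keys = [standardize_author(name) for name in names]
```

All three spellings map to the same key, CNS, so documents listing the author in any of these forms can be compared on the author attribute.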
The principle of standardizing the abstract is to extract the body of the abstract, calculate the length of each sentence in the body, find the longest sentence, and compute its signature as the signature of the document's abstract. In other embodiments, a sentence of another length may be used instead. The signature of the abstract may be computed with Message-Digest Algorithm 5 (MD5).
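A minimal sketch of the abstract signature follows. It assumes the caller has already isolated the body of the abstract, and it splits sentences on a small set of terminators; both the function name `abstract_signature` and the sentence-splitting rule are assumptions, since the source only names MD5 and the longest-sentence rule.

```python
import hashlib
import re


def abstract_signature(abstract_body: str):
    # Assumption: sentences end at '.', '!', '?' or the CJK full stop.
    parts = re.split(r"[.!?\u3002]+", abstract_body)
    sentences = [p.strip() for p in parts if p.strip()]
    if not sentences:
        return None
    # Ties are resolved in favour of the first longest sentence.
    longest = max(sentences, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()


body = "We study decay. The energy decays at a general rate under weak assumptions. Done."
signature = abstract_signature(body)
```

Because MD5 is an exact hash, two abstracts receive the same signature only if their longest sentences are character-for-character identical after extraction.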
Publication sources include journals, conferences, document collections, and so on. Standardizing the publication source mainly means unifying its format, including unifying letter case, deleting symbols, and unifying half-width and full-width characters.
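The three operations named above (case, symbols, character width) might be combined as in this sketch. The function name `standardize_source` and the choice to lowercase and to treat every non-alphanumeric, non-space character as a symbol are assumptions.

```python
import re
import unicodedata


def standardize_source(source: str) -> str:
    # Fold full-width characters to half-width, then unify case.
    text = unicodedata.normalize("NFKC", source).lower()
    # Assumption: anything that is not a word character or space is a symbol.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


key = standardize_source("Journal of Applied & Computational Mathematics")
```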
Standardizing the publication time includes extracting the year from it. On the network, publication times appear in many different formats, and standardization extracts the year from each of them. For example, the publication times 1990, 1990-11-11, and 1990/11/11 all become 1990 after standardization. Besides extracting only the year, the times may instead be unified into a single representation, for example converting 1990-11-11, 1990/11/11, 1990年11月11日 (11 November 1990), and 1990.11.11 to 1990-11-11.
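Year extraction can be sketched with a single regular expression. The function name `extract_year` and the accepted range (1500 to 2099) are assumptions made for illustration only.

```python
import re

# Assumption: a plausible publication year is a standalone 4-digit number
# between 1500 and 2099.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def extract_year(published: str):
    match = YEAR_PATTERN.search(published)
    return match.group(1) if match else None


raw_dates = ["1990", "1990-11-11", "1990/11/11", "1990.11.11"]
years = [extract_year(d) for d in raw_dates]
```

Each of the formats listed in the text yields the year 1990.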
S12: according to the similarity of the titles of the standardized documents, cluster documents with similar titles to obtain a plurality of first sets. Each first set includes at least two documents.
S13: calculate the similarity of the documents in each first set, and select the qualifying first sets according to the calculated similarity.
Specifically, a weight is preset for each document attribute; the attributes may be features such as author, abstract, publication source, and publication time. In each first set, the similarity of the documents is calculated according to the preset attribute weights, and a first set in which the similarity of the documents is greater than the preset total score is determined to be a qualifying first set.
For example, suppose a first set contains two documents, the author weight is 4, the abstract weight is 2, the journal weight is 2, the publication time weight is 2, and the preset total score is 5.
Document a has the following features. Title: A General Stability Result for Viscoelastic Equations with Singular Kernels; author: MM Cavalcanti; journal: missing; publication time: 1999-02-11; abstract signature: b47b61cad59b93c5ad99e8820b71f4db.
Document b has the following features. Title: A General Stabilities Result for Viscoelastic Equations with Singular Kernels; author: MC Murphy; journal: Journal of Applied & Computational Mathematics; publication time: 1999; abstract signature: b47b61cad59b93c5ad99e8820b71f4db.
After standardization the authors of document a and document b are the same (both reduce to the sorted initials CMM), so the author feature contributes 1*4. The publication sources of the two documents differ, so the publication source feature contributes 0*2. The abstract signatures are identical and the publication years are both 1999, so each of these features contributes 1*2. The calculated similarity of the two documents is therefore 1*4+0*2+1*2+1*2=8>5, and document a and document b are considered identical. If document b and a document c are also identical, then documents a, b, and c are all identical. In this way identical documents can be clustered together.
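The weighted comparison in this example can be sketched as below, using the weights and threshold from the text. The dictionary layout, the attribute names, and the rule that a missing attribute never matches are assumptions; the attribute values are assumed to be already standardized (author initials, year only).

```python
# Weights and threshold taken from the example in the text.
WEIGHTS = {"author": 4, "source": 2, "abstract_sig": 2, "year": 2}
PRESET_TOTAL_SCORE = 5


def similarity_score(doc_a: dict, doc_b: dict, weights=WEIGHTS) -> int:
    score = 0
    for attribute, weight in weights.items():
        a, b = doc_a.get(attribute), doc_b.get(attribute)
        # Assumption: a missing attribute (None) counts as a mismatch.
        if a is not None and a == b:
            score += weight
    return score


doc_a = {"author": "CMM", "source": None, "year": "1999",
         "abstract_sig": "b47b61cad59b93c5ad99e8820b71f4db"}
doc_b = {"author": "CMM", "source": "journal of applied computational mathematics",
         "year": "1999", "abstract_sig": "b47b61cad59b93c5ad99e8820b71f4db"}

score = similarity_score(doc_a, doc_b)        # 4 + 0 + 2 + 2
is_same_document = score > PRESET_TOTAL_SCORE
```

The score is 8, which exceeds the preset total score of 5, matching the conclusion in the text.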
S14: cluster the identical documents in the selected qualifying first sets, and aggregate the publication sources of the identical documents. The links to the publication sources of the identical documents may be aggregated.
Specifically, a key-value pair formation process is performed on each selected qualifying first set. For one qualifying first set, the process comprises: taking each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs; clustering, across all the key-value pairs obtained, the pairs with the same key into one set; and returning to the key-value pair formation process for each resulting set until a preset number of iterations is reached. The preset number of iterations is an empirical value.
The MapReduce model may be used to cluster the identical documents in the selected qualifying first sets. Specifically, each selected qualifying first set is used as input to the map stage, and the map stage outputs the key-value pairs corresponding to that set. The key-value pairs of all selected qualifying first sets are sorted by key, and the sorted pairs are used as the input data of the reduce stage, which clusters the pairs with the same key into one set. The reduce stage therefore outputs a plurality of sets, and the documents in each set are again formed into key-value pairs and fed back as input. Iterating in this way until the preset number of iterations is reached aggregates the identical documents in the selected qualifying first sets into one class, which includes all the publication sources of that document.
For example, as shown in FIG. 4, suppose each selected qualifying first set contains two documents and the selected sets are (a,b), (b,c), and (d,f). The key-value pairs output by the map stage for these sets are a-b, b-a, b-c, c-b, d-f, f-d. Sorted by key, they are: a-b, b-a, b-c, c-b, d-f, f-d. The reduce stage outputs [a b], [a b c], [c b], [d f], [f d]. Following the method above, the documents in [a b] are paired to form the key-value pairs a-b and b-a; likewise, the documents in [a b c], [c b], [d f], and [f d] are paired to form key-value pairs, which are fed back as input for the next round. After several such iterations, (a,b,c) forms one class and (d,f) forms another.
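The iterative key-value clustering of this example can be imitated in a single process as follows. This is a sketch of the idea only, without an actual MapReduce framework; the function name `cluster_identical`, the default iteration count, and the final step that discards groups contained in a larger group are assumptions.

```python
from collections import defaultdict


def cluster_identical(qualifying_sets, iterations=3):
    groups = {frozenset(s) for s in qualifying_sets}
    for _ in range(iterations):
        merged = defaultdict(set)
        for group in groups:
            for key in group:
                # "Map": the document is the key, the others are its value.
                # "Reduce": values that share a key are merged with the key.
                merged[key].add(key)
                merged[key].update(group - {key})
        groups = {frozenset(docs) for docs in merged.values()}
    # Assumption: keep only groups that are not contained in a larger one.
    maximal = [g for g in groups if not any(g < other for other in groups)]
    return sorted(sorted(g) for g in maximal)


classes = cluster_identical([("a", "b"), ("b", "c"), ("d", "f")])
```

For the sets (a,b), (b,c), and (d,f) the result is the two classes (a,b,c) and (d,f). Longer chains of pairwise matches need more rounds before every member sees every other member, which is one reason the iteration count is left as a tunable, empirical value.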
Further, in other embodiments, a search method that uses the document normalization method of this embodiment comprises: receiving a keyword input by a user; matching, according to the keyword, all documents associated with the keyword; and sending to the user all the associated documents together with the aggregated publication sources of each associated document. Specifically, the links to the aggregated publication sources of each associated document are displayed to the user. In this way the links to the different publication sources of the same article are gathered together for the user, which improves the user experience. FIG. 5 is a schematic diagram in which the present invention gathers the publication sources of the same document. Compared with the content shown in FIG. 1, in FIG. 5 the documents identical to "simulation study on angle measurement accuracy of star sensor" are aggregated, and the links to all sources of that document are presented to the user, as marked by the box in FIG. 5. The sources of this document include ResearchGate, SPIE, and reviews.spiedigita; that is, the identical documents from these sources are presented in aggregated form with each source shown, making it easy for the user to choose.
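How a search front end might consume the normalized classes can be sketched as below. Everything here is hypothetical: the record layout, the function name `search_documents`, and the substring match on the title stand in for whatever retrieval the real system uses. The point is only that one result entry carries all sources of the same document.

```python
def search_documents(keyword: str, clusters: list) -> list:
    # Assumption: a document matches when the keyword occurs in its title.
    needle = keyword.lower()
    results = []
    for cluster in clusters:
        if needle in cluster["title"].lower():
            # One entry per normalized document, with every source listed.
            results.append({"title": cluster["title"],
                            "sources": list(cluster["sources"])})
    return results


clusters = [
    {"title": "simulation study on angle measurement accuracy of star sensor",
     "sources": ["ResearchGate", "SPIE", "reviews.spiedigita"]},
    {"title": "A General Stability Result for Viscoelastic Equations with Singular Kernels",
     "sources": ["Journal of Applied & Computational Mathematics"]},
]
hits = search_documents("star sensor", clusters)
```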
Preferably, as one implementation of S12, the similarity of the document titles is determined according to both the similarity between the title signatures and the Hamming distance between the titles, in which case S12 comprises:
S120: compute the signature of each document's title from the standardized title.
S121: according to the title signature of each document, cluster documents with similar titles to obtain a plurality of first clusters. Each first cluster includes at least two documents.
Specifically, a key-value pair formation process is performed on the signature of each title: the signature is first split into T chunks, T being a preset value; each chunk is used as a key and the full signature of the title as the value, so that the title corresponds to T key-value pairs. In this way every title corresponds to T key-value pairs. When at least one key is the same among the T key-value pairs of two titles, the documents corresponding to the two titles are clustered into one first cluster and output.
The MapReduce model may be used to cluster documents with similar titles into a plurality of first clusters. The MapReduce model comprises a map stage and a reduce stage: the input data is processed by map and then by reduce to produce the output data, and the output of the map stage takes the form of key-value pairs. The T chunks of each title's signature are used as the input of the map stage, and the map stage outputs the T key-value pairs corresponding to each title. In the reduce stage, when at least one key is the same among the T key-value pairs of two titles, the documents corresponding to the two titles are clustered into one first cluster and output.
For example, as shown in FIG. 6, the signature of the title of document a is 1111111000100100, which is split into the four chunks 1111, 1110, 0010, 0100; the signature of the title of document b is 1101111000000000, which is split into the four chunks 1101, 1110, 0000, 0000. As FIG. 6 shows, the second chunk of the signature of document a's title is identical to the second chunk of the signature of document b's title. Document a and document b are therefore clustered into one first cluster.
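The chunk-as-key grouping can be sketched in plain Python as follows, using the 16-bit signatures formed by the four chunks listed in the example. The function names are hypothetical. One assumption goes beyond the text: the key includes the chunk position, because FIG. 6 compares the second chunk of one signature with the second chunk of the other.

```python
from collections import defaultdict
from itertools import combinations


def chunk_keys(signature: str, parts: int):
    # Assumption: the signature length is divisible by the number of parts.
    size = len(signature) // parts
    return [(i, signature[i * size:(i + 1) * size]) for i in range(parts)]


def candidate_pairs(signatures: dict, parts: int = 4):
    buckets = defaultdict(set)
    for doc_id, signature in signatures.items():
        for key in chunk_keys(signature, parts):      # "map" stage
            buckets[key].add(doc_id)
    pairs = set()
    for docs in buckets.values():                     # "reduce" stage
        pairs.update(combinations(sorted(docs), 2))
    return sorted(pairs)


signatures = {
    "a": "1111111000100100",   # chunks 1111, 1110, 0010, 0100
    "b": "1101111000000000",   # chunks 1101, 1110, 0000, 0000
    "c": "0000000111111111",   # hypothetical third document, no shared chunk
}
pairs = candidate_pairs(signatures)
```

Only documents a and b share a chunk (1110 in the second position), so they form the only candidate pair. Splitting into T chunks guarantees that two signatures differing in fewer than T bits share at least one chunk, which is why this step works as a cheap pre-filter before the exact Hamming distance check.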
S122: according to the title signatures of the documents in each first cluster, calculate the Hamming distance between the documents in each first cluster.
If two title signatures differ at one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2; and so on. For example, the signature of the title of document a is 1111111000100100 and the signature of the title of document b is 1101111000000000. The two signatures differ at the 3rd, 11th, and 14th bits, so the Hamming distance between document a and document b is 3.
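The Hamming distance between two equal-length signatures is a direct count of differing positions. The sketch below uses the 16-bit signatures that correspond to the chunks given for documents a and b (1111, 1110, 0010, 0100 and 1101, 1110, 0000, 0000); the function names are hypothetical.

```python
def hamming_distance(sig_a: str, sig_b: str) -> int:
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(sig_a, sig_b))


def differing_positions(sig_a: str, sig_b: str):
    # 1-based positions, counted from the left as in the text.
    return [i + 1 for i, (x, y) in enumerate(zip(sig_a, sig_b)) if x != y]


sig_a = "1111111000100100"
sig_b = "1101111000000000"
distance = hamming_distance(sig_a, sig_b)
positions = differing_positions(sig_a, sig_b)
```

The signatures differ at positions 3, 11, and 14, so the distance is 3.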
S123,筛选出海明距离小于或等于预设阈值的第一簇,所筛选出的多个符合条件的第一簇即为将相似标题的文献进行聚类所得到的多个第一集合。S123: Filter out a first cluster whose Hamming distance is less than or equal to a preset threshold, and select a plurality of eligible first clusters to be a plurality of first sets obtained by clustering documents of similar titles.
在其他实施方式中,所述文献的标题的相似度可以根据文献的标题签名之间的相似度或者文献的标题之间的海明距离来确定。In other embodiments, the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
优选地,作为S120的一种实施方式,在S120中还可包括:Preferably, as an implementation manner of S120, in S120, the method further includes:
(1) Split the title of the document into multiple subtitles, for example at uppercase letters. Compute the length of each subtitle and extract the subtitles whose length exceeds a preset length.
For example, suppose the preset length is 10 characters and the standardized title is "R Genre classification via an lz78-based string kernel". The title is split into "R" and "Genre classification via an lz78-based string kernel". "R" is 1 character long, which is less than 10 characters, so "R" is excluded.
(2) Determine the n-gram features of the extracted subtitles, where n ranges from 1 to N and N is set according to the length of the extracted subtitle.
For example, if the title is "A B C" and N is 3, the features of the title are [A, B, C, AB, BC, ABC].
(3) Compute the signature of the title from the determined n-gram features. The simhash algorithm may be used to compute it; the resulting title signature is an n-bit signature composed of 0s and 1s, for example a 64-bit signature or a 16-bit signature.
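Steps (1) to (3) can be sketched as below. This is only an illustration under stated assumptions: the split is made before every uppercase letter, n-grams are word-level and joined without spaces (as in the "A B C" example), MD5 serves as the per-feature hash inside simhash, and every feature has equal weight. All function names and the default `max_n=3` are hypothetical; the source only says that N depends on the subtitle length.

```python
import hashlib
import re


def split_subtitles(title, min_length=10):
    """Split a title at uppercase letters and keep pieces longer than min_length."""
    pieces = re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", title)
    return [p.strip() for p in pieces if len(p.strip()) > min_length]


def word_ngrams(text, max_n):
    """Return word n-grams for n = 1..max_n, joined without spaces."""
    words = text.split()
    return [
        "".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def simhash(features, bits=64):
    """Return a simhash signature as a string of '0' and '1' characters."""
    counts = [0] * bits
    for feature in features:
        # MD5 is used here only as a convenient feature hash (an assumption).
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for bit in range(bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # Emit the most significant bit first.
    return "".join("1" if counts[bit] > 0 else "0" for bit in reversed(range(bits)))


def title_signature(title, max_n=3, bits=64):
    """Combine the three steps: split, extract n-grams, compute simhash."""
    features = []
    for subtitle in split_subtitles(title):
        features.extend(word_ngrams(subtitle, max_n))
    return simhash(features, bits)


print(word_ngrams("A B C", 3))
print(title_signature("R Genre classification via an lz78-based string kernel"))
```

Because simhash aggregates per-bit votes over all features, titles that share most of their n-grams tend to produce signatures with a small Hamming distance, which is what the later clustering steps rely on.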
Embodiment 1 of the document normalization method described above still has a shortcoming: because of how websites collect, edit, and encode documents, the titles of the same document at different publication sources may differ considerably. To overcome this problem, while the documents are clustered by title, they are clustered in parallel by first author, publication source, and publication time, which compensates for the weakness of clustering by title alone.
FIG. 7 is a schematic flowchart of Embodiment 2 of the document normalization method of the present invention. The document normalization method includes:
S20: Obtain documents from all website sources.
Specifically, the documents are obtained from all websites by means of a web crawler.
S21: Standardize the obtained documents.
In this embodiment of the present invention, standardization means standardizing the attributes of a document. The attributes of a document include the title, author, abstract, publication source, publication time, and so on.
Specifically, standardizing the title includes segmenting the title, unifying half-width and full-width characters, removing punctuation from the title, and so on.
Because different sites may abbreviate author names differently, the authors of the documents need to be standardized. The principle is to extract the full name of the first author of the document, split the full name into words, extract the initial of each word, and finally sort the extracted initials to form the author string corresponding to the document. When the full name is split into words, a run of several uppercase letters abbreviated together is split so that each uppercase letter counts as one word.
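A minimal sketch of the author standardization follows. It assumes that "sorting" the initials means alphabetical ordering, so that different spellings of the same name collapse to one string; the source does not spell out the ordering, and the function name `normalize_author` is hypothetical. Note that a surname written entirely in capitals would also be split letter by letter under this rule.

```python
import re


def normalize_author(full_name):
    """Reduce a first author's name to its sorted, uppercase initials."""
    initials = []
    # Words are maximal runs of letters; punctuation such as "." or "," is dropped.
    for word in re.findall(r"[^\W\d_]+", full_name):
        if word.isupper():
            # A run of capitals such as "MC" counts as one word per letter.
            initials.extend(word)
        else:
            initials.append(word[0].upper())
    # Alphabetical order is an assumption made for this sketch.
    return "".join(sorted(initials))


for name in ["MC Murphy", "Murphy, M. C.", "Michael C. Murphy"]:
    print(name, "->", normalize_author(name))
```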
The principle of standardizing the abstract is to extract the main body of the abstract, compute the length of each sentence in the main body, find the longest sentence, and compute the signature of the document's abstract. In other embodiments, a sentence of another length may be used. The signature of the abstract may be computed with Message-Digest Algorithm 5 (MD5).
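The abstract signature can be illustrated as below. This sketch assumes that the signature is the MD5 digest of the longest sentence, that sentences end at common Western or Chinese terminal punctuation, and that extraction of the main body has already happened; the function name `abstract_signature` is hypothetical.

```python
import hashlib
import re


def abstract_signature(abstract):
    """Return the MD5 hex digest of the longest sentence in an abstract."""
    # Split on runs of sentence-ending punctuation (an assumed sentence boundary rule).
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", abstract) if s.strip()]
    if not sentences:
        return ""
    longest = max(sentences, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()


a = "We study kernels. A string kernel based on LZ78 parsing is proposed for genre classification."
b = "A string kernel based on LZ78 parsing is proposed for genre classification. Results follow."
print(abstract_signature(a) == abstract_signature(b))
```

Hashing only the longest sentence makes the signature tolerant of differences in the surrounding sentences, at the cost of being sensitive to any edit inside that one sentence.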
Publication sources include journals, conferences, collections of papers, and so on. Standardizing the publication source mainly means unifying its format, including unifying letter case, deleting symbols, and unifying half-width and full-width characters.
Standardizing the publication time includes extracting the year from the publication time. On the web, publication times appear in many different time formats, and standardization extracts the year from each of these formats. Besides extracting only the year, the publication times may also be unified into a common representation.
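Year extraction can be sketched with a regular expression. The pattern below is an assumption made for illustration: it accepts a standalone four-digit year between 1800 and 2099 and ignores longer digit runs; the name `extract_year` is hypothetical.

```python
import re

# A four-digit year from 1800 to 2099 that is not part of a longer digit run.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[89]\d{2}|20\d{2})(?!\d)")


def extract_year(published):
    """Return the first plausible year found in a publication-time string, or None."""
    match = YEAR_PATTERN.search(published)
    return int(match.group(1)) if match else None


for text in ["1999-03-12", "March 12, 1999", "12/03/1999", "1999年3月", "unknown"]:
    print(text, "->", extract_year(text))
```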
S22: According to the similarity of the titles of the standardized documents, cluster documents with similar titles to obtain a plurality of first sets; in parallel, according to the similarity of the first author, publication source, and publication year of the standardized documents, cluster similar documents to obtain a plurality of second sets.
S23: Compute the similarity of the documents in each first set and select the qualifying first sets according to the computed similarity; likewise, compute the similarity of the documents in each second set and select the qualifying second sets according to the computed similarity.
Specifically, a weight is preset for each document attribute; the document attributes may be features such as the author, abstract, publication source, and publication time. In each first set and each second set, the similarity of the documents is computed according to the preset attribute weights, and a first set or second set in which the similarity of the documents exceeds a preset total score is determined to be a qualifying first set or second set.
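The weighted screening can be sketched as follows. The source does not give the weights, the scoring formula, or how pairwise scores are combined, so everything here is an assumption: the integer weights in `WEIGHTS` are invented, an attribute contributes its weight only on an exact match, and a set qualifies only if every pair of documents in it scores above the threshold.

```python
from itertools import combinations

# Hypothetical attribute weights; the source only says they are preset.
WEIGHTS = {"author": 30, "abstract": 40, "source": 20, "year": 10}


def pair_score(doc_a, doc_b, weights=WEIGHTS):
    """Sum the weights of the attributes on which two documents agree."""
    return sum(
        weight
        for attr, weight in weights.items()
        if doc_a.get(attr) is not None and doc_a.get(attr) == doc_b.get(attr)
    )


def set_qualifies(documents, min_score, weights=WEIGHTS):
    """A set qualifies if every pair of its documents scores above min_score."""
    return all(
        pair_score(a, b, weights) > min_score
        for a, b in combinations(documents, 2)
    )


doc1 = {"author": "CMM", "abstract": "d41d8c", "source": "journal x", "year": 1999}
doc2 = {"author": "CMM", "abstract": "d41d8c", "source": "conference y", "year": 2000}
doc3 = {"author": "ABZ", "abstract": "ffffff", "source": "journal z", "year": 1999}
print(set_qualifies([doc1, doc2], min_score=60))
print(set_qualifies([doc1, doc3], min_score=60))
```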
S24: For the qualifying first sets and the qualifying second sets, cluster identical documents together and aggregate the publication sources of each identical document. The links to the publication sources of the same document may be aggregated.
Specifically, a key-value pair formation process is executed for each qualifying first set and second set. For one qualifying first set or second set, the key-value pair formation process includes: taking each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs; based on all the key-value pairs obtained, clustering the key-value pairs with the same key into one set; and executing the key-value pair formation process again on each resulting set until a preset number of iterations is reached, so that identical documents in the qualifying first sets and second sets are aggregated into one class. The preset number of iterations is an empirical value.
The MapReduce model may be used to cluster identical documents in the qualifying first sets and second sets. Specifically, each qualifying first set and second set is used as input to the map stage, and the map stage outputs the key-value pairs corresponding to each qualifying first set and second set. The key-value pairs of all qualifying first sets and second sets are sorted by key, and the sorted key-value pairs are used as the input data of the reduce stage. In the reduce stage, key-value pairs with the same key are clustered into one set, so the reduce stage outputs multiple sets. The documents in each set are again formed into key-value pairs and used as input to the reduce stage, and the procedure is iterated until the preset number of iterations is reached. Identical documents in the qualifying first sets and second sets are thereby aggregated into one class, which contains all publication sources of that document.
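The iterative key-value merging can be imitated in a single process, without a MapReduce framework. In this sketch documents are represented by hashable identifiers, the default of three iterations is arbitrary (the source calls the count an empirical value), and the final removal of sets contained in larger ones is an addition made so the output is easy to read; `merge_sets` is a hypothetical name.

```python
from collections import defaultdict


def merge_sets(candidate_sets, iterations=3):
    """Merge sets that share documents by repeated key-value regrouping."""
    groups = {frozenset(s) for s in candidate_sets}
    for _ in range(iterations):
        merged = defaultdict(set)
        for group in groups:
            # "map": each document is a key, the other documents are its value.
            for key in group:
                # "reduce": values that share a key are combined into one set.
                merged[key] |= group - {key}
        groups = {frozenset(values | {key}) for key, values in merged.items()}
    # Drop sets that are strictly contained in a larger set.
    return [set(g) for g in groups if not any(g < other for other in groups)]


result = merge_sets([{"a", "b"}, {"b", "c"}, {"d", "e"}])
print(sorted(sorted(group) for group in result))
```

Each iteration extends every group by one more hop through shared documents, so long chains of overlapping sets need more iterations to collapse into a single class.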
Further, in other embodiments, a search method that uses the document normalization method of this embodiment includes: receiving a keyword entered by a user; matching, according to the keyword, all documents associated with the keyword; and sending to the user all the associated documents together with the aggregated publication sources of each associated document. Specifically, the links to the aggregated publication sources of each associated document are displayed to the user. Links to the different publication sources of the same article are thus gathered in one place for the user, which improves the user experience.
Preferably, as one implementation of S22, the similarity of the document titles is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents. The similarity of the author, publication source, and publication year is determined by first merging the standardized author, publication source, and publication year of a document into a character string, computing the signature of the merged string, and then using the similarity between the signatures of the merged strings and the Hamming distance between the merged strings. S22 then includes:
S220: Compute the signature of the title from the title of each standardized document; merge the first author, publication source, and publication year of the standardized document into a character string, and compute the signature of the merged string.
For example, if the first author of a document is MC Murphy, the publication source is Journal of Applied Computational Mathematics, and the publication time is 1999, the merged string is MC Murphy/Journal of Applied Computational Mathematics/1999.
S221: According to the title signature of each document, cluster documents whose titles are similar to obtain a plurality of first clusters; according to the signature of each document's merged string, cluster documents whose merged strings are similar to obtain a plurality of second clusters. Each first cluster or second cluster contains at least two documents.
Specifically, a key-value pair formation process is executed on the signature of each title: the title signature is first split into T parts, where T is a preset value; each part is used as a key and the title signature as the value, so that the title corresponds to T key-value pairs. In this way, every title corresponds to T key-value pairs. When the T key-value pairs of two titles share at least one key, the documents corresponding to the two titles are clustered into one first cluster, which is output. The same method is applied to the signature of each document's merged string: when the T key-value pairs of two merged strings share at least one key, the documents corresponding to the two merged strings are clustered into one second cluster, which is output.
The MapReduce model may be used to cluster documents with similar titles into a plurality of first clusters. The MapReduce model includes a map stage and a reduce stage: input data is processed by map and then by reduce to produce the output data, and the output of the map stage takes the form of key-value pairs. The T parts of each title are used as input to the map stage, and the map stage outputs the T key-value pairs corresponding to each title. In the reduce stage, when the T key-value pairs of two titles share at least one key, the documents corresponding to the two titles are clustered into one first cluster, which is output. Likewise, the MapReduce model may be used to cluster documents whose merged strings are similar into a plurality of second clusters.
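The chunk-based candidate grouping can be sketched in one process as below. One detail is an assumption not stated in the source: each key combines the chunk with its position in the signature, so that equal bit patterns at different positions do not collide. The names `chunk_keys` and `candidate_pairs` and the choice of T = 4 are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def chunk_keys(signature, parts):
    """Split a bit-string signature into `parts` equal chunks keyed by position."""
    if len(signature) % parts != 0:
        raise ValueError("signature length must be divisible by the number of parts")
    size = len(signature) // parts
    return [(i, signature[i * size:(i + 1) * size]) for i in range(parts)]


def candidate_pairs(signatures, parts=4):
    """Return pairs of document ids whose signatures share at least one chunk key."""
    buckets = defaultdict(set)
    # "map": emit T (key, document) pairs per signature.
    for doc_id, signature in signatures.items():
        for key in chunk_keys(signature, parts):
            buckets[key].add(doc_id)
    # "reduce": documents that land under the same key become a candidate pair.
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


signatures = {
    "a": "1111111000100100",
    "b": "1101111000000000",
    "c": "0000000111011011",
}
print(candidate_pairs(signatures, parts=4))
```

With position-keyed chunks, two signatures whose Hamming distance is less than T must agree on at least one whole chunk, so no pair within that distance is missed; the exact distance check in the next step then removes false candidates.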
S222: According to the title signatures of the documents in each first cluster, compute the Hamming distance between the documents in that first cluster; according to the merged-string signatures of the documents in each second cluster, compute the Hamming distance between the documents in that second cluster.
If two title signatures differ at one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2, and so on. The same holds for the signatures of two merged strings: if they differ at one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2, and so on.
S223: Select the first clusters whose Hamming distance is less than or equal to a preset threshold; the qualifying first clusters are the plurality of first sets obtained by clustering documents with similar titles. Select the second clusters whose Hamming distance is less than or equal to a preset threshold; the qualifying second clusters are the plurality of second sets obtained by clustering the documents that correspond to similar merged strings.
In other embodiments, the similarity of the document titles may be determined from either the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents. The similarity of the author, publication source, and publication year may be determined from either the similarity between the signatures of the merged strings or the Hamming distance between the merged strings.
Preferably, as one implementation of S220, S220 may further include:
(1) Split the title of the document into multiple subtitles, for example at uppercase letters. Compute the length of each subtitle and extract the subtitles whose length exceeds a preset length.
(2) Split the merged string into multiple substrings, compute the length of each substring, and extract the substrings whose length exceeds a preset length.
(3) Determine the n-gram features of the extracted subtitles, where n ranges from 1 to N and N is set according to the length of the extracted subtitle.
(4) Determine the n-gram features of the extracted substrings.
(5) Compute the signature of the title from the n-gram features determined for the subtitles. The simhash algorithm may be used to compute it; the resulting title signature is an n-bit signature composed of 0s and 1s.
(6) Compute the signature of the merged string from the n-gram features determined for the extracted substrings. The simhash algorithm may be used to compute it; the resulting signature of the merged string is an n-bit signature composed of 0s and 1s.
Embodiment 3
FIG. 8 is a schematic structural block diagram of an embodiment of the document normalization apparatus of the present invention. As shown in FIG. 8, the apparatus includes an obtaining unit 100, a standardization unit 101, a first clustering unit 102, a first screening unit 103, and a second clustering unit 104.
The obtaining unit 100 is configured to obtain documents from all website sources.
Specifically, the documents are obtained from all websites by means of a web crawler.
The standardization unit 101 is configured to standardize the obtained documents.
In this embodiment of the present invention, standardization means standardizing the attributes of a document. The attributes of a document include the title, author, abstract, publication source, publication time, and so on.
Specifically, the standardization of the title by the standardization unit 101 includes segmenting the title, unifying half-width and full-width characters, removing punctuation from the title, and so on.
The principle by which the standardization unit 101 standardizes the author is to extract the full name of the first author of the document, split the full name into words, extract the initial of each word, and finally sort the extracted initials to form the author string corresponding to the document. When the full name is split into words, a run of several uppercase letters abbreviated together is split so that each uppercase letter counts as one word.
The principle by which the standardization unit 101 standardizes the abstract is to extract the main body of the abstract, compute the length of each sentence in the main body, find the longest sentence, and compute the signature of the document's abstract. In other embodiments, a sentence of another length may be used. The signature of the abstract may be computed with Message-Digest Algorithm 5 (MD5).
Publication sources include journals, conferences, collections of papers, and so on. The standardization of the publication source by the standardization unit 101 mainly means unifying its format, including unifying letter case, deleting symbols, and unifying half-width and full-width characters.
The standardization of the publication time by the standardization unit 101 includes extracting the year from the publication time. On the web, publication times appear in many different time formats, and the standardization unit 101 can extract the year from each of these formats. Besides extracting only the year, the publication times may also be unified into a common representation.
The first clustering unit 102 is configured to cluster documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of first sets. Each first set includes at least two documents.
The first screening unit 103 is configured to compute the similarity of the documents in each first set and to select the qualifying first sets according to the computed similarity.
Specifically, the first screening unit 103 is configured to preset a weight for each document attribute; the document attributes may be features such as the author, abstract, publication source, and publication time. In each first set, the similarity of the documents is computed according to the preset attribute weights, and a first set in which the similarity of the documents exceeds a preset total score is determined to be a qualifying first set.
The second clustering unit 104 is configured to cluster identical documents in the qualifying first sets and to aggregate the publication sources of each identical document. The links to the publication sources of the same document may be aggregated.
Specifically, the second clustering unit 104 is configured to execute a key-value pair formation process for each qualifying first set. For one qualifying first set, the key-value pair formation process includes: taking each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs; based on all the key-value pairs obtained, clustering the key-value pairs with the same key into one set; and executing the key-value pair formation process again on each resulting set until a preset number of iterations is reached, so that identical documents in the qualifying first sets are aggregated into one class.
The MapReduce model may be used to cluster identical documents in the qualifying first sets. Specifically, each qualifying first set is used as input to the map stage, and the map stage outputs the key-value pairs corresponding to each qualifying first set. The key-value pairs of all qualifying first sets are sorted by key, and the sorted key-value pairs are used as the input data of the reduce stage. In the reduce stage, key-value pairs with the same key are clustered into one set, so the reduce stage outputs multiple sets. The documents in each set are again formed into key-value pairs and used as input to the reduce stage, and the procedure is iterated until the preset number of iterations is reached. Identical documents in the qualifying first sets are thereby aggregated into one class, which contains all publication sources of that document.
FIG. 9 is a schematic structural diagram of an embodiment of the first clustering unit 102 of the present invention. The first clustering unit 102 includes a signature calculation unit 1020, a signature clustering unit 1021, a distance calculation unit 1022, and a second screening unit 1023.
The signature calculation unit 1020 is configured to compute the signature of the title from the title of each standardized document.
The signature clustering unit 1021 is configured to cluster documents whose titles are similar, according to the title signature of each document, to obtain a plurality of first clusters. Each first cluster contains at least two documents.
Specifically, the signature clustering unit 1021 is configured to execute a key-value pair formation process on the signature of each title: the title signature is first split into T parts, where T is a preset value; each part is used as a key and the title signature as the value, so that the title corresponds to T key-value pairs. In this way, every title corresponds to T key-value pairs. When the T key-value pairs of two titles share at least one key, the documents corresponding to the two titles are clustered into one first cluster, which is output.
The MapReduce model may be used to cluster documents with similar titles into a plurality of first clusters. The MapReduce model includes a map stage and a reduce stage: input data is processed by map and then by reduce to produce the output data, and the output of the map stage takes the form of key-value pairs. The T parts of each title are used as input to the map stage, and the map stage outputs the T key-value pairs corresponding to each title. In the reduce stage, when the T key-value pairs of two titles share at least one key, the documents corresponding to the two titles are clustered into one first cluster, which is output.
The distance calculation unit 1022 is configured to compute the Hamming distance between the documents in each first cluster according to the title signatures of the documents in that first cluster.
The second screening unit 1023 selects the first clusters whose Hamming distance is less than or equal to a preset threshold. The qualifying first clusters are the plurality of first sets obtained by clustering documents with similar titles.
In the above embodiment, the similarity of the document titles is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents. In other embodiments, the similarity of the document titles may be determined from either the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
FIG. 10 is a schematic structural diagram of an embodiment of the signature calculation unit of the present invention. The signature calculation unit 1020 includes an extraction unit 10201, a determination unit 10202, and a calculation unit 10203.
The extraction unit 10201 is configured to split the title of the document into multiple subtitles, for example at uppercase letters, compute the length of each subtitle, and extract the subtitles whose length exceeds a preset length.
The determination unit 10202 is configured to determine the n-gram features of the extracted subtitles, where n ranges from 1 to N and N is set according to the length of the extracted subtitle.
The calculation unit 10203 is configured to compute the signature of the title from the determined n-gram features. The simhash algorithm may be used to compute it; the resulting title signature is an n-bit signature composed of 0s and 1s, for example a 64-bit signature or a 16-bit signature.
实施例四Embodiment 4
参照图8所示,所述装置中的获取单元100、标准化单元101、第一聚类单元102、第一筛选单元103及第二聚类单元104还用于实施例四 中。具体如下:Referring to FIG. 8 , the acquiring unit 100, the normalizing unit 101, the first clustering unit 102, the first screening unit 103, and the second clustering unit 104 in the device are also used in the fourth embodiment. in. details as follows:
The acquisition unit 100 is configured to acquire documents from all website sources.
Specifically, the documents are acquired from all websites by means of a web crawler.
The standardization unit 101 is configured to standardize the acquired documents.
In the embodiments of the present invention, standardization means standardizing the attributes of a document. The attributes of a document include the title, author, abstract, publication source, publication time, and the like.
Specifically, the standardization of the title by the standardization unit 101 includes segmenting the title, unifying half-width and full-width characters, removing punctuation from the title, and the like.
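The title standardization steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `normalize_title` is hypothetical, Unicode NFKC folding is assumed as the way to unify full-width and half-width characters, and punctuation is assumed to be replaced by spaces rather than deleted outright.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Unify character width and strip punctuation from a title (illustrative)."""
    # NFKC maps full-width Latin letters, digits, and symbols to half-width forms.
    folded = unicodedata.normalize("NFKC", title)
    # Replace every punctuation character (Unicode categories P*) with a space.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in folded
    )
    # Collapse runs of whitespace left behind by the removed punctuation.
    return " ".join(cleaned.split())
```

Case is deliberately left unchanged here, because a later step in the description splits titles at uppercase letters.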
The standardization unit 101 standardizes the author as follows: it extracts the full name of the first author of the document, splits the full name into a plurality of words, extracts the initial of each word, and finally sorts all extracted initials to form the author corresponding to the document. When the full name is split into words and several uppercase letters are abbreviated together, each uppercase letter is treated as a separate word.
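A minimal sketch of this author standardization is given below. The function name `normalize_author` and the choice of separators (whitespace, periods, commas, hyphens) are assumptions made for illustration; the description does not specify them.

```python
import re


def normalize_author(full_name: str) -> str:
    """Reduce a first author's full name to its sorted initials (illustrative)."""
    initials = []
    for token in re.split(r"[\s.,\-]+", full_name.strip()):
        if not token:
            continue
        if len(token) > 1 and token.isupper():
            # Several capitals abbreviated together: each capital is its own word.
            initials.extend(token)
        else:
            initials.append(token[0].upper())
    # Sorting makes the result independent of given-name/family-name order.
    return "".join(sorted(initials))
```

Under these assumptions, "J.K. Rowling" and "Rowling JK" both reduce to the same key, which is what allows differently formatted author names to be compared.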
The standardization unit 101 standardizes the abstract as follows: it extracts the main body of the abstract, calculates the length of each sentence in the main body, finds the longest sentence, and calculates the signature of the abstract from that sentence. In other embodiments, a sentence of another length may be used instead. The signature of the abstract may be calculated with the Message Digest Algorithm 5 (MD5).
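The abstract signature can be sketched as follows, assuming the input is already the main body of the abstract and that sentences end at common Western or Chinese terminators. The function name `abstract_signature` and the sentence-splitting rule are illustrative assumptions.

```python
import hashlib
import re
from typing import Optional


def abstract_signature(abstract_body: str) -> Optional[str]:
    """Return the MD5 hex digest of the longest sentence, or None if empty."""
    # Assumed sentence terminators: ., !, ? and their Chinese counterparts.
    sentences = [
        s.strip() for s in re.split(r"[.!?\u3002\uff01\uff1f]+", abstract_body)
    ]
    sentences = [s for s in sentences if s]
    if not sentences:
        return None
    longest = max(sentences, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()
```

Because only the longest sentence is hashed, two copies of an abstract that differ in a shorter sentence still receive the same signature in this sketch.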
Publication sources include journals, conferences, document collections, and the like. The standardization of the publication source by the standardization unit 101 mainly unifies the format of the publication source, including unifying letter case, deleting symbols, and unifying half-width and full-width characters.
The standardization of the publication time by the standardization unit 101 includes extracting the year from the publication time. On the network, publication times appear in many different time formats, and the standardization unit 101 can extract the year from each of these formats. Of course, instead of extracting only the year, the publication times may also be unified into the same representation.
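Year extraction from heterogeneous time formats can be sketched with a regular expression. The function name `extract_year` is hypothetical, and restricting matches to four-digit years beginning with 19 or 20 is an assumption made to avoid mistaking day or month numbers for years.

```python
import re
from typing import Optional

# Four-digit year starting with 19 or 20, not embedded in a longer digit run.
_YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def extract_year(published: str) -> Optional[int]:
    """Return the first plausible publication year found in the string."""
    match = _YEAR.search(published)
    return int(match.group()) if match else None
```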
The first clustering unit 102 is configured to cluster documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of first sets, and, in parallel, to cluster similar documents, according to the similarity of the first author, publication source, and publication year of the standardized documents, to obtain a plurality of second sets.
The first screening unit 103 is configured to calculate the similarity of the documents in each first set and to screen out a plurality of qualifying first sets according to the calculated similarity, and to calculate the similarity of the documents in each second set and to screen out a plurality of qualifying second sets according to the calculated similarity.
Specifically, the first screening unit 103 is configured to: in each first set and each second set, calculate the similarity between the documents in the set according to the preset weights corresponding to the document attributes, and determine a first set or second set in which the similarity between the documents is greater than a preset total score to be a qualifying first set or second set.
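The weighted screening step can be sketched as below. The attribute names, the integer weights, exact-match comparison of attribute values, and the rule that every pair in a set must exceed the threshold are all assumptions for illustration; the description only states that preset weights and a preset total score are used.

```python
from itertools import combinations

# Hypothetical weights for the standardized attributes.
WEIGHTS = {"title": 4, "author": 2, "source": 2, "year": 2}


def pair_score(a: dict, b: dict, weights: dict = WEIGHTS) -> int:
    """Sum the weights of the attributes on which two documents agree."""
    return sum(
        w
        for attr, w in weights.items()
        if a.get(attr) is not None and a.get(attr) == b.get(attr)
    )


def set_qualifies(docs: list, threshold: int, weights: dict = WEIGHTS) -> bool:
    """A set qualifies if every pair of its documents scores above the threshold."""
    return all(
        pair_score(a, b, weights) > threshold for a, b in combinations(docs, 2)
    )
```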
The second clustering unit 104 is configured to cluster identical documents across the plurality of qualifying first sets and the plurality of qualifying second sets that were screened out, and to aggregate the publication sources of identical documents. The links to the publication sources of identical documents may be aggregated.
Specifically, the second clustering unit 104 is configured to perform a key-value pair forming process for each qualifying first set and second set that was screened out. For one qualifying first set or second set, the key-value pair forming process includes: taking each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs; clustering, over all key-value pairs obtained, the key-value pairs with the same key into one set; and performing the key-value pair forming process again on each resulting set, until a preset number of iterations is reached, so that identical documents in the qualifying first sets and second sets are aggregated into one class.
The MapReduce model may be used to cluster identical documents in the qualifying first sets and second sets. Specifically, each qualifying first set and second set is taken as input to the map stage, and the map stage outputs the key-value pairs corresponding to each qualifying first set and second set. The keys of all these key-value pairs are sorted, and the sorted key-value pairs are used as input data for the reduce stage. In the reduce stage, key-value pairs with the same key are clustered into one set, so the reduce stage outputs a plurality of sets. The documents in each set again form a plurality of key-value pairs that are fed back as input, and the process is iterated until the preset number of iterations is reached. Identical documents in the qualifying first sets and second sets are thereby aggregated into one class, which includes all publication sources of the document.
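The iterative key-value merging described above can be sketched in a single process, without an actual MapReduce framework. This is only an illustration: the function name `merge_sets` is hypothetical, documents are represented by identifiers, and each value is assumed to include the key itself so that the key stays in the merged set.

```python
from collections import defaultdict


def merge_sets(doc_sets, iterations=3):
    """Merge overlapping document sets by repeated key-value grouping."""
    current = [frozenset(s) for s in doc_sets]
    for _ in range(iterations):
        grouped = defaultdict(set)
        # "Map": every document in a set becomes a key whose value is the set.
        for members in current:
            for key in members:
                grouped[key] |= members
        # "Reduce": values sharing a key were merged above; drop duplicate sets.
        current = list({frozenset(v) for v in grouped.values()})
    return current
```

Each pass roughly doubles how far a merge can reach along a chain of overlapping sets, which is why a small preset iteration count can suffice in this sketch.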
Preferably, the similarity of the document titles is determined according to both the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents. The similarity of the author, publication source, and publication year of the documents is determined by first merging the author, publication source, and publication year of each standardized document into a character string, calculating the signature of the merged string, and then using both the similarity between the signatures of the merged strings and the Hamming distance between the merged strings. As shown in FIG. 9, the signature calculation unit 1020, the signature clustering unit 1021, the distance calculation unit 1022, and the second screening unit 1023 of the first clustering unit 102 are also used in the following implementation, as follows:
The signature calculation unit 1020 is configured to calculate the signature of each document title from the title of the standardized document, and to merge the first author, publication source, and publication year of the standardized document into a character string and calculate the signature of the merged string.
The signature clustering unit 1021 is configured to cluster two documents with similar titles, according to the title signature of each document, to obtain a plurality of first clusters, and to cluster two documents with similar merged strings, according to the signature of the merged string of each document, to obtain a plurality of second clusters. Each first cluster or second cluster includes at least two documents.
Specifically, the signature clustering unit 1021 is configured to perform a key-value pair forming process on the signature of each title: the signature of the title is first split into T blocks, where T is a preset value; each block is used as a key and the signature of the title as the value, so that the title corresponds to T key-value pairs. In this way every title corresponds to T key-value pairs. When the T key-value pairs of two titles have at least one key in common, the documents corresponding to the two titles are clustered into one first cluster and output. The same method is applied to the signature of the merged string of each document: when the T key-value pairs of two merged strings have at least one key in common, the documents corresponding to the two merged strings are clustered into one second cluster and output.
The MapReduce model may be used to cluster documents with similar titles into a plurality of first clusters. The MapReduce model includes a map stage and a reduce stage: input data is processed by map and then by reduce to produce the output data, and the output of the map stage takes the form of key-value pairs. The T blocks of each title are used as input to the map stage, and the map stage outputs the T key-value pairs corresponding to each title. In the reduce stage, when the T key-value pairs of two titles have at least one key in common, the documents corresponding to the two titles are clustered into one first cluster and output. Similarly, the MapReduce model may be used to cluster two documents with similar merged strings to obtain a plurality of second clusters.
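The block-based candidate generation can be sketched as follows, again in a single process rather than on MapReduce. The names `split_signature` and `candidate_pairs` are hypothetical, signatures are assumed to be integers, and the block position is included in each key (an assumption, so that equal bit patterns at different positions do not collide).

```python
from collections import defaultdict
from itertools import combinations


def split_signature(sig: int, bits: int = 64, t: int = 4):
    """Split a signature into t equal-width blocks, each tagged with its index."""
    width = bits // t
    mask = (1 << width) - 1
    return [(i, (sig >> (i * width)) & mask) for i in range(t)]


def candidate_pairs(signatures: dict, bits: int = 64, t: int = 4):
    """Return pairs of document ids whose signatures share at least one block."""
    buckets = defaultdict(list)
    # "Map": emit (block key -> document id) for each of the t blocks.
    for doc_id, sig in signatures.items():
        for key in split_signature(sig, bits, t):
            buckets[key].append(doc_id)
    # "Reduce": documents falling into the same bucket form candidate clusters.
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

If two signatures differ in fewer than T bit positions, at least one of the T blocks must be identical, so such pairs always share a key under this scheme.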
The distance calculation unit 1022 is configured to calculate the Hamming distance between the documents in each first cluster according to the title signatures of the documents in that cluster, and to calculate the Hamming distance between the documents in each second cluster according to the signatures of the merged strings of the documents in that cluster.
If two title signatures differ at one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2, and so on. Likewise, if the signatures of two merged strings differ at one position, the Hamming distance is 1; if they differ at two positions, the Hamming distance is 2, and so on.
The second screening unit 1023 is configured to screen out the first clusters whose Hamming distance is less than or equal to a preset threshold; the qualifying first clusters screened out are the plurality of first sets obtained by clustering documents with similar titles. It is also configured to screen out the second clusters whose Hamming distance is less than or equal to a preset threshold; the qualifying second clusters screened out are the plurality of second sets obtained by clustering the documents corresponding to similar merged strings.
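The Hamming distance and threshold screening described above can be sketched as follows. Signatures are assumed to be integers, and a cluster is assumed to be kept only if every pair of its documents is within the threshold; the function names are illustrative.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions at which two signatures differ."""
    return bin(a ^ b).count("1")


def filter_clusters(clusters, signatures: dict, threshold: int):
    """Keep clusters whose pairwise Hamming distances are all <= threshold."""
    return [
        cluster
        for cluster in clusters
        if all(
            hamming(signatures[x], signatures[y]) <= threshold
            for x, y in combinations(cluster, 2)
        )
    ]
```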
In other implementations, the similarity of the document titles may be determined according to either the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents. The similarity of the author, publication source, and publication year of the documents may be determined according to either the similarity between the signatures of the merged strings of the documents or the Hamming distance between the merged strings of the documents.
As shown in FIG. 10, the extraction unit 10201, the determining unit 10202, and the calculation unit 10203 of the signature calculation unit 1020 are also used in the following implementation, as follows:
The extraction unit 10201 is configured to split the title of a document into a plurality of subtitles, for example at uppercase letters, to calculate the length of each subtitle, and to extract the subtitles whose length exceeds a preset length.
The extraction unit 10201 is further configured to split the merged string into a plurality of substrings, to calculate the length of each substring, and to extract the substrings whose length exceeds a preset length.
The determining unit 10202 is configured to determine the n-gram features of the extracted subtitles, where n takes values from 1 to N and N is set according to the length of the extracted subtitles.
The determining unit 10202 is further configured to determine the n-gram features of the extracted substrings.
The calculation unit 10203 is configured to calculate the signature of the document title from the determined n-gram features. The signature may be calculated with the simhash algorithm, and the resulting title signature is an n-bit signature consisting of 0s and 1s.
The calculation unit 10203 is further configured to calculate the signature of the merged string of the document from the n-gram features determined for the extracted substrings. The signature of the merged string may be calculated with the simhash algorithm, and the resulting signature is an n-bit signature consisting of 0s and 1s.
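The extraction, n-gram, and simhash steps described above can be sketched together as follows. Several details are assumptions made for illustration and are not stated in the description: character-level n-grams, equal feature weights, MD5 as the per-feature hash, and all function names and default parameters.

```python
import hashlib
import re


def split_title(title: str, min_len: int = 3):
    """Split at uppercase letters and keep subtitles longer than min_len."""
    parts = re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", title)
    return [p.strip() for p in parts if len(p.strip()) > min_len]


def char_ngrams(text: str, max_n: int):
    """All character n-grams of text for n = 1 .. max_n."""
    return [
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


def simhash(features, bits: int = 64) -> int:
    """Equal-weight simhash: per-bit vote over the hashes of all features."""
    votes = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    sig = 0
    for b in range(bits):
        if votes[b] > 0:
            sig |= 1 << b
    return sig


def title_signature(title: str, min_len: int = 3, max_n: int = 3, bits: int = 64) -> int:
    """Signature of a title built from the n-grams of its long subtitles."""
    features = []
    for sub in split_title(title, min_len):
        # N is capped by the subtitle length, as the description suggests.
        features.extend(char_ngrams(sub, min(max_n, len(sub))))
    return simhash(features, bits)
```

The same functions could be applied to the merged author, source, and year string by substituting a suitable splitting rule for `split_title`.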
In the above four embodiments, the terms "first set" and "second set" are merely different labels used to distinguish the document sets obtained in the two ways.
In other embodiments, an apparatus that performs searching by using the document normalization method of Embodiment 1 or Embodiment 2 includes, as shown in FIG. 11, a receiving unit 200, a matching unit 201, and a presentation unit 202.
The receiving unit 200 is configured to receive a keyword input by a user.
The matching unit 201 is configured to match, according to the keyword, all documents associated with the keyword.
The presentation unit 202 is configured to send to the user all the associated documents together with the aggregated publication sources of each associated document.
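The interaction of the three units can be sketched as a single function over already normalized documents. The record layout (`title`, `sources`), the function name `search`, and matching by case-insensitive substring of the title are all assumptions; the description does not specify how keywords are matched.

```python
def search(keyword: str, normalized_docs: list) -> list:
    """Return matching documents, each with its aggregated publication sources."""
    needle = keyword.lower()
    results = []
    for doc in normalized_docs:
        # Matching step: a simple title substring test stands in for retrieval.
        if needle in doc["title"].lower():
            # Presentation step: one entry per document, sources de-duplicated.
            results.append(
                {"title": doc["title"], "sources": sorted(set(doc["sources"]))}
            )
    return results
```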
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (24)

  1. A document normalization method, comprising:
    acquiring documents from more than one website source;
    standardizing the acquired documents;
    clustering documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of document sets;
    calculating the similarity of the documents in each document set, and screening out qualifying document sets according to the calculated similarity of the documents; and
    clustering identical documents in the qualifying document sets that were screened out, and aggregating the publication sources of identical documents.
  2. The method according to claim 1, wherein the similarity of the titles of the documents is determined in at least one of the following ways:
    calculating a signature for the title of each document, and calculating the similarity between the title signatures of the documents;
    calculating the Hamming distance between the titles of the documents, and determining the similarity between the document titles according to the Hamming distance.
  3. The method according to claim 1, wherein, before the calculating the similarity of the documents in each document set, the method further comprises:
    clustering similar documents, according to the similarity of at least one attribute among the author, publication source, and publication year of the standardized documents, to obtain a plurality of document sets.
  4. The method according to claim 3, wherein the similarity of the at least one attribute among the author, publication source, and publication year of the standardized documents is determined in at least one of the following ways:
    merging the author, publication source, and publication year of each standardized document into a character string, calculating the signature of the merged string, and calculating the similarity between the signatures of the merged strings of the documents;
    merging the author, publication source, and publication year of each standardized document into a character string, calculating the Hamming distance between the merged strings, and determining the similarity of the author, publication source, and publication year of the documents according to the Hamming distance.
  5. The method according to claim 1, wherein, after the plurality of document sets are obtained and before the similarity of the documents in each document set is calculated, the method further comprises:
    screening out, based on the Hamming distance between the documents in each document set, the document sets whose Hamming distance is less than or equal to a preset threshold.
  6. The method according to claim 1, wherein the screening out qualifying document sets according to the calculated similarity of the documents comprises:
    in each document set, calculating the similarity between the documents in the set according to preset weights corresponding to the document attributes, and determining a document set in which the similarity between the documents is greater than a preset total score to be a qualifying document set.
  7. The method according to claim 1, wherein the clustering identical documents in the qualifying document sets that were screened out comprises:
    performing a key-value pair forming process for each qualifying document set that was screened out, the key-value pair forming process comprising: taking each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs;
    clustering, over all key-value pairs obtained, the key-value pairs with the same key into one set; and
    performing the key-value pair forming process again on each resulting set, until a preset number of iterations is reached.
  8. The method according to claim 1, wherein the standardizing comprises:
    performing word segmentation on the full name of the first author of the document, extracting the initial of each word, and using the combination of the extracted initials as the standardized document author; or
    extracting the longest sentence in the main body of the document abstract, and calculating the signature of the longest sentence; or
    unifying the format of the document source; or
    unifying the format of the document publication time, or extracting only the year of the document publication time.
  9. The method according to claim 2, wherein the calculating a signature for the title of each document comprises:
    splitting the title of the document into a plurality of subtitles, calculating the length of each subtitle, and extracting the subtitles whose length exceeds a preset length;
    determining the n-gram features of the extracted subtitles, where n is a positive integer from 1 to N and N is a preset positive integer; and
    calculating the signature of the title of the document from the determined n-gram features.
  10. A document search method, comprising:
    receiving a keyword input by a user;
    searching, according to the keyword, for documents associated with the keyword; and
    in the search results, presenting identical documents in aggregated form and presenting the publication sources of each document;
    wherein identical documents are normalized by the method according to any one of claims 1 to 9.
  11. A document normalization apparatus, comprising:
    an acquisition unit, configured to acquire documents from more than one website source;
    a standardization unit, configured to standardize the acquired documents;
    a first clustering unit, configured to cluster documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of document sets;
    a first screening unit, configured to calculate the similarity of the documents in each document set and to screen out qualifying document sets according to the calculated similarity of the documents; and
    a second clustering unit, configured to cluster identical documents in the qualifying document sets that were screened out, and to aggregate the publication sources of identical documents.
  12. The apparatus according to claim 11, wherein the first clustering unit determines the similarity of the titles of the documents in at least one of the following ways:
    calculating a signature for the title of each document, and calculating the similarity between the title signatures of the documents;
    calculating the Hamming distance between the titles of the documents, and determining the similarity between the document titles according to the Hamming distance.
  13. The apparatus according to claim 11, wherein the first clustering unit is further configured to, before the similarity of the documents in each document set is calculated, cluster similar documents, according to the similarity of at least one attribute among the author, publication source, and publication year of the standardized documents, to obtain a plurality of document sets.
  14. The apparatus according to claim 13, wherein the first clustering unit determines the similarity of the at least one attribute in at least one of the following ways:
    merging the author, publication source, and publication year of each standardized document into a character string, calculating the signature of the merged string, and calculating the similarity between the signatures of the merged strings of the documents;
    merging the author, publication source, and publication year of each standardized document into a character string, calculating the Hamming distance between the merged strings, and determining the similarity of the author, publication source, and publication year of the documents according to the Hamming distance.
  15. The apparatus according to claim 11, further comprising:
    a second screening unit, configured to, after the plurality of document sets are obtained and before the similarity of the documents in each document set is calculated, screen out, based on the Hamming distance between the documents in each document set, the document sets whose Hamming distance is less than or equal to a preset threshold.
  16. The apparatus according to claim 11, wherein the first screening unit is specifically configured to, in each document set, calculate the similarity between the documents in the set according to preset weights corresponding to the document attributes, and determine a document set in which the similarity between the documents is greater than a preset total score to be a qualifying document set.
  17. The apparatus according to claim 11, wherein the second clustering unit, when clustering identical documents in the qualifying document sets that were screened out, specifically performs:
    performing a key-value pair forming process for each qualifying document set that was screened out, the key-value pair forming process comprising: taking each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs;
    clustering, over all key-value pairs obtained, the key-value pairs with the same key into one set; and
    performing the key-value pair forming process again on each resulting set, until a preset number of iterations is reached.
  18. The apparatus according to claim 11, wherein the standardization unit is specifically configured to:
    perform word segmentation on the full name of the first author of a document, extract the first letter of each word, and use the combination of the extracted first letters as the standardized document author; or
    extract the longest sentence in the main body of the document abstract and calculate a signature of that longest sentence; or
    unify the format of the document source; or
    unify the format of the document publication time, or extract only the year of the document publication time.
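The standardization options in claim 18 might look like the following. This is a hedged sketch: whitespace splitting stands in for word segmentation, the whole abstract string is treated as its main body, SHA-256 is an arbitrary choice of signature function, and all function names are hypothetical.

```python
import hashlib
import re


def author_initials(full_name):
    """Combine the first letter of each word of the first author's full name."""
    return "".join(word[0].upper() for word in full_name.split())


def longest_sentence_signature(abstract):
    """Hash the longest sentence of the abstract (case-insensitive)."""
    sentences = [s.strip() for s in re.split(r"[.!?。!?]", abstract) if s.strip()]
    if not sentences:
        return ""
    longest = max(sentences, key=len)
    return hashlib.sha256(longest.lower().encode("utf-8")).hexdigest()


def publication_year(date_text):
    """Extract a four-digit year (19xx or 20xx) from a free-form date string."""
    match = re.search(r"(?:19|20)\d{2}", date_text)
    return match.group(0) if match else None
```

Reducing each attribute to a short canonical form makes records of the same document from different websites compare equal even when the sites format names and dates differently.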
  19. The apparatus according to claim 12, wherein, when calculating a signature for the title of a document, the first clustering unit specifically performs:
    splitting the title of the document into a plurality of subtitles, calculating the length of each subtitle, and extracting the subtitles whose length is greater than a preset length;
    determining n-gram features of the extracted subtitles, where n takes each positive integer value from 1 to N, and N is a preset positive integer;
    calculating the signature of the title of the document according to the determined n-gram features.
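The three steps of the title signature can be sketched as below. The claim does not say how the n-gram features are combined into a signature; this example swaps in a simhash-style bit vote as an assumption, chosen because it yields fingerprints that can be compared by Hamming distance. The delimiter set, the word-level n-grams, the default lengths, and the function names are likewise hypothetical.

```python
import hashlib
import re


def split_subtitles(title, min_length):
    """Split a title on common delimiters and keep parts longer than min_length."""
    parts = [p.strip() for p in re.split(r"[:;?!:;?!]|\s-\s", title)]
    return [p for p in parts if len(p) > min_length]


def ngram_features(subtitle, max_n):
    """Word n-grams of a subtitle for every n from 1 to max_n."""
    tokens = subtitle.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def title_signature(title, min_length=5, max_n=2, bits=64):
    """Simhash-style signature: each feature votes +1 or -1 on every bit."""
    votes = [0] * bits
    for subtitle in split_subtitles(title, min_length):
        for feature in ngram_features(subtitle, max_n):
            digest = hashlib.sha256(feature.encode("utf-8")).digest()
            value = int.from_bytes(digest[: bits // 8], "big")
            for i in range(bits):
                votes[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i, vote in enumerate(votes) if vote > 0)
```

Dropping short subtitles removes fragments such as "A Survey" that many unrelated titles share, so the signature is dominated by the distinctive part of the title.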
  20. A document search apparatus, comprising:
    a receiving unit, configured to receive a keyword input by a user;
    a matching unit, configured to search, according to the keyword, for documents associated with the keyword;
    a presentation unit, configured to present identical documents in aggregated form in the search results and to present the publication sources of each document, wherein the identical documents are normalized by the apparatus according to any one of claims 11 to 19.
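The aggregated presentation performed by the presentation unit could be sketched as follows, assuming the normalization stage has already assigned a shared identifier to records of the same document. The field names `cluster_id`, `title`, and `source` are hypothetical and do not come from the patent.

```python
def aggregate_results(results):
    """Collapse hits for the same document into one entry listing all sources.

    Assumption: each hit is a dict with hypothetical keys 'cluster_id',
    'title' and 'source'; hits sharing a cluster_id are the same document.
    """
    merged = {}
    for hit in results:
        entry = merged.setdefault(
            hit["cluster_id"], {"title": hit["title"], "sources": []}
        )
        if hit["source"] not in entry["sources"]:
            entry["sources"].append(hit["source"])
    # Dicts preserve insertion order, so the original ranking is kept.
    return list(merged.values())
```

Each document then appears once in the result list, in the position of its highest-ranked copy, with every website that published it listed alongside.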
  21. A device, comprising:
    one or more processors;
    a memory; and
    one or more programs stored in the memory which, when executed by the one or more processors, perform the following operations:
    obtaining documents from more than one website source;
    standardizing the obtained documents;
    clustering documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of document collections;
    calculating the similarity of the documents in each document collection, and selecting qualifying document collections according to the calculated similarity;
    clustering identical documents in the selected qualifying document collections, and consolidating the publication sources of the identical documents.
  22. A device, comprising:
    one or more processors;
    a memory; and
    one or more programs stored in the memory which, when executed by the one or more processors, perform the following operations:
    receiving a keyword input by a user;
    searching, according to the keyword, for documents associated with the keyword;
    presenting identical documents in aggregated form in the search results, and presenting the publication sources of each document;
    wherein the identical documents are normalized by the method according to any one of claims 1 to 9.
  23. A computer storage medium encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the following operations:
    obtaining documents from more than one website source;
    standardizing the obtained documents;
    clustering documents with similar titles, according to the similarity of the titles of the standardized documents, to obtain a plurality of document collections;
    calculating the similarity of the documents in each document collection, and selecting qualifying document collections according to the calculated similarity;
    clustering identical documents in the selected qualifying document collections, and consolidating the publication sources of the identical documents.
  24. A computer storage medium encoded with a computer program which, when executed by one or more computers, causes the one or more computers to perform the following operations:
    receiving a keyword input by a user;
    searching, according to the keyword, for documents associated with the keyword;
    presenting identical documents in aggregated form in the search results, and presenting the publication sources of each document;
    wherein the identical documents are normalized by the method according to any one of claims 1 to 9.
PCT/CN2016/087058 2015-12-07 2016-06-24 Document normalization method, document searching method, corresponding apparatuses, device, and storage medium WO2017096777A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510888584.5A CN105447169B (en) 2015-12-07 2015-12-07 Document normalizing method, literature search method and corresponding intrument
CN201510888584.5 2015-12-07

Publications (1)

Publication Number Publication Date
WO2017096777A1 (en) 2017-06-15

Family

ID=55557345

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/087058 WO2017096777A1 (en) 2015-12-07 2016-06-24 Document normalization method, document searching method, corresponding apparatuses, device, and storage medium

Country Status (2)

Country Link
CN (1) CN105447169B (en)
WO (1) WO2017096777A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595713A (en) * 2018-05-14 2018-09-28 中国科学院计算机网络信息中心 The method and apparatus for determining object set
CN112365374A (en) * 2020-06-19 2021-02-12 支付宝(杭州)信息技术有限公司 Standard case routing determination method, device and equipment
CN112434134A (en) * 2020-12-04 2021-03-02 中国科学院深圳先进技术研究院 Search model training method and device, terminal equipment and storage medium

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
CN105447169B (en) * 2015-12-07 2019-02-12 百度在线网络技术(北京)有限公司 Document normalizing method, literature search method and corresponding intrument
CN106708934A (en) * 2016-11-16 2017-05-24 百度在线网络技术(北京)有限公司 Artificial intelligence-based academic literature search method and apparatus
CN108132941B (en) * 2016-11-30 2021-03-26 北京国双科技有限公司 Processing method and device for incidence relation of legal document
CN107665443B (en) * 2017-05-10 2019-10-25 平安科技(深圳)有限公司 Obtain the method and device of target user

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101350032A (en) * 2008-09-23 2009-01-21 胡辉 Method for judging whether web page content is identical or not
CN101807211A (en) * 2010-04-30 2010-08-18 南开大学 XML-based retrieval method oriented to constraint on integrated paths of large amount of small-size XML documents
CN101976259A (en) * 2010-11-03 2011-02-16 百度在线网络技术(北京)有限公司 Method and device for recommending series documents
CN102654879A (en) * 2011-03-04 2012-09-05 中兴通讯股份有限公司 Search method and device
CN105447169A (en) * 2015-12-07 2016-03-30 百度在线网络技术(北京)有限公司 Document normalization method, document searching method and corresponding apparatus

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US20090094210A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Intelligently sorted search results
CN102012917B (en) * 2010-11-26 2013-02-20 百度在线网络技术(北京)有限公司 Information processing device and method
CN103164449B (en) * 2011-12-15 2016-04-13 腾讯科技(深圳)有限公司 A kind of exhibiting method of Search Results and device
CN103514282A (en) * 2013-09-29 2014-01-15 北京奇虎科技有限公司 Method and device for displaying search results of videos
US20150134597A1 (en) * 2013-11-08 2015-05-14 Ubc Late Stage, Inc. Document analysis and processing systems and methods

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN108595713A (en) * 2018-05-14 2018-09-28 中国科学院计算机网络信息中心 The method and apparatus for determining object set
CN108595713B (en) * 2018-05-14 2020-09-29 中国科学院计算机网络信息中心 Method and device for determining object set
CN112365374A (en) * 2020-06-19 2021-02-12 支付宝(杭州)信息技术有限公司 Standard case routing determination method, device and equipment
CN112434134A (en) * 2020-12-04 2021-03-02 中国科学院深圳先进技术研究院 Search model training method and device, terminal equipment and storage medium
WO2022116324A1 (en) * 2020-12-04 2022-06-09 中国科学院深圳先进技术研究院 Search model training method, apparatus, terminal device, and storage medium
CN112434134B (en) * 2020-12-04 2023-10-20 中国科学院深圳先进技术研究院 Search model training method, device, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN105447169B (en) 2019-02-12
CN105447169A (en) 2016-03-30

Similar Documents

Publication Publication Date Title
WO2017096777A1 (en) Document normalization method, document searching method, corresponding apparatuses, device, and storage medium
WO2019091026A1 (en) Knowledge base document rapid search method, application server, and computer readable storage medium
US9323794B2 (en) Method and system for high performance pattern indexing
Kaleel et al. Cluster-discovery of Twitter messages for event detection and trending
Pereira et al. Using web information for author name disambiguation
Urvoy et al. Tracking web spam with html style similarities
KR101715432B1 (en) Word pair acquisition device, word pair acquisition method, and recording medium
WO2017020451A1 (en) Information push method and device
US20150186503A1 (en) Method, system, and computer readable medium for interest tag recommendation
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
EP2092419A2 (en) Method and system for high performance data metatagging and data indexing using coprocessors
WO2017113592A1 (en) Model generation method, word weighting method, apparatus, device and computer storage medium
WO2011057497A1 (en) Method and device for mining and evaluating vocabulary quality
WO2020248379A1 (en) Method for searching for similar network pages, and apparatus
WO2012159558A1 (en) Natural language processing method, device and system based on semantic recognition
Zhao et al. A novel burst-based text representation model for scalable event detection
WO2022116324A1 (en) Search model training method, apparatus, terminal device, and storage medium
CN108763961B (en) Big data based privacy data grading method and device
Gunawan et al. Multi-document summarization by using textrank and maximal marginal relevance for text in Bahasa Indonesia
US20140181097A1 (en) Providing organized content
US20100063966A1 (en) Method for fast de-duplication of a set of documents or a set of data contained in a file
Zhang et al. Effective and Fast Near Duplicate Detection via Signature‐Based Compression Metrics
US10380195B1 (en) Grouping documents by content similarity
CN113157857B (en) Hot topic detection method, device and equipment for news
Setty Distributed and dynamic clustering for news events

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16871981; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 16871981; Country of ref document: EP; Kind code of ref document: A1)