WO2017096777A1 - Document normalization method, document search method, corresponding apparatuses, device and storage medium - Google Patents

Document normalization method, document search method, corresponding apparatuses, device and storage medium

Info

Publication number
WO2017096777A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
documents
similarity
publication
key
Prior art date
Application number
PCT/CN2016/087058
Other languages
English (en)
Chinese (zh)
Inventor
黄岳
马晋
张显
张晓婧
曹冰
徐学睿
李玉鹏
杰艺
Original Assignee
百度在线网络技术(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百度在线网络技术(北京)有限公司
Publication of WO2017096777A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Definitions

  • the present invention relates to the field of computer application technologies, and in particular, to a document normalization method, a document search method, and corresponding devices, devices, and storage media.
  • a given document may be available from multiple electronic sources, and the data quality of each source channel differs.
  • the user cannot obtain all the electronic sources of the same document in one place and can only search and view them one source at a time, which makes it hard to pick out high-quality, licensed resources and degrades the user experience.
  • the invention provides a document normalization method, a literature search method and a corresponding device, so as to achieve the normalization of the same document, and provide a basis for improving the effect of the literature search.
  • a document normalization method including:
  • the documents of similar titles are clustered to obtain a plurality of document collections;
  • the similarity of the titles of the documents is determined in at least one of the following ways:
  • the Hamming distance between the titles of the documents is calculated, and the similarity between the titles of the documents is determined according to the Hamming distance.
  • the method before the calculating the similarity of the document in each document collection, the method further comprises:
  • according to the similarity of at least one attribute among the author, the publication source, and the publication year of the standardized documents, similar documents are clustered to obtain a plurality of document collections.
  • the similarity of the at least one attribute among the author, the publication source, and the publication year is determined in at least one of the following ways:
  • the authors of the standardized literature, the source of publication and the year of publication are combined into a string, the Hamming distance between the merged strings is calculated, and the author of the document, the source of the publication, and the similarity of the publication year are determined according to the Hamming distance.
  • the method further comprises:
  • a collection of documents whose Hamming distance is less than or equal to a preset threshold is screened.
  • the screening of the qualified document collection according to the similarity of the calculated documents comprises:
  • in each document collection, the similarity between the documents is calculated according to the preset weight corresponding to each document attribute, and a document collection in which the similarity between the documents is greater than a preset total score is determined to be a qualified document collection.
  • the clustering of identical documents performed on the selected qualified document collections includes:
  • the key-value pair forming process includes: each document is used in turn as a key, and the other documents are used as the values corresponding to that key, thereby forming at least two key-value pairs;
  • the key-value pair forming process is performed again on each of the obtained sets until the preset number of iterations is reached.
  • the standardization comprises:
  • Extract the longest sentence in the body part of the document abstract and calculate the signature of the longest sentence
  • the format of the publication time of the document is unified, or only the year of publication of the document is extracted.
  • the calculating a signature for the title of the document comprises:
  • the n-gram features of the title are extracted, wherein the value of n is a positive integer from 1 to N, and N is a preset positive integer;
  • the signature of the title of the document is calculated based on the determined n-gram characteristics.
  • a document search method comprising:
  • a document normalization device comprising:
  • Standardization unit for standardizing the acquired documents
  • a first clustering unit configured to cluster the documents of similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of document collections
  • a first screening unit for calculating the similarity of the documents in each document collection and screening out qualified document collections according to the calculated similarity
  • the second clustering unit is configured to perform clustering of the same documents on the selected qualified document collections, and summarize the publication sources of the same documents.
  • the first clustering unit determines the similarity of the titles of the documents in at least one of the following manners:
  • the Hamming distance between the titles of the documents is calculated, and the similarity between the titles of the documents is determined according to the Hamming distance.
  • the first clustering unit is further configured to: before the similarity of the documents in each document collection is calculated, cluster similar documents according to the similarity of at least one attribute among the author, the publication source, and the publication year of the standardized documents, to obtain a plurality of document collections.
  • the first clustering unit determines the similarity of the at least one attribute in at least one of the following manners:
  • the authors of the standardized literature, the source of publication and the year of publication are combined into a string, the Hamming distance between the merged strings is calculated, and the author of the document, the source of the publication, and the similarity of the publication year are determined according to the Hamming distance.
  • the method further includes:
  • a second screening unit configured to, after the plurality of document collections are obtained and before the similarity of the documents in each document collection is calculated, calculate the Hamming distance between the documents in each document collection and select the document collections whose Hamming distance is less than or equal to a preset threshold.
  • the first screening unit is specifically configured to: in each document collection, calculate the similarity between the documents according to the preset weight corresponding to each document attribute, and determine a document collection in which the similarity between the documents is greater than the preset total score to be a qualified document collection.
  • when performing clustering of identical documents on the selected qualified document collections, the second clustering unit performs:
  • the key-value pair forming process includes: each document is used in turn as a key, and the other documents are used as the values corresponding to that key, thereby forming at least two key-value pairs;
  • the key-value pair forming process is performed again on each of the obtained sets until the preset number of iterations is reached.
  • the standardization unit is specifically configured to:
  • Extract the longest sentence in the body part of the document abstract and calculate the signature of the longest sentence
  • the format of the publication time of the document is unified, or only the year of publication of the document is extracted.
  • the specific execution is:
  • extracting the n-gram features of the title, wherein n is a positive integer from 1 to N, and N is a preset positive integer;
  • the signature of the title of the document is calculated based on the determined n-gram characteristics.
  • a document search device comprising:
  • a receiving unit configured to receive a keyword input by a user
  • a matching unit configured to search for a document associated with the keyword according to the keyword
  • a presentation unit for synthesizing the same documents in the search results and presenting the publication sources of the respective documents, wherein the same documents are normalized using the device of the document normalization.
  • the present invention can accurately aggregate the same documents together and clearly provide the source of the literature.
  • the different publication sources of the same document can be brought together and presented to the user, which improves the user experience.
  • Figure 1 is a schematic diagram of a search document in the prior art.
  • FIG. 2 is a flow chart of a method for normalizing documents according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of standardization of an author in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of clustering the same document according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a search result presentation provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of signature processing of two titles in the reduce phase in the embodiment of the present invention.
  • FIG. 7 is a flow chart of another method for normalizing documents according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of an embodiment of the first clustering unit of FIG. 8.
  • FIG. 10 is a schematic structural diagram of an embodiment of the signature calculation unit of FIG. 8.
  • Figure 11 is a block diagram showing the structure of an apparatus for searching using the document normalization method.
  • FIG. 2 is a flow chart showing the first embodiment of the document normalization method of the present invention. As shown in Figure 2, the document normalization method includes:
  • documents are obtained from all websites by way of web crawling.
  • the normalization is to normalize the attributes of the document, the attributes of the document including title, author, abstract, publication source, publication time, and the like.
  • the standardization of the title includes the segmentation of the title, the unification of the full-width half-width, the removal of the punctuation of the title, and the like.
  • the title of a document is re:Coagulation and -Flocculation, which is re Coagulation and--Flocculation after standardization of the title.
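  • A minimal sketch of the title standardization steps named above (segmentation, full-width/half-width unification, punctuation removal). The function name `normalize_title` and the use of Unicode NFKC folding are assumptions for illustration, not the patent's stated implementation:

```python
import unicodedata


def normalize_title(title):
    """Return the title as a list of word tokens (hypothetical helper)."""
    # NFKC folds full-width Latin letters and punctuation to half-width forms.
    folded = unicodedata.normalize("NFKC", title)
    # Replace every Unicode punctuation character (categories P*) with a space.
    # Symbols such as "+" (category Sm) are deliberately left untouched.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in folded
    )
    # Whitespace splitting serves as a simple segmentation step.
    return cleaned.split()


tokens = normalize_title("ＡＢＣ： test, title!")
```

  • Whitespace splitting is only adequate for space-delimited languages; a real system handling Chinese titles would need a proper word segmenter.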
  • the principle of standardization for the author is to extract the full name of the first author of the document, divide the full name into multiple words, extract the first letter of each word, and finally sort all the extracted initials and use the result as the author corresponding to the document. When the full name is divided into words, a run of several uppercase letters abbreviated together is split so that each uppercase letter counts as one word.
  • FIG. 3 is a schematic diagram of standardization of the author in the present invention.
  • the names of an author obtained from the network are: Carlos N.Slia, Carlos Nascimento.Slia and SN Carlos.
  • Carlos N.Slia is divided into three words: Carlos, N and Slia.
  • the first letters of these three words are C, N, and S.
  • Carlos Nascimento.Slia was split into Carlos, Nascimento and Slia, taking the first letters of the three words C, N, S. SN Carlos is split into S, N, and Carlos.
  • the first letters of these three words are S, N, and C.
  • sorted in alphabetical order, the initials of all three names give CNS.
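  • The author standardization described above can be sketched as follows. The function name `author_key` and the choice to split on whitespace and periods are assumptions made for this illustration:

```python
import re


def author_key(full_name):
    """Sorted initials of the first author's name (hypothetical helper)."""
    words = []
    for token in re.split(r"[\s.]+", full_name.strip()):
        if not token:
            continue
        if len(token) > 1 and token.isupper():
            # Several capitals abbreviated together: each letter is one word.
            words.extend(token)
        else:
            words.append(token)
    # Take the first letter of each word and sort the initials.
    return "".join(sorted(word[0].upper() for word in words))


keys = [author_key(n) for n in
        ("Carlos N.Slia", "Carlos Nascimento.Slia", "SN Carlos")]
```

  • All three spellings of the name map to the same key, so they can be compared by simple string equality.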
  • the principle of standardization of the abstract is to extract the main part of the abstract, calculate the length of the sentence in the main part, find the sentence with the longest length, and calculate the signature of the abstract of the document. In other embodiments, sentences of other lengths may also be used.
  • the signature of the abstract of the document can be calculated using the MD5 message-digest algorithm.
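  • A minimal sketch of the abstract signature: pick the longest sentence and hash it with MD5. The sentence splitter (a regular expression on ., ! and ?) and the name `abstract_signature` are assumptions; the source does not say how the main part of the abstract is isolated, so that step is omitted here:

```python
import hashlib
import re


def abstract_signature(abstract):
    """MD5 hex digest of the longest sentence (hypothetical helper)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", abstract)
                 if s.strip()]
    if not sentences:
        return ""
    longest = max(sentences, key=len)
    return hashlib.md5(longest.encode("utf-8")).hexdigest()


sig_a = abstract_signature(
    "Short one. This is the much longer key sentence of the abstract. End.")
sig_b = abstract_signature(
    "This is the much longer key sentence of the abstract. Other bit.")
```

  • Two copies of an abstract that differ only in surrounding short sentences still receive the same signature, which is the point of hashing only the longest sentence.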
  • Publication sources include journals, conferences, and collections.
  • the standardization of the publication source mainly unifies its format, including unifying uppercase and lowercase, deleting symbols, and unifying full-width and half-width characters.
  • Standardization of publication time includes extracting year data from publication time.
  • for example, if the publication time is given as 1990, 1990-11-11, or 1990/11/11, standardization of the publication time yields 1990.
  • alternatively, the various expressions can be unified into a single format; for example, 1990-11-11, 1990/11/11, November 11, 1990, and 1990.11.11 are all unified into 1990-11-11.
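  • Year extraction from mixed date formats can be sketched with a single regular expression. The name `publication_year` is hypothetical, and the sketch assumes the first standalone four-digit run in the string is the year:

```python
import re


def publication_year(raw):
    """Return the first standalone 4-digit number, or None (hypothetical)."""
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", raw)
    return match.group(1) if match else None


years = [publication_year(s) for s in
         ("1990", "1990-11-11", "1990/11/11",
          "November 11, 1990", "1990.11.11")]
```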
  • the first set includes at least two documents.
  • S13 Calculate the similarity of the document in each first set, and select a plurality of first sets that meet the condition according to the similarity of the calculated documents.
  • the weight corresponding to each document attribute is set in advance, and the document attribute may be a feature such as the author, the abstract, the publication source, or the publication time.
  • the similarity of the documents in each first set is calculated according to the preset weights corresponding to the document attributes, and a first set in which the similarity of the documents is greater than the preset total score is determined to be a first set that meets the condition.
  • for example, suppose a first set contains two documents, a and b, the author weight is 4, the abstract weight is 2, the journal weight is 2, the publication time weight is 2, and the preset total score is 5.
  • the features of document a are: title: a General Stability Result for Viscoelastic Equations with Singular Kernels; author: MM Cavalcanti; journal: missing; published: 1999-02-11; abstract signature: b47b61cad59b93c5ad99e8820b71f4db. The features of document b are: title: a General Stabilities Result for Viscoelastic Equations With Singular Kernels; author: MC Murphy; journal: Journal of Applied & Computational Mathematics; published: 1999; abstract signature: b47b61cad59b93c5ad99e8820b71f4db. After standardization, the author of document a is the same as the author of document b, so the author contributes 1 * 4; by the same reasoning, the publication source of document a is different from the publication source of document b
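  • A minimal sketch of the weighted comparison, using the weights and threshold from the example (author 4, abstract 2, journal 2, publication time 2, total score 5). The exact-match scoring rule (1 if equal, 0 otherwise), the treatment of a missing value as a non-match, and all names in the code are assumptions; the source only shows the author term 1 * 4. The attribute values are taken as already standardized (author initials CMM, year 1999):

```python
WEIGHTS = {"author": 4, "abstract_signature": 2, "journal": 2, "year": 2}
TOTAL_SCORE_THRESHOLD = 5  # the preset total score from the example


def weighted_similarity(doc_a, doc_b, weights=WEIGHTS):
    """Sum the weights of attributes that are present and equal."""
    score = 0
    for attribute, weight in weights.items():
        value_a = doc_a.get(attribute)
        value_b = doc_b.get(attribute)
        # Assumption: a missing value (None) never counts as a match.
        if value_a is not None and value_a == value_b:
            score += weight
    return score


doc_a = {"author": "CMM", "journal": None, "year": "1999",
         "abstract_signature": "b47b61cad59b93c5ad99e8820b71f4db"}
doc_b = {"author": "CMM",
         "journal": "Journal of Applied & Computational Mathematics",
         "year": "1999",
         "abstract_signature": "b47b61cad59b93c5ad99e8820b71f4db"}

score = weighted_similarity(doc_a, doc_b)
is_qualified = score > TOTAL_SCORE_THRESHOLD
```

  • Under these assumptions the author, abstract signature, and year match (4 + 2 + 2) while the journal does not, so the pair scores 8 and exceeds the threshold of 5.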
  • the key-value pair forming process is performed separately for each selected first set that meets the condition. For such a first set, the process includes: using each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs; clustering, among all the key-value pairs obtained, the key-value pairs with the same key into one set; and performing the key-value pair forming process again on each obtained set, until a preset number of iterations is reached, the preset number of iterations being an empirical value.
  • the mapreduce model may be used to cluster the same documents in the plurality of selected first sets that meet the condition. Specifically, each selected first set is used as an input of the map stage, and the map stage outputs the key-value pairs corresponding to each first set. All these key-value pairs are sorted and used as the input data of the reduce stage, which clusters the key-value pairs with the same key into one set. The reduce stage therefore outputs multiple sets, and the documents in each set again form key-value pairs as the input of the next iteration.
  • the above method is iterated until the preset number of iterations is reached, so that the same documents in the selected first sets are aggregated into one class, and the class includes all the publication sources of the document.
  • for example, suppose each selected first set that meets the condition includes two documents,
  • and the selected first sets are (a, b), (b, c), (d, f).
  • the key-value pairs output in the map stage for these first sets are ab, ba, bc, cb, df, fd.
  • the documents in each resulting set again form key-value pairs as input to the map stage. After several iterations, (a, b, c) is obtained as one class and (d, f) as another.
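  • The iterative key-value clustering can be sketched in plain Python without a MapReduce framework. The function name, the in-memory dictionary standing in for the shuffle step, and the final removal of sets contained in a larger set are assumptions for illustration:

```python
from collections import defaultdict


def cluster_same_documents(first_sets, iterations=3):
    """Merge sets that share a document (hypothetical sketch)."""
    groups = {frozenset(s) for s in first_sets}
    for _ in range(iterations):
        # "Map": every document is a key; the documents grouped with it
        # (kept together with the key itself here) are its values.
        by_key = defaultdict(set)
        for group in groups:
            for doc in group:
                by_key[doc].update(group)
        # "Reduce": all values collected under one key become one set.
        groups = {frozenset(members) for members in by_key.values()}
    # Drop any set that is strictly contained in a larger one.
    maximal = [g for g in groups if not any(g < other for other in groups)]
    return sorted(sorted(g) for g in maximal)


classes = cluster_same_documents([("a", "b"), ("b", "c"), ("d", "f")])
```

  • For the example sets (a, b), (b, c), (d, f), the first round yields (a, b), (a, b, c), (b, c), (d, f) and the second round collapses them to (a, b, c) and (d, f). Longer chains of overlapping sets need more rounds, which is why the iteration count is an empirical value.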
  • a search method using the document normalization method of the embodiment includes: receiving a keyword input by a user; matching, according to the keyword, all the documents associated with the keyword; and sending all the associated documents and the publication sources of each associated document to the user. Specifically, a link to each publication source of each associated document is displayed to the user. In this way, the different publication source links of the same article are brought together for the user, which improves the user experience.
  • the present invention is a schematic diagram in which the publication sources of the same document are gathered together. Compared with the content shown in FIG. 1, the same document as "simulation study on angle measurement accuracy of star sensor" in FIG.
  • the sources of the same document include ReserchGate, SPIE, and reviews.spiedigita; the same document from these sources is aggregated for presentation, and the sources are shown for the user to select.
  • the similarity of the titles of the documents is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents, and S12 includes:
  • the documents with similar titles are clustered to obtain a plurality of first clusters.
  • the first cluster includes at least two documents.
  • the key-value pair forming process is performed on the signature of each title: the signature of the title is first divided into T blocks, T being a preset value; each block is used as a key, and the signature of the title is used as the value, so that the title corresponds to T key-value pairs.
  • each title will correspond to T key-value pairs.
  • the documents with similar titles can be clustered by using the mapreduce model to obtain a plurality of first clusters, and the mapreduce model includes a map phase and a reduce phase.
  • the input data is processed by the map, and then subjected to reduce processing to finally obtain the output data.
  • the output of the map phase is in the form of a key-value pair.
  • the T blocks of each title signature are used as the input of the map stage, and the map stage outputs the T key-value pairs corresponding to each title.
  • when two titles share the same key, the reduce stage clusters the documents corresponding to the two titles into a first cluster and outputs it.
  • for example, the signature of the title of document a is 1111111000100100, divided into the four blocks 1111, 1110, 0010, 0100, and the signature of the title of document b is 1101111000000000, divided into the four blocks 1101, 1110, 0000, 0000. It can be seen that the second block of the signature of the title of document a is identical to the second block of the signature of the title of document b.
  • the document a and the document b are clustered into a first cluster.
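  • A minimal sketch of the block-based candidate clustering, assuming 16-digit signatures split into T = 4 equal blocks (for document a, the signature formed by concatenating the blocks 1111, 1110, 0010, 0100). Using the pair (block position, block value) as the key is an assumption made here so that only blocks at the same position match; the source describes the key simply as the block. All names are hypothetical:

```python
from collections import defaultdict


def signature_blocks(signature, t):
    """Split a bit string into t equal blocks, tagged with their position."""
    size = len(signature) // t  # assumes len(signature) is divisible by t
    return [(i, signature[i * size:(i + 1) * size]) for i in range(t)]


def candidate_clusters(signatures, t=4):
    """Group documents whose signatures share at least one block."""
    buckets = defaultdict(set)
    # "Map": emit one (key, document) pair per block.
    for doc, signature in signatures.items():
        for key in signature_blocks(signature, t):
            buckets[key].add(doc)
    # "Reduce": documents collected under the same key form a cluster.
    return {frozenset(docs) for docs in buckets.values() if len(docs) >= 2}


clusters = candidate_clusters({
    "a": "1111111000100100",
    "b": "1101111000000000",
})
```

  • Sharing one block is only a cheap pre-filter. By the pigeonhole principle, two signatures that differ in fewer than T positions must agree on at least one of the T blocks, so no pair within that distance is missed.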
  • if the signatures of two titles have different values at exactly one position, the Hamming distance is 1; if they have different values at two positions, the Hamming distance is 2, and so on.
  • for example, the signature of the title of document a is 1111111000100100 and the signature of the title of document b is 1101111000000000. The two signatures differ at the third, eleventh, and fourteenth digits, so the Hamming distance between document a and document b is 3.
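  • The Hamming distance between two equal-length signatures can be computed as below. The sketch assumes the 16-digit signature 1111111000100100 for document a (the concatenation of its four blocks 1111, 1110, 0010, 0100); the function names are hypothetical:

```python
def hamming_distance(sig_a, sig_b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(sig_a, sig_b))


def differing_positions(sig_a, sig_b):
    """1-based positions where the two signatures differ."""
    return [i + 1 for i, (x, y) in enumerate(zip(sig_a, sig_b)) if x != y]


SIG_A = "1111111000100100"
SIG_B = "1101111000000000"
distance = hamming_distance(SIG_A, SIG_B)
positions = differing_positions(SIG_A, SIG_B)
```

  • A first cluster would then be kept only if this distance does not exceed the preset threshold.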
  • S123 Select the first clusters whose Hamming distance is less than or equal to a preset threshold; the selected first clusters that meet the condition are the plurality of first sets obtained by clustering documents with similar titles.
  • the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
  • the method further includes:
  • the preset length is 10 characters
  • the standardized title is R Genre classification via an lz78-based string kernel, and then divided into R and Genre classification via an lz78-based string kernel.
  • R is 1 character and its length is less than 10 characters, so R is excluded.
  • the title is "A B C", and if m is 3, the title of the document is characterized by [A, B, C, AB, BC, ABC].
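  • The n-gram feature extraction in this example can be sketched as follows, with n running from 1 to a preset maximum. The function name `ngram_features` is hypothetical, and words are joined without a separator only to match the notation of the example:

```python
def ngram_features(words, max_n):
    """All word n-grams for n = 1..max_n, shortest first."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            features.append("".join(words[start:start + n]))
    return features


features = ngram_features(["A", "B", "C"], 3)
```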
  • the signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the document is calculated to be an n-bit signature consisting of 0 and 1. For example, a 64-bit signature, a 16-bit signature, and the like.
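  • A minimal simhash sketch over a list of features (for instance the title's n-grams). Hashing each feature with MD5 and giving every feature equal weight are assumptions made here; the source names simhash but does not specify the underlying hash or any weighting:

```python
import hashlib


def simhash(features, bits=64):
    """Return a bit-string signature of the given length (up to 128 bits)."""
    if not 1 <= bits <= 128:
        raise ValueError("bits must be between 1 and 128 for an MD5 digest")
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit of the signature is 1 where the positive votes win.
    return "".join("1" if counts[bits - 1 - i] > 0 else "0"
                   for i in range(bits))


signature_64 = simhash(["A", "B", "C", "AB", "BC", "ABC"])
signature_16 = simhash(["A", "B", "C", "AB", "BC", "ABC"], bits=16)
```

  • Titles that share most of their features receive signatures that differ in only a few bits, which is what makes the Hamming distance between signatures a usable similarity measure.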
  • FIG. 7 is a schematic flowchart of Embodiment 2 of the document normalization method of the present invention. As shown in FIG. 7, the document normalization method includes:
  • documents are obtained from all websites by way of web crawling.
  • the normalization is to normalize the attributes of the document, the attributes of the document including title, author, abstract, publication source, publication time, and the like.
  • the standardization of the title includes the segmentation of the title, the unification of the full-width half-width, the removal of the punctuation of the title, and the like.
  • the principle of standardization for the author is to extract the full name of the first author of the document, divide the full name into multiple words, extract the first letter of each word, and finally sort all the extracted initials and use the result as the author corresponding to the document.
  • when the first author's full name is divided into words, a run of several uppercase letters abbreviated together is split so that each uppercase letter counts as one word.
  • the principle of standardization of the abstract is to extract the main part of the abstract, calculate the length of the sentence in the main part, find the sentence with the longest length, and calculate the signature of the abstract of the document. In other embodiments, sentences of other lengths may also be used.
  • the signature of the abstract of the document can be calculated using the MD5 message-digest algorithm.
  • Publication sources include journals, conferences, and collections.
  • the standardization of the publication source mainly unifies its format, including unifying uppercase and lowercase, deleting symbols, and unifying full-width and half-width characters.
  • Standardization of publication time includes the extraction of the year from the publication time.
  • the publication time will be in a variety of time formats, and the standardization of publication time includes the extraction of years from a variety of different time formats.
  • the weight corresponding to the document attribute is set in advance, and the document attribute may be a feature such as an author, a summary, a publication source, and a publication time.
  • the similarity of the documents in each first set and each second set is calculated according to the preset weight corresponding to each document attribute, and a first set or second set in which the similarity of the documents is greater than the preset total score is determined to be a first set or second set that meets the condition.
  • S24 Perform clustering of the same document on the plurality of selected first sets that are selected and the plurality of selected second sets that are selected, and summarize the publication sources of the same documents. Links to the publication sources of the same literature can be summarized.
  • the key-value pair forming process is performed separately for each selected first set and second set that meets the condition. For such a first set or second set, the process includes: using each document in turn as a key and the other documents as the values corresponding to that key, thereby forming at least two key-value pairs; clustering, among all the key-value pairs obtained, the key-value pairs having the same key into one set; and performing the key-value pair forming process again on each obtained set until the preset number of iterations is reached, so that the same documents in the selected first sets and second sets are aggregated into one class.
  • the preset number of iterations is an empirical value.
  • the mapreduce model may be used to cluster the same documents in the plurality of selected first sets and second sets. Specifically, each selected first set and second set is used as an input of the map stage, and the map stage outputs the key-value pairs corresponding to each first set and second set that meets the condition. All these key-value pairs are sorted and used as the input data of the reduce stage, which clusters the key-value pairs with the same key into one set. The reduce stage therefore outputs multiple sets, and the documents in each set again form key-value pairs as the input of the next iteration. The above method is iterated until the preset number of iterations is reached, so that the same documents in the selected first sets and second sets are aggregated into one class, and the class includes all the publication sources of the document.
  • a search method using the document normalization method of the embodiment includes: receiving a keyword input by a user; matching, according to the keyword, all the documents associated with the keyword; and sending all the associated documents and the publication sources of each associated document to the user. Specifically, a link to each publication source of each associated document is displayed to the user. In this way, the different publication source links of the same article are brought together for the user, which improves the user experience.
  • the similarity of the titles of the documents is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents.
  • the similarity of the author, the publication source, and the publication year of the documents is determined by merging the author, the publication source, and the publication year of the standardized document into a string, calculating the signature of the merged string, and then using the similarity between the signatures of the merged strings and the Hamming distance between the merged strings. In that case, S22 includes:
  • S220 Calculate the signature of the title of the document according to the title of the standardized document, and combine the first author of the standardized document, the publication source and the publication year into a character string, and calculate the signature of the merged character string.
  • S221 According to the signature of the title of each document, cluster documents with similar titles to obtain a plurality of first clusters; and according to the signature of the merged string of each document, cluster documents with similar merged strings to obtain a plurality of second clusters.
  • the first cluster or the second cluster includes at least two documents.
  • the key-value pair forming process is performed on the signature of each title: the signature of the title is first divided into T blocks, T being a preset value; each block is used as a key, and the signature of the title is used as the value, so that the title corresponds to T key-value pairs.
  • each title will correspond to T key-value pairs.
  • the documents corresponding to the two titles are clustered into a first cluster output.
  • the above method is performed for the signature of the merged character string of each document.
  • the documents corresponding to the two merged strings are clustered into a second cluster output.
  • the documents with similar titles can be clustered by using the mapreduce model to obtain a plurality of first clusters, and the mapreduce model includes a map phase and a reduce phase.
  • the input data is processed by the map stage and then by the reduce stage, and finally the output data is obtained.
  • the output of the map phase is in the form of a key-value pair.
  • the T blocks of each title are respectively input as the map stage, and the T key-value pairs corresponding to each title are output in the map stage.
  • the reduce stage clusters the documents corresponding to the two titles into a first cluster output.
  • the mapreduce model can be used to cluster two merged strings with similar documents to obtain multiple second clusters.
  • If the values at one position of the signatures of two titles differ, the Hamming distance is 1; if the values differ at two positions, the Hamming distance is 2; and so on. Likewise, if the values at one position of the signatures of two merged strings differ, the Hamming distance is 1; if the values differ at two positions, the Hamming distance is 2; and so on.
  • S223: Select the first clusters whose Hamming distance is less than or equal to a preset threshold; the selected eligible first clusters are the plurality of first sets obtained by clustering documents with similar titles. Likewise, select the second clusters whose Hamming distance is less than or equal to the preset threshold; the selected eligible second clusters are the plurality of second sets obtained by clustering the documents corresponding to similar merged strings.
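The distance calculation and the screening in S223 can be illustrated with a short sketch. It assumes that signatures are equal-length strings of "0" and "1" and that each cluster is a pair of document identifiers; the function names and the `signatures` mapping are hypothetical.

```python
def hamming_distance(sig_a: str, sig_b: str) -> int:
    """Count the positions at which two equal-length signatures differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(sig_a, sig_b))


def filter_clusters(clusters, signatures, threshold):
    """Keep the two-document clusters whose Hamming distance is <= threshold."""
    return [
        (a, b)
        for a, b in clusters
        if hamming_distance(signatures[a], signatures[b]) <= threshold
    ]
```

For example, `hamming_distance("1011", "1001")` is 1 because only the third position differs.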
  • the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
  • The degree of similarity of the author, publication source, and publication year of the documents may be determined based on the similarity between the signatures of the merged strings of the documents or the Hamming distance between the merged strings of the documents.
  • the method further includes:
  • the signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the calculated document is an n-bit signature consisting of 0 and 1.
  • the simhash algorithm can be used to calculate the signature of the merged string.
  • the calculated signature of the merged string is an n-bit signature consisting of 0 and 1.
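As an illustration of how an n-bit signature of 0s and 1s can be derived with simhash, the sketch below hashes each feature, lets every hash bit vote +1 or -1, and keeps the sign of each tally. Using MD5 (truncated to `bits`, so `bits` must not exceed 128) as the per-feature hash and giving every feature a weight of 1 are assumptions of this sketch; the text does not specify either.

```python
import hashlib


def simhash(features, bits: int = 64) -> str:
    """Return a simhash signature as a string of '0' and '1' of length `bits`."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            # Each bit of the feature hash votes +1 (bit set) or -1 (bit clear).
            counts[i] += 1 if (h >> i) & 1 else -1
    # A position becomes 1 when the votes for that bit are positive.
    return "".join("1" if c > 0 else "0" for c in counts)
```

Because similar feature sets produce mostly identical votes, similar titles or merged strings yield signatures that differ in only a few positions.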
  • Figure 8 is a schematic block diagram showing the construction of an embodiment of the apparatus of the present invention.
  • the apparatus includes: an obtaining unit 100, a normalization unit 101, a first clustering unit 102, a first screening unit 103, and a second clustering unit 104.
  • the obtaining unit 100 is configured to obtain documents of all website sources.
  • documents are obtained from all websites by way of web crawling.
  • the normalization unit 101 is for standardizing the acquired documents.
  • The standardization standardizes the attributes of a document, which include the title, author, abstract, publication source, publication time, and the like.
  • The standardization of the title by the normalization unit 101 includes segmenting the title, unifying full-width and half-width characters, removing punctuation from the title, and the like.
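A minimal sketch of the title standardization, assuming Unicode NFKC normalization for the full-width/half-width unification and whitespace splitting for the segmentation (a Chinese title would need a real word segmenter). Lowercasing is also an assumption of this sketch, and `normalize_title` is a hypothetical name.

```python
import unicodedata


def normalize_title(title: str) -> list[str]:
    """Fold full-width characters, strip punctuation, and split into tokens."""
    # NFKC maps full-width forms such as "Ａ" or "：" to their half-width equivalents.
    text = unicodedata.normalize("NFKC", title)
    # Replace punctuation (P*) and symbol (S*) characters with spaces.
    text = "".join(
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch for ch in text
    )
    return text.lower().split()
```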
  • To standardize the author, the normalization unit 101 extracts the full name of the first author of the document, divides the full name into a plurality of words, extracts the initial of each word, and finally sorts all the extracted initials to form the author of the document.
  • When the full name of the first author is divided into words and several uppercase letters are abbreviated together, each uppercase letter is treated as one word.
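The author standardization can be sketched as below. The regular expression used to find words and the conversion of initials to uppercase are assumptions; note also that this simple rule would split a surname written entirely in capitals, which a production system would have to handle separately.

```python
import re


def normalize_author(first_author: str) -> str:
    """Reduce the first author's full name to its sorted initials."""
    words = re.findall(r"[^\W\d_]+", first_author)  # runs of letters only
    initials = []
    for word in words:
        if len(word) > 1 and word.isupper():
            # Letters abbreviated together ("JA") each count as one word.
            initials.extend(word)
        else:
            initials.append(word[0])
    return "".join(sorted(ch.upper() for ch in initials))
```

With this rule, "John A. Smith", "JA Smith", and "Smith, John A." all map to the same author string, "AJS".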
  • To standardize the abstract, the normalization unit 101 extracts the main part of the abstract, calculates the length of each sentence in the main part, finds the longest sentence, and calculates the signature of the abstract of the document from it. In other embodiments, sentences of other lengths may also be used.
  • The signature of the abstract of the document can be calculated using the MD5 message-digest algorithm.
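A small sketch of the abstract signature, assuming that the whole abstract is treated as its "main part" and that sentences end at the listed Western or Chinese terminators; both are assumptions, since the text does not define either.

```python
import hashlib
import re


def abstract_signature(abstract: str) -> str:
    """Return the MD5 hex digest of the longest sentence of the abstract."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", abstract) if s.strip()]
    if not sentences:
        return ""
    longest = max(sentences, key=len)  # the first one wins on ties
    return hashlib.md5(longest.encode("utf-8")).hexdigest()
```

Two copies of a document whose abstracts share the same longest sentence receive the same signature, even if the surrounding text was truncated differently by each site.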
  • The normalization unit 101 standardizes the publication source mainly by unifying its format, including unifying capitalization, deleting symbols, and unifying full-width and half-width characters.
  • the normalization of the publication time by the normalization unit 101 includes extracting the year from the publication time.
  • the publication time of the document has various time formats, and the normalization unit 101 can extract the year from various different time formats.
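Extracting the year from differently formatted publication times can be sketched with one regular expression. The accepted range (1500 to 2099) and the decision to return `None` when no four-digit year stands alone are assumptions of this sketch.

```python
import re

# A four-digit year that is not embedded in a longer run of digits.
_YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def extract_year(publication_time: str):
    """Return the first plausible four-digit year in the string, or None."""
    match = _YEAR.search(publication_time)
    return int(match.group(1)) if match else None
```

The same function handles inputs such as "2015-12-07", "07 Dec. 2015", and "2015年12月".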
  • the first clustering unit 102 is configured to cluster the documents of similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of first sets.
  • the first set includes at least two documents.
  • The first screening unit 103 is configured to calculate the similarity of the documents in each first set and, according to the calculated similarity, select a plurality of first sets that meet the condition.
  • Specifically, the first screening unit 103 is configured to preset a weight for each document attribute, where the document attributes may be the author, abstract, publication source, publication time, and the like.
  • The similarity of the documents in each first set is calculated according to the preset weights of the document attributes, and a first set in which the similarity of the documents is greater than a preset total score is determined to be a first set that meets the condition.
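The weighted screening can be sketched as follows. The weight values, the exact-match comparison per attribute, and the requirement that every pair of documents in a set exceed the total score are all assumptions made for illustration; the text only states that preset weights and a preset total score are used.

```python
from itertools import combinations

# Hypothetical integer weights per standardized attribute.
WEIGHTS = {"author": 3, "abstract": 4, "source": 1, "year": 2}


def document_similarity(doc_a: dict, doc_b: dict, weights=WEIGHTS) -> int:
    """Sum the weights of the attributes on which the two documents agree."""
    return sum(
        weight
        for attr, weight in weights.items()
        if doc_a.get(attr) is not None and doc_a.get(attr) == doc_b.get(attr)
    )


def set_is_eligible(documents, min_score, weights=WEIGHTS) -> bool:
    """A set is eligible when every pair of documents scores above min_score."""
    return all(
        document_similarity(a, b, weights) > min_score
        for a, b in combinations(documents, 2)
    )
```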
  • the second clustering unit 104 is configured to cluster the same documents in the selected plurality of eligible first sets, and summarize the publication sources of the same documents. Links to the publication sources of the same literature can be summarized.
  • Specifically, the second clustering unit 104 is configured to perform a key-value pair forming process for each selected eligible first set. The key-value pair forming process in an eligible first set includes: using each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs; clustering, over all the key-value pairs obtained, the key-value pairs that have the same key into one set; and performing the key-value pair forming process again on each set obtained, until a preset number of iterations is reached, so that the same documents in the selected eligible first sets are aggregated into one class.
  • The mapreduce model may be used to cluster the same documents in the selected eligible first sets. Specifically, each selected eligible first set is used as input to the map stage, which outputs the key-value pairs corresponding to that set. All the key-value pairs corresponding to the selected eligible first sets are sorted, and the sorted key-value pairs are used as the input data of the reduce stage, in which the key-value pairs with the same key are clustered into one set, so that the reduce stage outputs a plurality of sets. The documents in each set again form a plurality of key-value pairs, which are used as input to the reduce stage; this is iterated until the preset number of iterations is reached, so that the same documents in the selected eligible first sets are aggregated into one class, and the class includes all publication sources of the document.
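The iterative key-value merging can be sketched in memory as follows. The function names, the default of three iterations, and the final step that drops sets contained in a larger set are assumptions of this sketch.

```python
from collections import defaultdict


def merge_round(sets):
    """One map/reduce round over an iterable of document sets."""
    grouped = defaultdict(set)
    for docs in sets:
        for key in docs:
            # Map: each document is a key, the other documents are its value.
            grouped[key] |= set(docs) - {key}
    # Reduce: a key together with all values collected for it forms one set.
    return {frozenset({key} | values) for key, values in grouped.items()}


def merge_same_documents(sets, iterations: int = 3):
    """Repeat the round a preset number of times and return the merged classes."""
    current = {frozenset(s) for s in sets}
    for _ in range(iterations):
        current = merge_round(current)
    # Drop sets that are strictly contained in a larger set.
    return [set(s) for s in current if not any(s < other for other in current)]
```

For the input `[{"a", "b"}, {"b", "c"}, {"d", "e"}]`, the overlapping sets merge into `{"a", "b", "c"}` while `{"d", "e"}` stays separate. Longer chains of overlapping sets need more rounds to merge fully, which is why the number of iterations is a preset parameter.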
  • the first clustering unit 102 includes a signature calculation unit 1020, a signature clustering unit 1021, a distance calculation unit 1022, and a second screening unit 1023.
  • the signature calculation unit 1020 is configured to calculate a signature of the title of the document based on the title of the standardized document.
  • the signature clustering unit 1021 is configured to cluster two documents with similar titles according to the signature of the title of each document to obtain a plurality of first clusters.
  • the first cluster includes at least two documents.
  • Specifically, the signature clustering unit 1021 is configured to perform a key-value pair forming process on the signature of each title: the signature of the title is first divided into T blocks, where T is a preset value; each block is used as a key and the signature of the title as the value. In this way, each title corresponds to T key-value pairs. When the T key-value pairs of two titles have at least one key in common, the documents corresponding to the two titles are clustered into a first cluster and output.
  • the documents with similar titles can be clustered by using the mapreduce model to obtain a plurality of first clusters, and the mapreduce model includes a map phase and a reduce phase.
  • the input data is processed by the map, and then subjected to reduce processing to finally obtain the output data.
  • the output of the map phase is in the form of a key-value pair.
  • The T blocks of the signature of each title are used as input to the map stage, which outputs the T key-value pairs corresponding to each title.
  • When two titles have at least one key in common, the reduce stage clusters the documents corresponding to the two titles into a first cluster and outputs it.
  • the distance calculation unit 1022 is configured to calculate a Hamming distance between documents in each of the first clusters based on the signature of the title of the document in each of the first clusters.
  • The second screening unit 1023 selects the first clusters whose Hamming distance is less than or equal to a preset threshold; the selected eligible first clusters are the plurality of first sets obtained by clustering documents with similar titles.
  • the similarity of the titles of the documents is determined based on the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents. In other embodiments, the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
  • FIG. 10 is a schematic structural diagram of an embodiment of a signature calculation unit of the present invention.
  • The signature calculation unit 1020 includes an extracting unit 10201, a determining unit 10202, and a calculating unit 10203.
  • The extracting unit 10201 is configured to divide the title of the document into a plurality of subtitles (for example, the division may be made according to uppercase letters), calculate the length of each subtitle, and extract the subtitles whose length is greater than a preset length.
  • the determining unit 10202 is configured to determine an n-gram feature of the extracted subtitle, the value of the n is from 1 to N, and the value of the N is set according to the length of the extracted subtitle.
  • the calculating unit 10203 is configured to calculate a signature of a title of the document according to the determined n-gram feature.
  • the signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the document is calculated to be an n-bit signature consisting of 0 and 1. For example, it is a 64-bit signature, a 16-bit signature, and the like.
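The subtitle extraction and n-gram feature step can be sketched as below. Splitting subtitles at punctuation, measuring length in whitespace-separated tokens, and capping N with a `max_n` parameter are assumptions of this sketch; the text only says that subtitles longer than a preset length are kept and that n ranges from 1 to an N set according to the subtitle length.

```python
import re


def title_ngram_features(title: str, min_length: int = 2, max_n: int = 3) -> list[str]:
    """Return the 1..N-gram features of the sufficiently long subtitles."""
    subtitles = [s.strip() for s in re.split(r"[:;,.!?]+", title) if s.strip()]
    features = []
    for subtitle in subtitles:
        tokens = subtitle.lower().split()
        if len(tokens) <= min_length:
            continue  # keep only subtitles longer than the preset length
        upper = min(max_n, len(tokens))  # N depends on the subtitle length
        for n in range(1, upper + 1):
            for i in range(len(tokens) - n + 1):
                features.append(" ".join(tokens[i:i + n]))
    return features
```

The resulting feature list is the kind of input a simhash-style signature function would consume.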
  • The obtaining unit 100, the normalization unit 101, the first clustering unit 102, the first screening unit 103, and the second clustering unit 104 of the apparatus are also used in the fourth embodiment, as detailed below:
  • the obtaining unit 100 is configured to obtain documents of all website sources.
  • documents are obtained from all websites by way of web crawling.
  • the normalization unit 101 is for standardizing the acquired documents.
  • the normalization is to normalize the attributes of the document, the attributes of the document including title, author, abstract, publication source, publication time, and the like.
  • the normalization unit 101 is used for normalization of the title, including segmentation of the title, unification of the full-width half-width, removal of the punctuation of the title, and the like.
  • To standardize the author, the normalization unit 101 extracts the full name of the first author of the document, divides the full name into a plurality of words, extracts the initial of each word, and finally sorts all the extracted initials to form the author of the document.
  • When the full name of the first author is divided into words and several uppercase letters are abbreviated together, each uppercase letter is treated as one word.
  • To standardize the abstract, the normalization unit 101 extracts the main part of the abstract, calculates the length of each sentence in the main part, finds the longest sentence, and calculates the signature of the abstract of the document from it. In other embodiments, sentences of other lengths may also be used.
  • The signature of the abstract of the document can be calculated using the MD5 message-digest algorithm.
  • The normalization unit 101 standardizes the publication source mainly by unifying its format, including unifying capitalization, deleting symbols, and unifying full-width and half-width characters.
  • the normalization of the publication time by the normalization unit 101 includes extracting the year from the publication time.
  • The publication time of a document may be in various time formats, and the normalization unit 101 can extract the year from each of them. Of course, besides extracting only the year, the publication times may also be unified into the same form of expression.
  • The first clustering unit 102 is configured to cluster documents with similar titles according to the similarity of the titles of the standardized documents to obtain a plurality of first sets and, in parallel, to cluster similar documents according to the similarity of the first author, publication source, and publication year of the standardized documents to obtain a plurality of second sets.
  • The first screening unit 103 is configured to calculate the similarity of the documents in each first set and select a plurality of eligible first sets according to the calculated similarity, and to calculate the similarity of the documents in each second set and select a plurality of eligible second sets according to the calculated similarity.
  • Specifically, the first screening unit 103 is configured to calculate, for each first set and each second set, the similarity of the documents in the set according to the preset weights of the document attributes, and to determine a first set or second set in which the similarity of the documents is greater than a preset total score to be an eligible first set or second set.
  • The second clustering unit 104 is configured to cluster the same documents in the selected eligible first sets and the selected eligible second sets, and to summarize the publication sources of the same documents. Links to the publication sources of the same document can be summarized.
  • Specifically, the second clustering unit 104 is configured to perform a key-value pair forming process for each selected eligible first set and second set. The key-value pair forming process in an eligible first set or second set includes: using each document in turn as a key and the other documents as the value corresponding to that key, thereby forming at least two key-value pairs; clustering, over all the key-value pairs obtained, the key-value pairs that have the same key into one set; and performing the key-value pair forming process again on each set obtained, until a preset number of iterations is reached, so that the same documents in the selected eligible first sets and second sets are aggregated into one class.
  • The mapreduce model may be used to cluster the same documents in the selected eligible first sets and second sets. Specifically, each selected eligible first set and second set is used as input to the map stage, which outputs the key-value pairs corresponding to that set. All the key-value pairs corresponding to the selected eligible first sets and second sets are sorted, and the sorted key-value pairs are used as the input data of the reduce stage, in which the key-value pairs with the same key are clustered into one set, so that the reduce stage outputs a plurality of sets. The documents in each set again form a plurality of key-value pairs, which are used as input to the reduce stage; this is iterated until the preset number of iterations is reached, so that the same documents in the selected eligible first sets and second sets are aggregated into one class, and the class includes all publication sources of the document.
  • the similarity of the titles of the documents is determined according to the similarity between the title signatures of the documents and the Hamming distance between the titles of the documents.
  • The similarity of the author, publication source, and publication year of the documents is determined by merging the author, publication source, and publication year of each standardized document into one string, calculating the signature of the merged string, and then using the similarity between the signatures of the merged strings of the documents and the Hamming distance between the merged strings of the documents.
  • The signature calculation unit 1020, the signature clustering unit 1021, the distance calculation unit 1022, and the second screening unit 1023 of the first clustering unit 102 are also used in the following embodiment, as detailed below:
  • The signature calculation unit 1020 is configured to calculate the signature of the title of the document according to the title of the standardized document, and to merge the first author, publication source, and publication year of the standardized document into one string and calculate the signature of the merged string.
  • The signature clustering unit 1021 is configured to cluster documents with similar titles pairwise according to the signature of the title of each document to obtain a plurality of first clusters, and to cluster documents with similar merged strings pairwise according to the signature of the merged string of each document to obtain a plurality of second clusters.
  • the first cluster or the second cluster includes at least two documents.
  • Specifically, the signature clustering unit 1021 is configured to perform a key-value pair forming process on the signature of each title: the signature of the title is first divided into T blocks, where T is a preset value; each block is used as a key and the signature of the title as the value.
  • In this way, each title corresponds to T key-value pairs.
  • When the T key-value pairs of two titles have at least one key in common, the documents corresponding to the two titles are clustered into a first cluster and output.
  • The same method is applied to the signature of the merged string of each document.
  • When the T key-value pairs of two merged strings have at least one key in common, the documents corresponding to the two merged strings are clustered into a second cluster and output.
  • the documents with similar titles can be clustered by using the mapreduce model to obtain a plurality of first clusters, and the mapreduce model includes a map phase and a reduce phase.
  • the input data is processed by the map, and then subjected to reduce processing to finally obtain the output data.
  • the output of the map phase is in the form of a key-value pair.
  • The T blocks of the signature of each title are used as input to the map stage, which outputs the T key-value pairs corresponding to each title.
  • When two titles have at least one key in common, the reduce stage clusters the documents corresponding to the two titles into a first cluster and outputs it.
  • Likewise, the mapreduce model can be used to cluster documents with similar merged strings pairwise to obtain a plurality of second clusters.
  • The distance calculation unit 1022 is configured to calculate the Hamming distance between the documents in each first cluster according to the signatures of the titles of the documents in that cluster, and to calculate the Hamming distance between the documents in each second cluster according to the signatures of the merged strings of the documents in that cluster.
  • If the values at one position of the signatures of two titles differ, the Hamming distance is 1; if the values differ at two positions, the Hamming distance is 2; and so on. Likewise, if the values at one position of the signatures of two merged strings differ, the Hamming distance is 1; if the values differ at two positions, the Hamming distance is 2; and so on.
  • The second screening unit 1023 is configured to select the first clusters whose Hamming distance is less than or equal to a preset threshold, the selected eligible first clusters being the plurality of first sets obtained by clustering documents with similar titles, and to select the second clusters whose Hamming distance is less than or equal to the preset threshold, the selected eligible second clusters being the plurality of second sets obtained by clustering the documents corresponding to similar merged strings.
  • the similarity of the titles of the documents may be determined based on the similarity between the title signatures of the documents or the Hamming distance between the titles of the documents.
  • The degree of similarity of the author, publication source, and publication year of the documents may be determined based on the similarity between the signatures of the merged strings of the documents or the Hamming distance between the merged strings of the documents.
  • The extracting unit 10201, the determining unit 10202, and the calculating unit 10203 of the signature calculation unit 1020 are also used in the following embodiment, as detailed below:
  • The extracting unit 10201 is configured to divide the title of the document into a plurality of subtitles (for example, the division may be made according to uppercase letters), calculate the length of each subtitle, and extract the subtitles whose length is greater than a preset length.
  • the extracting unit 10201 is further configured to divide the merged character string into a plurality of substrings, calculate a length of each substring, and extract a substring whose length of the substring is greater than a preset length.
  • the determining unit 10202 is configured to determine an n-gram feature of the extracted subtitle, the value of the n is from 1 to N, and the value of the N is set according to the length of the extracted subtitle.
  • the determining unit 10202 is further configured to determine an n-gram feature of the extracted substring.
  • the calculating unit 10203 is configured to calculate a signature of a title of the document according to the determined n-gram feature.
  • the signature of the title of the document can be calculated using the simhash algorithm, and the signature of the title of the calculated document is an n-bit signature consisting of 0 and 1.
  • The calculating unit 10203 is further configured to calculate the signature of the merged string of the document according to the determined n-gram feature of the extracted substring.
  • the simhash algorithm can be used to calculate the signature of the merged string.
  • the calculated signature of the merged string is an n-bit signature consisting of 0 and 1.
  • The terms "first set" and "second set" differ only in wording and serve to distinguish the sets of documents obtained in the two ways.
  • the apparatus for searching using the document normalization method in the first embodiment or the second embodiment, as shown in FIG. 11, includes: a receiving unit 200, a matching unit 201, and a presentation unit 202.
  • the receiving unit 200 is configured to receive a keyword input by a user.
  • the matching unit 201 is configured to match all the documents associated with the keyword according to the keyword.
  • The presentation unit 202 is configured to present to the user all the associated documents together with the publication sources of each associated document.
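The cooperation of the receiving, matching, and presentation units can be illustrated with a toy sketch. The record layout (a `title` and a set of `sources` per merged document) and case-insensitive substring matching are assumptions; the text does not say how keywords are matched to documents.

```python
def search(keyword: str, clusters: list[dict]) -> list[dict]:
    """Return each merged document whose title contains the keyword, with its sources."""
    needle = keyword.lower()
    results = []
    for cluster in clusters:
        if needle in cluster["title"].lower():
            # One entry per merged document, listing all of its publication sources.
            results.append(
                {"title": cluster["title"], "sources": sorted(cluster["sources"])}
            )
    return results
```

Because identical documents were merged beforehand, each hit appears once in the result and carries every publication source found for it.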
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the above-described integrated unit implemented in the form of a software functional unit can be stored in a computer readable storage medium.
  • The above software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods of the various embodiments of the present invention.
  • The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Artificial Intelligence (AREA)

Abstract

Disclosed are a document normalization method, a document search method, corresponding apparatuses, a device, and a storage medium. The document normalization method comprises the following steps: acquiring documents from one or more website sources; standardizing the acquired documents; according to the degree of similarity between the titles of the standardized documents, clustering documents with similar titles to obtain a plurality of document sets; calculating the degree of similarity between the documents in each document set and, according to the calculated degree of similarity, selecting the document sets that meet a condition; and clustering the same documents in the selected document sets that meet the condition, and gathering the publication sources of the same documents. The document search method comprises the following steps: receiving a keyword entered by a user; searching, according to the keyword, for documents associated with the keyword; and, in the search result, displaying the same documents in aggregated form and displaying the publication source of each document. Compared with the prior art, the present invention implements normalization of the same documents and provides a basis for improving document search efficiency.
PCT/CN2016/087058 2015-12-07 2016-06-24 Procédé de normalisation de document, procédé de recherche de document, appareils correspondants, dispositif et support de stockage WO2017096777A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510888584.5A CN105447169B (zh) 2015-12-07 2015-12-07 文献归一方法、文献搜索方法及对应装置
CN201510888584.5 2015-12-07

Publications (1)

Publication Number Publication Date
WO2017096777A1 true WO2017096777A1 (fr) 2017-06-15

Family

ID=55557345

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/087058 WO2017096777A1 (fr) 2015-12-07 2016-06-24 Procédé de normalisation de document, procédé de recherche de document, appareils correspondants, dispositif et support de stockage

Country Status (2)

Country Link
CN (1) CN105447169B (fr)
WO (1) WO2017096777A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595713A (zh) * 2018-05-14 2018-09-28 中国科学院计算机网络信息中心 确定对象集合的方法和装置
CN112365374A (zh) * 2020-06-19 2021-02-12 支付宝(杭州)信息技术有限公司 标准案由确定方法、装置和设备
CN112434134A (zh) * 2020-12-04 2021-03-02 中国科学院深圳先进技术研究院 搜索模型训练方法、装置、终端设备及存储介质

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447169B (zh) * 2015-12-07 2019-02-12 百度在线网络技术(北京)有限公司 文献归一方法、文献搜索方法及对应装置
CN106708934A (zh) * 2016-11-16 2017-05-24 百度在线网络技术(北京)有限公司 基于人工智能的学术文献搜索方法和装置
CN108132941B (zh) * 2016-11-30 2021-03-26 北京国双科技有限公司 法律文献的关联关系的处理方法和装置
CN107665443B (zh) * 2017-05-10 2019-10-25 平安科技(深圳)有限公司 获取目标用户的方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350032A (zh) * 2008-09-23 2009-01-21 胡辉 判断网页内容是否相同的方法
CN101807211A (zh) * 2010-04-30 2010-08-18 南开大学 一种面向海量小规模xml文档融合路径约束的xml检索方法
CN101976259A (zh) * 2010-11-03 2011-02-16 百度在线网络技术(北京)有限公司 一种推荐系列文档的方法和装置
CN102654879A (zh) * 2011-03-04 2012-09-05 中兴通讯股份有限公司 搜索方法及装置
CN105447169A (zh) * 2015-12-07 2016-03-30 百度在线网络技术(北京)有限公司 文献归一方法、文献搜索方法及对应装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090094210A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Intelligently sorted search results
CN102012917B (zh) * 2010-11-26 2013-02-20 百度在线网络技术(北京)有限公司 信息处理装置以及处理方法
CN103164449B (zh) * 2011-12-15 2016-04-13 腾讯科技(深圳)有限公司 一种搜索结果的展现方法与装置
CN103514282A (zh) * 2013-09-29 2014-01-15 北京奇虎科技有限公司 一种视频搜索结果展示方法及装置
WO2015070025A1 (fr) * 2013-11-08 2015-05-14 Ubc Late Stage, Inc. Systèmes et procédés d'analyse et de traitement de documents

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595713A (zh) * 2018-05-14 2018-09-28 中国科学院计算机网络信息中心 确定对象集合的方法和装置
CN108595713B (zh) * 2018-05-14 2020-09-29 中国科学院计算机网络信息中心 确定对象集合的方法和装置
CN112365374A (zh) * 2020-06-19 2021-02-12 支付宝(杭州)信息技术有限公司 标准案由确定方法、装置和设备
CN112434134A (zh) * 2020-12-04 2021-03-02 中国科学院深圳先进技术研究院 搜索模型训练方法、装置、终端设备及存储介质
WO2022116324A1 (fr) * 2020-12-04 2022-06-09 中国科学院深圳先进技术研究院 Procédé de formation de modèle de recherche, appareil, dispositif terminal et support de stockage
CN112434134B (zh) * 2020-12-04 2023-10-20 中国科学院深圳先进技术研究院 搜索模型训练方法、装置、终端设备及存储介质

Also Published As

Publication number Publication date
CN105447169B (zh) 2019-02-12
CN105447169A (zh) 2016-03-30

Similar Documents

Publication Publication Date Title
WO2017096777A1 (fr) Procédé de normalisation de document, procédé de recherche de document, appareils correspondants, dispositif et support de stockage
WO2019091026A1 (fr) Procédé de recherche rapide de document dans une base de connaissances, serveur d'application, et support d'informations lisible par ordinateur
US9323794B2 (en) Method and system for high performance pattern indexing
US10423648B2 (en) Method, system, and computer readable medium for interest tag recommendation
Pereira et al. Using web information for author name disambiguation
EP2092419B1 (en) Method and system for high-performance data metatagging and data indexing using coprocessors
KR101715432B1 (en) Word pair acquisition device, word pair acquisition method, and recording medium
US20170322930A1 (en) Document based query and information retrieval systems and methods
WO2017020451A1 (en) Information push method and device
WO2015149533A1 (en) Method and device for word segmentation processing based on web page content classification
WO2011057497A1 (en) Method and device for extracting vocabulary words and evaluating their quality
WO2020248379A1 (en) Method and apparatus for searching for similar web pages
WO2017113592A1 (en) Model generation method, word weighting method, apparatus, device, and computer storage medium
WO2012159558A1 (en) Method, device, and system for natural language processing based on semantic recognition
Zhao et al. A novel burst-based text representation model for scalable event detection
WO2022116324A1 (en) Search model training method, apparatus, terminal device, and storage medium
US20140181097A1 (en) Providing organized content
US20100063966A1 (en) Method for fast de-duplication of a set of documents or a set of data contained in a file
JP2013222418A (en) Passage segmentation method, device, and program
US10380195B1 (en) Grouping documents by content similarity
CN113157857B (en) News-oriented hot topic detection method, apparatus, and device
Setty Distributed and dynamic clustering for news events
Nguena et al. Fast semantic duplicate detection techniques in databases
Ganguly et al. Competing algorithm detection from research papers
US11593439B1 (en) Identifying similar documents in a file repository using unique document signatures

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16871981; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16871981; Country of ref document: EP; Kind code of ref document: A1)