CN112084776A - Similar article detection method, device, server and computer storage medium - Google Patents


Info

Publication number
CN112084776A
Authority
CN
China
Prior art keywords
target
articles
publishing platform
title
article
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010967932.9A
Other languages
Chinese (zh)
Other versions
CN112084776B (en)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010967932.9A
Publication of CN112084776A
Application granted
Publication of CN112084776B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method, a device, a server and a computer storage medium for detecting similar articles. The method comprises: for each first publishing platform, identifying, according to a target article published by the first publishing platform, a second publishing platform that has a reprint relationship with the first publishing platform, where the target article is an article that the first publishing platform reprinted from the corresponding second publishing platform; for every two target articles reprinted from the same second publishing platform, detecting whether the title similarity of their title texts is greater than a first threshold; and, if the title similarity is greater than the first threshold, determining that the two detected target articles are similar. Having identified the reprint relationships between publishing platforms, the scheme detects similar reprinted articles by calculating the title similarity of articles reprinted from the same publishing platform. Because a title text contains far fewer characters than the full text, similar articles can be detected more quickly.

Description

Similar article detection method, device, server and computer storage medium
Technical Field
The present invention relates to the field of text detection technologies, and in particular, to a method, an apparatus, a server, and a computer storage medium for detecting similar articles.
Background
With the development of internet technology, various publishing platforms have appeared on the network, each of which can publish many articles on its pages. To manage these articles uniformly, for example to determine whether one article plagiarizes another or to count the popularity of articles on a certain topic, it is often necessary to detect similar articles among the published articles.
At present, similar articles are generally detected as follows: for two articles to be compared, the similarity between the full text of one article (the text composed of its title and body) and the full text of the other is calculated, and the two articles are considered similar if the similarity is greater than a threshold.
Because the full text of an article contains a large number of words, this existing detection method, which works on the full text of both articles, is slow.
Disclosure of Invention
Based on the above shortcomings of the prior art, the present application provides a method, an apparatus, a server and a computer storage medium for detecting similar articles, so as to provide an efficient similar article detection scheme.
A first aspect of the present application provides a method for detecting a similar article, including:
for each first publishing platform, identifying, according to a target article published by the first publishing platform, a second publishing platform that has a reprint relationship with the first publishing platform; the target article refers to an article that the first publishing platform reprinted from the corresponding second publishing platform;
for every two target articles reprinted from the same second publishing platform, detecting whether the title similarity of the title texts of the two target articles is greater than a first threshold;
and if the title similarity of the title texts of the two target articles is greater than the first threshold, determining that the two target articles are similar.
Optionally, the identifying, according to the target article published by the first publishing platform, to obtain a second publishing platform having a reprinting relationship with the first publishing platform includes:
identifying and obtaining a reprint text for indicating a reprint behavior from a target article published by the first publishing platform;
and determining the publishing platform corresponding to the platform name quoted by the reprinting text as a second publishing platform having a reprinting relation with the first publishing platform.
Optionally, the detecting whether the title similarity of the title texts of the two pieces of target articles is greater than a first threshold includes:
extracting the title text of each of the two target articles;
converting the title text of each of the two target articles into a corresponding title vector;
and calculating the title similarity of the two target articles according to their title vectors.
Optionally, the extracting the title text of each of the two target articles includes:
for each of the two target articles, extracting the overall title and each subtitle of the target article, and combining the overall title and the subtitles into the title text of the target article.
Optionally, before determining that the two pieces of target articles are similar, the method further includes:
detecting whether the body similarity of the body texts of the two target articles is greater than a second threshold;
and if the body similarity of the body texts of the two target articles is greater than the second threshold, executing the determination that the two target articles are similar.
Optionally, the method further includes:
counting the number of articles similar to the target article that all publishing platforms published within a preset time period, to obtain the popularity of the target article.
A second aspect of the present application provides a device for detecting similar articles, including:
an identification unit, configured to identify, for each first publishing platform and according to a target article published by the first publishing platform, a second publishing platform that has a reprint relationship with the first publishing platform; the target article refers to an article that the first publishing platform reprinted from the corresponding second publishing platform;
a detection unit, configured to detect, for every two target articles reprinted from the same second publishing platform, whether the title similarity of the title texts of the two target articles is greater than a first threshold;
and a determining unit, configured to determine that the two target articles are similar if the title similarity of the title texts of the two target articles is greater than the first threshold.
Optionally, when the identifying unit identifies, according to the target article published by the first publishing platform, a second publishing platform having a reprinting relationship with the first publishing platform, the identifying unit is specifically configured to:
identifying and obtaining a reprint text for indicating a reprint behavior from a target article published by the first publishing platform;
and determining the publishing platform corresponding to the platform name quoted by the reprinting text as a second publishing platform having a reprinting relation with the first publishing platform.
A third aspect of the present application provides a computer storage medium for storing a computer program, which, when executed, is specifically configured to implement the method for detecting similar articles provided in any one of the first aspects of the present application.
A fourth aspect of the present application provides a server comprising a memory and a processor;
wherein the memory is for storing a computer program;
the processor is configured to execute the computer program, and in particular, is configured to implement the method for detecting similar articles provided in any one of the first aspects of the present application.
The application relates to a method, a device, a server and a computer storage medium for detecting similar articles. The method comprises: for each first publishing platform, identifying, according to a target article published by the first publishing platform, a second publishing platform that has a reprint relationship with the first publishing platform, where the target article is an article that the first publishing platform reprinted from the corresponding second publishing platform; for every two target articles reprinted from the same second publishing platform, detecting whether the title similarity of their title texts is greater than a first threshold; and, if the title similarity is greater than the first threshold, determining that the two detected target articles are similar. Having identified the reprint relationships between publishing platforms, the scheme detects similar reprinted articles by calculating the title similarity of articles reprinted from the same publishing platform. Because a title text contains far fewer characters than the full text, similar articles can be detected more quickly.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for detecting similar articles according to an embodiment of the present application;
Fig. 2 is a schematic diagram illustrating a second publishing platform identified according to a target article of a first publishing platform according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a reprint directed graph representing a reprint relationship according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a word vector model according to an embodiment of the present application;
Fig. 5 is a flowchart of a method for detecting similar articles according to another embodiment of the present application;
Fig. 6 is a schematic structural diagram of a detection apparatus for similar articles according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present application provides a method for detecting similar articles among the articles published by multiple publishing platforms. Referring to Fig. 1, the method provided in this embodiment may include the following steps.
First, a brief description of the publishing platform referred to in this application. At present, most social networks, such as WeChat and microblog, allow individuals or teams to register a publishing platform. A publishing platform can be understood as a type of account in a social network dedicated to publishing articles. After registering a publishing platform, its operator (i.e., the individual or team that registered it) can acquire articles in various ways and publish them on the page that the social network provides for the publishing platform; other users can browse the articles published by any publishing platform by visiting its page.
The ways in which an operator obtains articles include, but are not limited to, writing articles itself and cooperating with contributors to obtain the articles they write; articles obtained in these two ways are called the original articles of the publishing platform. In addition, the operator of a publishing platform can also reprint articles published by other publishing platforms.
Reprinting an article published by another publishing platform means that, after the publishing platform B publishes an article B1, the operator of the publishing platform A copies the article B1 and then either partially modifies it (including but not limited to modifying the original title, changing the typesetting, and replacing, adding, or deleting some of the text) to obtain the article A1 to be published by the publishing platform A, or leaves it unmodified and directly uses the article B1 as the article A1 to be published. The article A1 obtained in this way is an article that the publishing platform A has reprinted from the publishing platform B.
In this application, a first publishing platform refers to a publishing platform that publishes articles reprinted from other publishing platforms; a target article published by the first publishing platform refers to an article that the first publishing platform published after reprinting it from another publishing platform; and, for a target article published by the first publishing platform, the corresponding second publishing platform refers to the source from which the first publishing platform reprinted that target article.
In the foregoing example, the publishing platform A is the first publishing platform, the publishing platform B is the second publishing platform, and the article A1 published by the publishing platform A is the target article.
It will be appreciated that some of the articles published by a publishing platform may be original and others reprinted, so one publishing platform may be a first publishing platform with respect to some articles and a second publishing platform with respect to others.
The detection method provided by any embodiment of the present application may be executed by a server or a server cluster for managing a social network.
S101, for each first publishing platform, identifying, according to a target article published by the first publishing platform, a second publishing platform that has a reprint relationship with the first publishing platform.
When a first publishing platform publishes a target article reprinted from a second publishing platform, the source of the article is typically noted at the end of the target article. For example, if the first publishing platform reprints a target article from the second publishing platform, it may append text such as "Reprinted from: XXX" or "Source: XXX", where "XXX" is the platform name of the second publishing platform.
Referring to Fig. 2, based on the text that the first publishing platform adds to annotate the source, the corresponding second publishing platform can be identified from a given target article as follows:
The reprint text indicating the reprint behavior is first identified at the end of the target article.
Reprint text includes, but is not limited to, "Reprinted from:", "Source:" and similar text indicating that the current article was reprinted from another publishing platform. In Fig. 2, for example, the first publishing platform "YY journal" appends the text "Reprinted from: science popularization little knowledge" to a target article it publishes. A list of reprint texts may therefore be established in advance, recording the texts that publishing platforms commonly use to indicate reprint behavior, including but not limited to those in the above example. Then, for any article published by any publishing platform, the end of the article is checked for text from the list. If such text is present, the article can be determined to be a target article reprinted from another publishing platform, and the publishing platform that published it can be taken as a first publishing platform.
After the reprint text is identified, the platform name it cites can be extracted: the text that starts at the first character after the reprint text and ends at the nearest symbol marking the end of the text (a line feed, period, space, etc.) is taken as the platform name. In Fig. 2, the text between the reprint text "Reprinted from:" and the line feed, i.e. up to the end of the current line, is "science popularization little knowledge", so "science popularization little knowledge" is the extracted platform name.
After the platform name is obtained, the extracted platform name can be used as a keyword, whether a publishing platform with the platform name consistent with the extracted platform name exists in all publishing platforms registered in the current social network site is searched, and if the publishing platform with the corresponding platform name consistent with the platform name quoted by the reprint text extracted from the target article is searched, the searched publishing platform is determined as a second publishing platform of the target article.
With reference to the example of Fig. 2, after the platform name "science popularization little knowledge" is extracted from the target article, if a publishing platform named "science popularization little knowledge" is found in the social network, that publishing platform is determined to be the second publishing platform of the target article. In other words, the target article published by "YY journal" is determined to have been reprinted from the second publishing platform "science popularization little knowledge".
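The identification steps above (find a reprint text near the end of the article, read the platform name up to the nearest terminating symbol, then look the name up among the registered platforms) can be sketched as follows. This is a minimal illustration, not the patented implementation: the marker list, the `tail_chars` window, the function names, and the platform name `SciTips` are all hypothetical. Because a space is treated as a terminator, as the description suggests, the sketch only handles platform names that contain no spaces.

```python
# Hypothetical list of reprint texts; the description only says such a
# list is established in advance.
REPRINT_MARKERS = ["Reprinted from:", "Source:"]


def extract_source_platform(article_text, tail_chars=200, terminators="\n.。 "):
    """Return the platform name cited by a reprint text near the end of
    the article, or None if no reprint text is found."""
    tail = article_text[-tail_chars:]  # only inspect the end of the article
    for marker in REPRINT_MARKERS:
        pos = tail.rfind(marker)
        if pos == -1:
            continue
        # Skip the spaces that directly follow the marker.
        rest = tail[pos + len(marker):].lstrip(" ")
        # The name ends at the nearest terminating symbol.
        end = len(rest)
        for ch in terminators:
            idx = rest.find(ch)
            if idx != -1:
                end = min(end, idx)
        name = rest[:end]
        if name:
            return name
    return None


def find_second_platform(article_text, registered_platforms):
    """Return the cited platform name if it matches a registered platform."""
    name = extract_source_platform(article_text)
    return name if name in registered_platforms else None


article = "Body of the article.\nReprinted from: SciTips\n"
print(find_second_platform(article, {"SciTips", "YYJournal"}))
```

An article for which `extract_source_platform` returns `None` would be treated as not reprinted under this sketch.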
In conjunction with the above identification process, the execution process of step S101 may be:
For each article published by each publishing platform currently registered in the social network, it is determined whether the article was reprinted from another publishing platform. If so, the publishing platform that published the article is determined to be a first publishing platform, the article is determined to be a target article, and the publishing platform corresponding to the platform name cited by the reprint text of the target article is determined to be the second publishing platform having a reprint relationship with the first publishing platform. In this way, the reprint relationships between the publishing platforms currently registered in the social network are determined.
Optionally, step S101 need not examine every publishing platform registered in the social network; it may examine only a number of pre-specified publishing platforms.
The identified reprint relationships between publishing platforms can be recorded using a reprint directed graph as shown in Fig. 3. For a target article A1 published by any first publishing platform A, if A1 is identified as reprinted from the second publishing platform D, a connection can be established between the node corresponding to the article A1 and the node of the second publishing platform D, directed from the article node to the platform node; that is, the line between the two nodes is drawn as an arrow.
Similarly, if the target article B1 published by the first publishing platform B and the target article C1 published by the first publishing platform C are both identified as reprinted from the second publishing platform D, corresponding nodes can be established in the same way: an arrow is drawn from the node of the target article B1 to the node of the second publishing platform D, and another from the node of the target article C1 to the node of the second publishing platform D, giving the reprint directed graph shown in Fig. 3.
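One possible in-memory representation of such a reprint directed graph is sketched below. The class name `ReprintGraph` and its methods are hypothetical; the description only requires that each target-article node point to the node of the second publishing platform it was reprinted from.

```python
from collections import defaultdict


class ReprintGraph:
    """Directed graph: (first platform, article) node -> second platform node."""

    def __init__(self):
        # second platform -> set of (first platform, article id) nodes pointing at it
        self.incoming = defaultdict(set)
        # (first platform, article id) -> second platform it was reprinted from
        self.source_of = {}

    def add_edge(self, first_platform, article_id, second_platform):
        """Record that `article_id`, published by `first_platform`,
        was reprinted from `second_platform`."""
        node = (first_platform, article_id)
        self.source_of[node] = second_platform
        self.incoming[second_platform].add(node)

    def reprints_from(self, second_platform):
        """All target-article nodes whose arrow points at `second_platform`."""
        return sorted(self.incoming.get(second_platform, ()))


# The example of Fig. 3: A1, B1 and C1 are all reprinted from platform D.
graph = ReprintGraph()
graph.add_edge("A", "A1", "D")
graph.add_edge("B", "B1", "D")
graph.add_edge("C", "C1", "D")
print(graph.reprints_from("D"))
```

Keeping both directions of the mapping makes it cheap to go from an article to its source platform and from a source platform to every article reprinted from it.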
S102, for every two target articles reprinted from the same second publishing platform, detecting whether the title similarity of their title texts is greater than a first threshold.
If the detected title similarity of the two target articles is greater than the first threshold, step S103 is executed; otherwise, if it is less than or equal to the first threshold, step S104 is executed.
The title similarity of two target articles can be calculated as follows:
First, the title texts of the two detected target articles are extracted. Then the title text of each target article is converted into a corresponding title vector using a word vector model. Finally, the cosine similarity of the two title vectors is calculated, and the result is used as the title similarity of the two detected target articles.
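As an illustration of the conversion step, the sketch below turns a title text into a title vector by averaging the word vectors of its tokens. Averaging is an assumption made here for illustration; this passage does not state how the word vector model combines word vectors into a title vector. The whitespace tokenization, the toy two-dimensional `word_vectors` table, and the function name are likewise hypothetical (Chinese titles would need a word segmenter, and the description mentions a dimension of about 200).

```python
def title_to_vector(title, word_vectors, dim):
    """Average the vectors of the known words in `title`.

    Assumption: averaging is one common way to pool word vectors; it is
    not confirmed as the method used by the word vector model described.
    """
    tokens = title.lower().split()  # naive whitespace tokenization
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy embedding table standing in for a trained word vector model.
toy_vectors = {"solar": [1.0, 0.0], "eclipse": [0.0, 1.0]}
print(title_to_vector("Solar eclipse tonight", toy_vectors, dim=2))
```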
The cosine similarity of the two title vectors is calculated as follows:
Denote the title vector of one detected target article as X = (x1, x2, …, xn) and the title vector of the other as Y = (y1, y2, …, yn), where n is the dimension of the title vector; n is determined by the constructed word vector model and can generally be set to 200. The cosine similarity between the title vector X and the title vector Y, denoted Cos(X, Y), is then calculated as:
$$\mathrm{Cos}(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
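The cosine similarity of two title vectors, i.e. their dot product divided by the product of their Euclidean norms, can be computed with a few lines of plain Python. The function name and the choice to return 0.0 for a zero vector are assumptions of this sketch, not part of the description.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    if len(x) != len(y):
        raise ValueError("title vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_x * norm_y)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```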
Optionally, when step S102 is executed, the overall title of the detected target article, that is, the title placed at the beginning of the target article, may be used directly as the title text of the target article.
Optionally, when the overall titles of both detected target articles are short and both articles contain multiple subtitles, each subtitle may be extracted from each detected target article. The overall title and the subtitles of a target article are then combined in the order in which they appear in the target article, and the combined text is used as the title text of that target article.
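The two options for building the title text can be sketched as a single helper. The function name, the `min_title_len` cutoff for what counts as a "short" overall title, and the space separator are hypothetical; the description gives no concrete values. For simplicity the sketch decides per article, whereas the description applies the combined form when both compared articles have short overall titles and multiple subtitles.

```python
def build_title_text(overall_title, subtitles, min_title_len=10):
    """Return the title text used for similarity detection.

    A long enough overall title is used on its own; a short one is
    combined with the subtitles in their order of appearance.
    """
    if len(overall_title) >= min_title_len or not subtitles:
        return overall_title
    return " ".join([overall_title, *subtitles])


print(build_title_text("Eclipse", ["When to watch", "Where to watch"]))
```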
This is explained below with a specific example.
For the publishing platforms that need to be examined (all publishing platforms currently registered on the social network, or a number of pre-specified publishing platforms), once the reprint relationships between them have been identified through step S101, consider a first publishing platform A that has published multiple articles reprinted from other publishing platforms and another such first publishing platform B. The reprinted articles published by these two first publishing platforms, that is, the aforementioned target articles, need to be examined.
First, the target articles published by the first publishing platform A and those published by the first publishing platform B can be compared pairwise to find target articles reprinted from the same second publishing platform.
In other words, for each target article H published by the first publishing platform A, the target articles published by the first publishing platform B are checked one by one for a target article reprinted from the same second publishing platform as H. If a target article L published by the first publishing platform B is found to originate from the same second publishing platform as the target article H, the target articles H and L are determined to be two target articles whose title similarity needs to be detected in step S102.
If the reprint relationships are recorded as a reprint directed graph, the search may proceed as follows: starting from the node of the target article H published by the first publishing platform A, follow its connection (i.e., the arrow) in the reprint directed graph to find the second publishing platform D from which H was reprinted; then traverse the article nodes connected to the node of the second publishing platform D and determine whether any of them is an article published by the first publishing platform B. If so, that article, which is published by the first publishing platform B and connected to the node of the second publishing platform D, is determined to be a target article L originating from the same second publishing platform D as the target article H.
After each group of target articles to be examined has been found (i.e., each pair of target articles reprinted from the same second publishing platform), whether the title similarity of the two target articles in each group is greater than the first threshold can be detected using the aforementioned method.
The first threshold value in step S102 may be set according to actual conditions, for example, the first threshold value may be set to 0.7.
It should be noted that the calculated title similarity is a real number between -1 and 1. In the method provided by the present application, the closer the title similarity of two detected target articles reprinted from the same second publishing platform is to 1, the more likely the two articles are similar, that is, the more likely it is that the two first publishing platforms reprinted them from the same article published by that second publishing platform. Continuing the foregoing example, suppose the target article H was obtained by the first publishing platform A reprinting the article M published by the second publishing platform, and the target article L was obtained by the first publishing platform B reprinting the article N published by the second publishing platform. If the title similarity of the target article H and the target article L is greater than the first threshold, the article M and the article N can be considered to be the same article, with the target articles H and L both reprinted from it.
It should be further noted that the two target articles detected in step S102 may be limited to target articles published by two different first publishing platforms; that is, if one of the two target articles is published by one first publishing platform, the other must be published by a different first publishing platform. Alternatively, no such limit need apply: for any two target articles reprinted from the same second publishing platform, step S102 is performed to detect their title similarity, regardless of whether they were published by the same first publishing platform or by two different first publishing platforms.
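The pairing and threshold check of step S102 can be sketched as follows: target articles are grouped by the second publishing platform they were reprinted from, and every pair within a group is compared. The dictionary keys, the function names, and the `require_different_platforms` flag (covering the two variants just described) are hypothetical. The `word_overlap` function is only a toy stand-in for the title-vector cosine similarity so that the example runs without a trained word vector model.

```python
from collections import defaultdict
from itertools import combinations


def word_overlap(title_a, title_b):
    """Toy stand-in similarity: Jaccard overlap of the words in two titles."""
    a, b = set(title_a.lower().split()), set(title_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def find_similar_pairs(target_articles, similarity, first_threshold=0.7,
                       require_different_platforms=True):
    """Return (id, id) pairs of target articles judged similar by title.

    Each article is a dict with keys 'id', 'first_platform',
    'second_platform' and 'title'.
    """
    by_source = defaultdict(list)
    for article in target_articles:
        by_source[article["second_platform"]].append(article)

    similar = []
    for group in by_source.values():
        # Only articles reprinted from the same second platform are compared.
        for a, b in combinations(group, 2):
            if require_different_platforms and a["first_platform"] == b["first_platform"]:
                continue
            if similarity(a["title"], b["title"]) > first_threshold:
                similar.append((a["id"], b["id"]))
    return similar


articles = [
    {"id": "H", "first_platform": "A", "second_platform": "D",
     "title": "Total solar eclipse visible tonight"},
    {"id": "L", "first_platform": "B", "second_platform": "D",
     "title": "Total solar eclipse visible tonight worldwide"},
    {"id": "K", "first_platform": "C", "second_platform": "D",
     "title": "Ten tips for growing tomatoes"},
    {"id": "P", "first_platform": "B", "second_platform": "E",
     "title": "Total solar eclipse visible tonight"},
]
print(find_similar_pairs(articles, word_overlap))
```

Grouping first means article P is never compared with H or L, even though the titles match, because it was reprinted from a different second publishing platform.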
S103, determining that the two detected target articles are similar.
It should be noted that, if it is determined that the target article H is similar to the target article L after the detection and that the target article L is similar to the target article K, the target article H and the target article K may be directly determined to be similar to each other without using the method described in step S102 to detect the target article H and the target article K.
S104, determining that the two detected target articles are not similar.
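The shortcut noted for step S103, where H similar to L and L similar to K lets H and K be treated as similar without a further comparison, amounts to grouping articles by transitive closure. One common way to do this is a union-find structure. The sketch below is an assumption about how such grouping could be implemented, not a procedure stated in the application, and the function name is hypothetical.

```python
def group_similar(article_ids, similar_pairs):
    """Group article ids so that similarity is treated as transitive."""
    parent = {a: a for a in article_ids}

    def find(x):
        # Follow parent links to the group representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every pair already detected as similar.
    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for a in article_ids:
        groups.setdefault(find(a), set()).add(a)
    return list(groups.values())
```

Under this grouping, only pairs that are not already in the same group need to go through the title comparison of step S102.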
Optionally, in the method provided in this embodiment of the present application, in addition to detecting whether the target articles reprinted from the same second publishing platform and published by every two first publishing platforms are similar, it may further detect whether the target articles published by the first publishing platform are similar to each article published by the corresponding second publishing platform, so as to find the article reprinted by the first publishing platform from among the plurality of articles published by the second publishing platform.
Specifically, after it is identified through the reprint directed graph that the first publishing platform A has published a target article H reprinted from the second publishing platform D, the title similarity between the target article H and each article published by the second publishing platform D may be detected one by one. Any article published by the second publishing platform D whose title similarity with the target article H is higher than the first threshold is screened out and determined to be the source article corresponding to the target article H. That is, the first publishing platform A obtained the target article H by copying the source article published by the second publishing platform D and then polishing and modifying it as appropriate.
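The source-article screening described above can be sketched as a filter over the titles published by the second publishing platform. The character-level ratio from Python's `difflib` is only a stand-in similarity used so that the example runs on its own; it is not the title-vector method of the application, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def char_ratio(a, b):
    """Stand-in title similarity in [0, 1]; not the application's method."""
    return SequenceMatcher(None, a, b).ratio()


def find_source_articles(target_title, source_titles, first_threshold,
                         similarity=char_ratio):
    """Return titles from the second platform that exceed the first threshold.

    Results are ordered from most to least similar to the target title.
    """
    scored = [(similarity(title, target_title), title) for title in source_titles]
    return [title for score, title in sorted(scored, reverse=True)
            if score > first_threshold]
```

Any other similarity function with the same two-argument shape can be passed in through the `similarity` parameter.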
In any embodiment of the present application, identifying and obtaining the platform name of the corresponding second publishing platform from the target article, and detecting the title similarity between each two pieces of target articles may be implemented by using a corresponding existing method in the Natural Language Processing (NLP) technical field.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Specifically, identifying the platform name of the second publishing platform from the target article in step S101 of the foregoing embodiment may be implemented by using text recognition and text matching (also called text comparison), and detecting the title similarity between two target articles in step S102 may be implemented by using a word vector model from the field of natural language processing.
Optionally, in step S102, besides converting the title texts of the two target articles into corresponding title vectors and using the cosine similarity of the two title vectors as the title similarity, the title similarity of the two target articles may also be calculated by either of the following methods:
The first alternative method is to calculate, for each detected target article, a hash fingerprint (also called a simhash fingerprint) of the title text of the target article.
The basic principle is as follows: the hash value of each character or word of the title text is calculated by using an existing hash algorithm; each hash value is then converted into a corresponding numerical sequence according to the preset weight of that character or word; finally, the numerical sequences of all characters or words of the title text are accumulated and the accumulated result is converted into a binary number. This binary result is the hash fingerprint of the title text.
On this basis, the Hamming distance between the hash fingerprints of the title texts of the two detected target articles can be calculated, that is, the number of positions at which the two hash fingerprints have different characters is counted. For example, if one hash fingerprint is 1000 and the other is 0000, the last three characters of the two fingerprints are the same (all 0) and only the first character differs, so the Hamming distance is 1.
The Hamming distance is inversely related to the similarity of the two title texts: the greater the Hamming distance, the lower the similarity. Therefore, the reciprocal of the Hamming distance of the two hash fingerprints, 1/Ha, can be used as the title similarity of the two detected target articles, where Ha denotes the Hamming distance between the hash fingerprints of the title texts of the two target articles.
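The first alternative method can be sketched as follows, under several assumptions that the application does not fix: MD5 is used as the "existing hash algorithm", fingerprints are 64 bits long, the weighted numerical sequence uses +weight for a 1 bit and -weight for a 0 bit (the usual simhash convention), and a Hamming distance of 0, for which 1/Ha is undefined, is mapped to infinity. All function names are hypothetical.

```python
import hashlib


def simhash(words, weights=None, bits=64):
    """Compute a simhash-style fingerprint for a list of title words."""
    totals = [0.0] * bits
    for word in words:
        weight = 1.0 if weights is None else weights.get(word, 1.0)
        digest = hashlib.md5(word.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        # Weighted sequence: +weight where the hash bit is 1, -weight where 0.
        for i in range(bits):
            totals[i] += weight if (value >> i) & 1 else -weight
    # Binarize the accumulated sequence: positive totals become 1 bits.
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


def hamming_distance(fp_a, fp_b):
    """Count the bit positions at which two fingerprints differ."""
    return bin(fp_a ^ fp_b).count("1")


def title_similarity(fp_a, fp_b):
    """Reciprocal of the Hamming distance (1/Ha); infinity when Ha is 0."""
    distance = hamming_distance(fp_a, fp_b)
    return float("inf") if distance == 0 else 1.0 / distance
```

With the fingerprints 1000 and 0000 from the example above, `hamming_distance(0b1000, 0b0000)` is 1 and the resulting similarity is 1.0. How identical fingerprints should be scored in practice is a design choice left open here.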
The second optional method is to extract keywords from the title texts of the two detected target articles by using a semantic recognition method, obtaining a keyword sequence for the title text of each target article, and then to compute the matching degree of the two keyword sequences, that is, the ratio of the number of keywords shared by the two keyword sequences to the total number of keywords in the keyword sequences. The larger the ratio, the closer the two title texts are, that is, the higher the title similarity of the two target articles, so the ratio can be used directly as the title similarity of the two detected target articles.
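The second optional method can be illustrated with a short sketch. Two points are assumptions: keyword extraction is replaced here by a simple stop-word filter rather than a semantic recognition method, and the "total number of keywords" is read as the number of distinct keywords across both titles (a Jaccard-style ratio), since the text does not specify the denominator precisely. The stop-word list and function names are hypothetical.

```python
# Hypothetical stop-word list standing in for semantic keyword extraction.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "for"}


def extract_keywords(title):
    """Return the lower-cased words of a title that are not stop words."""
    return [word for word in title.lower().split() if word not in STOP_WORDS]


def keyword_match_ratio(title_a, title_b):
    """Shared keywords divided by distinct keywords across both titles."""
    keywords_a = set(extract_keywords(title_a))
    keywords_b = set(extract_keywords(title_b))
    total = len(keywords_a | keywords_b)
    return len(keywords_a & keywords_b) / total if total else 0.0
```

For the titles "The launch of a new phone" and "New phone launch in the city", the keyword sets share three of four distinct keywords, so the ratio is 0.75.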
The method for detecting similar articles provided by this embodiment has the following beneficial effects:
In the first aspect, on the basis of identifying the reprint relationship between publishing platforms, this embodiment detects similar articles through the title similarity of the title texts of every two target articles, and does not need to compare the full texts of any two articles. Since the title text of an article contains far fewer characters than its full text, the detection method provided by this embodiment, which calculates the title similarity of every two target articles, is faster than existing methods that detect the similarity of the full texts of two articles.
In the second aspect, the detection method provided in this embodiment determines that two target articles are similar only when both are reprinted from the same second publishing platform and their title texts are similar. Because of the constraint that both articles are reprinted from the same second publishing platform, the detection result, that is, whether the two articles are similar, retains high accuracy even though only the title similarity is detected.
The word vector model is a neural network model commonly used in the field of natural language processing. For each word in a vocabulary set (usually obtained by segmenting a plurality of articles), it determines a unique word vector related to the semantics of that word in the articles (that is, any two different words have different word vectors), so that a computer can analyze texts composed of these words by means of vector operations.
The embodiment of the present application may adopt an existing word vector model of any structure to convert the title text into the corresponding title vector. The following briefly describes a word vector model applicable to the present application and its working principle.
An alternative word vector model is shown in fig. 4, and includes three parts, namely an input layer, a hidden layer and an output layer. In the initial word vector model, the hidden layer and the output layer both include a large number of preset initial parameters, the parameters and the corresponding calculation modes thereof can be understood as neurons of the hidden layer and the output layer, the input layer is pre-configured with a vocabulary code of each vocabulary in the vocabulary set to be processed, and the vocabulary code of each vocabulary is unique, that is, any two different vocabularies have different vocabulary codes.
The working principle of the word vector model shown in fig. 4 is that any word (marked as the current word) is selected as the input of the input layer, then the input layer finds out the word code of the word and transmits it to the hidden layer, the hidden layer uses the current parameters to calculate the current word vector of the word, and transmits the current word vector of the word to the output layer.
The output layer then calculates the word vector model loss for the context words of the current word, based on the word vector of the current word, the vocabulary codes of the other words located before and after the current word in a given sentence (referred to as the context words of the current word), and the currently configured parameters of the output layer, and outputs this loss.
If the word vector model loss of the context vocabulary of the current vocabulary does not meet the preset convergence condition, the parameters of the hidden layer and the output layer need to be updated according to the word vector model loss, and after the updating, the process is repeated until the word vector model loss of the context vocabulary of the current vocabulary output by the output layer meets the convergence condition.
Finally, the above operation is performed for each word in the vocabulary set. Once every word in the vocabulary set has been input into the word vector model and the word vector model loss output for its context words meets the convergence condition, the construction of the word vector model is complete. At this point, the word vector of any word can be obtained simply by inputting its vocabulary code into the hidden layer of the word vector model. Moreover, any two word vectors produced by the same word vector model have the same dimension, which is determined by the structure of the model; for example, each word vector may be a 200-dimensional vector.
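The training loop described above resembles skip-gram training. The toy sketch below illustrates it with assumptions that the application does not specify: a softmax cross-entropy loss over context words, plain gradient descent, and a convergence condition defined as the change in mean loss between passes falling below `tol`. All names and default values are hypothetical.

```python
import math
import random


def train_skipgram(sentences, dim=8, window=1, lr=0.1,
                   max_epochs=500, tol=1e-4, seed=0):
    """Toy skip-gram-style trainer; returns (word vectors, loss history)."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}  # unique vocabulary code
    rng = random.Random(seed)
    size = len(vocab)
    # Hidden-layer and output-layer parameters with random initial values.
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    output = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    # (current word, context word) pairs taken from a window in each sentence.
    pairs = [(index[s[i]], index[s[j]])
             for s in sentences for i in range(len(s))
             for j in range(max(0, i - window), min(len(s), i + window + 1))
             if j != i]
    losses = []
    for _ in range(max_epochs):
        total = 0.0
        for center, context in pairs:
            h = hidden[center][:]  # current word vector from the hidden layer
            scores = [sum(a * b for a, b in zip(h, row)) for row in output]
            top = max(scores)
            exps = [math.exp(score - top) for score in scores]
            norm = sum(exps)
            probs = [e / norm for e in exps]
            total -= math.log(probs[context])  # loss for this context word
            grad_h = [0.0] * dim
            for k in range(size):
                g = probs[k] - (1.0 if k == context else 0.0)
                for d in range(dim):
                    grad_h[d] += g * output[k][d]
                    output[k][d] -= lr * g * h[d]  # update output layer
            for d in range(dim):
                hidden[center][d] -= lr * grad_h[d]  # update hidden layer
        losses.append(total / len(pairs))
        # Assumed convergence condition: mean loss has stopped changing.
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            break
    return {word: hidden[i] for word, i in index.items()}, losses
```

After training, looking up a word in the returned dictionary plays the role of inputting its vocabulary code into the hidden layer. Production systems would normally rely on an established word-embedding implementation rather than a hand-written loop like this one.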
In step S102 of the embodiment of the present application, for each of the two detected target articles, the title text of the article may be segmented into the words that constitute it, each word is converted into its corresponding word vector by using a trained word vector model, and finally the word vectors of all words of the title text are summed; the sum is the title vector of the title text.
For example, suppose the title text of a target article is split into M words, denoted in order as $\mathrm{Word}_1, \mathrm{Word}_2, \ldots, \mathrm{Word}_M$. The word vector model is then used to convert each word into its corresponding word vector, where the word vector of $\mathrm{Word}_1$ is denoted $\mathrm{Vector}_1$, the word vector of $\mathrm{Word}_2$ is denoted $\mathrm{Vector}_2$, and so on. Denoting the title vector of the title text as $\mathrm{VectorS}$, the title vector is calculated according to the following formula:
$$\mathrm{VectorS} = \mathrm{Vector}_1 + \mathrm{Vector}_2 + \cdots + \mathrm{Vector}_M = \sum_{i=1}^{M} \mathrm{Vector}_i$$
Here, two vectors are added by adding their components in each corresponding dimension, and the vector formed by these sums is the sum of the two vectors. For example, to compute $\mathrm{Vector}_1 + \mathrm{Vector}_2$, the first component of $\mathrm{Vector}_1$ is added to the first component of $\mathrm{Vector}_2$ to give the first component of the sum vector, the second component of $\mathrm{Vector}_1$ is added to the second component of $\mathrm{Vector}_2$ to give the second component of the sum vector, and so on.
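The component-wise accumulation of word vectors into a title vector, together with the cosine similarity used as the title similarity in step S102, can be sketched as follows. Skipping words that have no word vector and returning 0.0 for a zero-length vector are assumptions made so that the example is well defined; the function names are hypothetical.

```python
import math


def title_vector(words, word_vectors):
    """Sum the word vectors of a title's words, component by component."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        vector = word_vectors.get(word)
        if vector is None:  # assumption: ignore out-of-vocabulary words
            continue
        total = [t + v for t, v in zip(total, vector)]
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Two titles whose vectors point in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1, which matches the stated value range of the title similarity.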
Referring to fig. 5, an embodiment of the present application further provides a method for detecting a similar article, where the method includes the following steps:
S501, for each first publishing platform, identifying a second publishing platform that has a reprint relationship with the first publishing platform according to a target article published by the first publishing platform.
S502, detecting whether the title similarity of the title texts of every two target articles reprinted from the same second publishing platform is larger than a first threshold value.
The specific implementation of step S501 and step S502 is the same as that of step S101 and step S102 in the foregoing embodiment of fig. 1, and will not be described in detail here.
If the title similarity of the title texts of the two detected target articles is greater than the first threshold, step S503 is performed; otherwise, if it is less than or equal to the first threshold, step S505 is performed.
S503, detecting whether the text similarity of the texts of the two target articles is greater than a second threshold.
The two target articles detected in step S503 are the target articles found in step S502 that are reprinted from the same second publishing platform and whose title similarity is greater than the first threshold.
If the text similarity of the two target articles is greater than the second threshold, the two target articles are determined to be similar, that is, step S504 is performed; otherwise, if the text similarity is less than or equal to the second threshold, step S505 is performed, that is, the two target articles are determined to be not similar.
That is to say, in this embodiment, two target articles reprinted from the same second publishing platform are determined to be similar only when their title similarity is greater than the first threshold and their text similarity is greater than the second threshold; if either the title similarity or the text similarity is not greater than the corresponding threshold, the two target articles are determined to be not similar.
S504, determining that the two detected target articles are similar.
S505, determining that the two detected target articles are not similar.
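The decision flow of steps S502 to S505 can be sketched as a two-stage check in which the text comparison runs only for pairs that pass the title comparison. The similarity functions are passed in as parameters because the application allows several ways of computing them; the dictionary keys `title` and `text` and the function name are hypothetical.

```python
def are_similar(article_a, article_b, title_similarity, text_similarity,
                first_threshold, second_threshold):
    """Two-stage check: title similarity first, then text similarity."""
    # S502: compare titles; a failure goes straight to S505 (not similar).
    if title_similarity(article_a["title"], article_b["title"]) <= first_threshold:
        return False
    # S503: compare texts only for pairs whose titles passed.
    # True corresponds to S504 (similar), False to S505 (not similar).
    return text_similarity(article_a["text"], article_b["text"]) > second_threshold
```

Because the function returns early when the titles fail the first threshold, the more expensive text comparison is skipped for those pairs.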
On the one hand, on the premise that the similarity of the titles of two target articles reprinted from the same second publishing platform is greater than the first threshold value, the method further increases the step of detecting whether the similarity of the texts of the two pieces of target articles is greater than the second threshold value, so that the problem that the two target articles with similar titles but extremely different actual contents are determined as similar articles is avoided, and the accuracy of the detection result is further improved.
On the other hand, the detection in this embodiment first screens out every two target articles reprinted from the same second publishing platform, then further screens out those pairs whose title texts are similar, and finally performs text similarity detection only on the pairs that are reprinted from the same second publishing platform and have similar title texts. This avoids performing text similarity detection on the large number of articles across the whole social network and reduces the amount of computation during detection.
For convenience of understanding the detection method provided in the embodiment of the present application, an application scenario of the detection method provided in the embodiment of the present application is described below:
Suppose that a first publishing platform A reprints an article X published by a second publishing platform B and publishes it as an article Y on its own page (that is, on the first publishing platform A), and that the administrator of the social network needs to count the popularity of the article Y, that is, the number of times that publishing platforms other than the first publishing platform A have reprinted the article X of the second publishing platform B.
The administrator of the social network may then run the method provided in any embodiment of the present application on a server (or a server cluster) used for managing the social network. First, the reprint relationships between all publishing platforms currently registered in the social network are identified, so as to find, among all publishing platforms, each first publishing platform that has reprinted articles of the second publishing platform B. Then, for each such first publishing platform, the target articles it has published that are reprinted from the second publishing platform B are checked one by one for similarity with the article Y. If an article reprinted from the second publishing platform B is similar to the article Y, that article was obtained by reprinting the article X of the second publishing platform B. In other words, each detected article that is reprinted from the second publishing platform B and similar to the article Y indicates one reprint of the article X published by the second publishing platform B. Therefore, the popularity of the article Y can be obtained by counting, with the detection method provided in any embodiment of the present application, the number of articles that are reprinted from the second publishing platform B and are similar to the article Y.
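The popularity count in this scenario can be sketched as below. The dictionary keys, the function name and the `is_similar` callback are hypothetical; the callback stands for whichever similarity test from the embodiments is in use. Excluding articles published by the reference article's own first publishing platform follows the scenario's wording that only reprints by other publishing platforms are counted.

```python
def popularity(reference, candidates, is_similar):
    """Count reprints of the same source article by other first platforms."""
    count = 0
    for article in candidates:
        # Only articles published by a different first publishing platform.
        if article["first_platform"] == reference["first_platform"]:
            continue
        # Only articles reprinted from the same second publishing platform.
        if article["second_platform"] != reference["second_platform"]:
            continue
        if is_similar(article, reference):
            count += 1
    return count
```

Here `reference` plays the role of the article Y, and each counted candidate corresponds to one reprint of the article X.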
Referring to fig. 6, an embodiment of the present application further provides a device for detecting similar articles, which may include the following units:
The identifying unit 601 is configured to identify, for each first publishing platform, a second publishing platform having a reprint relationship with the first publishing platform according to the target article published by the first publishing platform.
The target article refers to an article that the first publishing platform has reprinted from the corresponding second publishing platform.
The detecting unit 602 is configured to detect, for every two target articles reprinted from the same second publishing platform, whether the title similarity of the title texts of the two target articles is greater than a first threshold.
The determining unit 603 is configured to determine that the two target articles are similar if the title similarity of the title texts of the two target articles is greater than the first threshold.
When identifying, according to the target article published by the first publishing platform, a second publishing platform having a reprint relationship with the first publishing platform, the identifying unit 601 is specifically configured for:
identifying and obtaining a reprint text for indicating the reprint behavior from a target article published by a first publishing platform;
and determining the publishing platform corresponding to the platform name quoted by the reprinting text as a second publishing platform having a reprinting relation with the first publishing platform.
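The two operations of the identifying unit can be illustrated with a pattern match followed by a lookup against known platform names. The marker phrases ("Reprinted from", "Source"), the regular expression and the function name are purely hypothetical; the application does not specify what a reprint text looks like, and real articles would need patterns suited to their language and layout.

```python
import re

# Hypothetical reprint markers; real reprint texts may look quite different.
REPRINT_PATTERN = re.compile(
    r"\b(?:reprinted from|source)\b\s*[:：]?\s*(?P<name>[^\n。.;,]+)",
    re.IGNORECASE,
)


def find_second_platform(article_text, known_platforms):
    """Return the known platform name cited by a reprint text, or None."""
    for match in REPRINT_PATTERN.finditer(article_text):
        name = match.group("name").strip()
        # Text matching: accept only names of registered publishing platforms.
        if name in known_platforms:
            return name
    return None
```

Restricting the result to `known_platforms` is what turns a matched phrase into a reprint relationship with an actual second publishing platform.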
When detecting whether the title similarity of the title texts of the two target articles is greater than the first threshold, the detecting unit 602 is specifically configured for:
respectively extracting the title texts of the two target articles;
respectively converting the title texts of the two target articles into corresponding title vectors;
and calculating the title similarity of the two target articles according to the title vectors corresponding to the two target articles.
Optionally, when extracting the title texts of the two target articles, the detecting unit 602 is specifically configured for:
for each of the two target articles, extracting the general title and each subtitle of the target article, and combining the general title and each subtitle into the title text of the target article.
Optionally, the determining unit 603 is further configured to:
detecting whether the text similarity of the texts of the two target articles is greater than a second threshold;
and if the text similarity of the texts of the two target articles is greater than the second threshold, determining that the two target articles are similar.
Optionally, the detection apparatus provided in this embodiment further includes:
The counting unit 604 is configured to count, for any target article, the number of articles similar to the target article that are published by all publishing platforms within a preset time period, so as to obtain the popularity of the target article.
The specific working principle of the detection device provided in the embodiment of the present application may refer to the detection method of the similar article provided in any embodiment of the present application, and details are not repeated here.
The application provides a device for detecting similar articles, in which: the identifying unit 601 identifies, for each first publishing platform, a second publishing platform having a reprint relationship with the first publishing platform according to a target article published by the first publishing platform, where the target article refers to an article that the first publishing platform has reprinted from the corresponding second publishing platform; the detecting unit 602 detects, for every two target articles reprinted from the same second publishing platform, whether the title similarity of the title texts of the two target articles is greater than a first threshold; and the determining unit 603 determines that the two detected target articles are similar if the title similarity is greater than the first threshold. In this scheme, on the basis of identifying the reprint relationship between publishing platforms, reprinted similar articles are detected by calculating the title similarity of the title texts of articles reprinted from the same publishing platform. Because a title text contains far fewer characters than the full text, similar articles can be detected more quickly.
Referring to fig. 7, an embodiment of the present application further provides an electronic device, which includes a memory 701 and a processor 702. The memory 701 is used for storing a computer program, and the processor 702 is used for executing the computer program stored in the memory 701, specifically to perform the method for detecting similar articles provided in any embodiment of the present application.
The embodiment of the present application further provides a computer storage medium, which is used for storing a computer program, and when the stored computer program is executed, the computer storage medium is specifically used for implementing the detection method for the similar articles provided in any embodiment of the present application.
According to an aspect of the present application, the present application embodiment also provides a computer program product or a computer program, which includes computer instructions, which are stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method for detecting similar articles provided in the various alternative implementations of any of the aspects.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
Those skilled in the art can make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for detecting similar articles, comprising:
for each first publishing platform, identifying a second publishing platform that has a reprint relationship with the first publishing platform according to a target article published by the first publishing platform; the target article refers to an article that the first publishing platform has reprinted from the corresponding second publishing platform;
for every two target articles reprinted from the same second publishing platform, detecting whether the title similarity of the title texts of the two target articles is greater than a first threshold;
and if the title similarity of the title texts of the two target articles is greater than the first threshold, determining that the two target articles are similar.
2. The method according to claim 1, wherein the identifying and obtaining a second publication platform having a reprint relationship with the first publication platform according to the target article published by the first publication platform comprises:
identifying and obtaining a reprint text for indicating a reprint behavior from a target article published by the first publishing platform;
and determining the publishing platform corresponding to the platform name quoted by the reprinting text as a second publishing platform having a reprinting relation with the first publishing platform.
3. The method of claim 1, wherein said detecting whether the title similarity of the title texts of the two target articles is greater than a first threshold comprises:
respectively extracting the title texts of the two target articles;
respectively converting the title texts of the two target articles into corresponding title vectors;
and calculating the title similarity of the two target articles according to the title vectors corresponding to the two target articles.
4. The method of claim 3, wherein said respectively extracting the title texts of the two target articles comprises:
for each of the two target articles, extracting the general title and each subtitle of the target article, and combining the general title and each subtitle of the target article into the title text of the target article.
5. The method of claim 1, wherein before said determining that the two target articles are similar, the method further comprises:
detecting whether the text similarity of the texts of the two target articles is greater than a second threshold;
and if the text similarity of the texts of the two target articles is greater than the second threshold, performing said determining that the two target articles are similar.
6. The detection method according to claim 1, further comprising:
counting the number of articles similar to the target article that are published by all publishing platforms within a preset time period, to obtain the popularity of the target article.
7. A device for detecting similar articles, comprising:
an identifying unit, configured to identify, for each first publishing platform, a second publishing platform that has a reprint relationship with the first publishing platform according to a target article published by the first publishing platform; the target article refers to an article that the first publishing platform has reprinted from the corresponding second publishing platform;
a detecting unit, configured to detect, for every two target articles reprinted from the same second publishing platform, whether the title similarity of the title texts of the two target articles is greater than a first threshold;
and a determining unit, configured to determine that the two target articles are similar if the title similarity of the title texts of the two target articles is greater than the first threshold.
8. The apparatus according to claim 7, wherein the identifying unit, when identifying, according to the target article published by the first publishing platform, a second publishing platform having a reprinting relationship with the first publishing platform, is specifically configured to:
identifying and obtaining a reprint text for indicating a reprint behavior from a target article published by the first publishing platform;
and determining the publishing platform corresponding to the platform name quoted by the reprinting text as a second publishing platform having a reprinting relation with the first publishing platform.
9. A computer storage medium for storing a computer program which, when executed, is particularly adapted to implement a method of detecting similar articles as claimed in any one of claims 1 to 6.
10. A server, comprising a memory and a processor;
wherein the memory is for storing a computer program;
the processor is configured to execute the computer program, and in particular to implement the method for detecting similar articles according to any one of claims 1 to 6.
CN202010967932.9A 2020-09-15 2020-09-15 Method, device, server and computer storage medium for detecting similar articles Active CN112084776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010967932.9A CN112084776B (en) 2020-09-15 2020-09-15 Method, device, server and computer storage medium for detecting similar articles

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010967932.9A CN112084776B (en) 2020-09-15 2020-09-15 Method, device, server and computer storage medium for detecting similar articles

Publications (2)

Publication Number Publication Date
CN112084776A CN112084776A (en) 2020-12-15
CN112084776B CN112084776B (en) 2023-11-10

Family

ID=73737114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010967932.9A Active CN112084776B (en) 2020-09-15 2020-09-15 Method, device, server and computer storage medium for detecting similar articles

Country Status (1)

Country Link
CN (1) CN112084776B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529091A (en) * 2020-12-18 2021-03-19 广州视源电子科技股份有限公司 Courseware similarity detection method and device and storage medium
CN113408660A (en) * 2021-07-15 2021-09-17 北京百度网讯科技有限公司 Book clustering method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231641A (en) * 2007-01-22 2008-07-30 北大方正集团有限公司 Method and system for automatic analysis of hotspot subject propagation process in the internet
CN103646029A (en) * 2013-11-04 2014-03-19 北京中搜网络技术股份有限公司 Similarity calculation method for blog articles
CN106547780A (en) * 2015-09-21 2017-03-29 北京国双科技有限公司 Article reprints statistics of variables method and device
CN106776609A (en) * 2015-11-19 2017-05-31 北京国双科技有限公司 Reprint the statistical method and device of quantity in website
CN107609106A (en) * 2017-09-12 2018-01-19 马上消费金融股份有限公司 Similar article searching method, device, equipment and storage medium
CN110134788A (en) * 2019-05-16 2019-08-16 杭州师范大学 A kind of microblogging publication optimization method and system based on text mining
CN110162750A (en) * 2019-01-24 2019-08-23 腾讯科技(深圳)有限公司 Text similarity detection method, electronic equipment and computer readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231641A (en) * 2007-01-22 2008-07-30 北大方正集团有限公司 Method and system for automatic analysis of hotspot subject propagation process in the internet
CN103646029A (en) * 2013-11-04 2014-03-19 北京中搜网络技术股份有限公司 Similarity calculation method for blog articles
CN106547780A (en) * 2015-09-21 2017-03-29 北京国双科技有限公司 Article reprints statistics of variables method and device
CN106776609A (en) * 2015-11-19 2017-05-31 北京国双科技有限公司 Reprint the statistical method and device of quantity in website
CN107609106A (en) * 2017-09-12 2018-01-19 马上消费金融股份有限公司 Similar article searching method, device, equipment and storage medium
CN110162750A (en) * 2019-01-24 2019-08-23 腾讯科技(深圳)有限公司 Text similarity detection method, electronic equipment and computer readable storage medium
CN110134788A (en) * 2019-05-16 2019-08-16 杭州师范大学 A kind of microblogging publication optimization method and system based on text mining

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529091A (en) * 2020-12-18 2021-03-19 广州视源电子科技股份有限公司 Courseware similarity detection method and device and storage medium
CN113408660A (en) * 2021-07-15 2021-09-17 北京百度网讯科技有限公司 Book clustering method, device, equipment and storage medium
CN113408660B (en) * 2021-07-15 2024-05-24 北京百度网讯科技有限公司 Book clustering method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112084776B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN109657054B (en) Abstract generation method, device, server and storage medium
CN104615767B (en) Training method, search processing method and the device of searching order model
CN110298035B (en) Word vector definition method, device, equipment and storage medium based on artificial intelligence
CN110162749A (en) Information extracting method, device, computer equipment and computer readable storage medium
CN111222305A (en) Information structuring method and device
WO2024131111A1 (en) Intelligent writing method and apparatus, device, and nonvolatile readable storage medium
CN110581864B (en) Method and device for detecting SQL injection attack
CN112131881B (en) Information extraction method and device, electronic equipment and storage medium
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN112805715A (en) Identifying entity attribute relationships
CN111898369A (en) Article title generation method, model training method and device and electronic equipment
CN112084776B (en) Method, device, server and computer storage medium for detecting similar articles
CN113988061A (en) Sensitive word detection method, device and equipment based on deep learning and storage medium
CN111026840A (en) Text processing method, device, server and storage medium
Noaman et al. Enhancing recurrent neural network-based language models by word tokenization
CN113590810A (en) Abstract generation model training method, abstract generation device and electronic equipment
CN114780709A (en) Text matching method and device and electronic equipment
Babatunde et al. Automatic table recognition and extraction from heterogeneous documents
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN113111178B (en) Method and device for disambiguating homonymous authors based on expression learning without supervision
Takase et al. Composing distributed representations of relational patterns
CN114372454A (en) Text information extraction method, model training method, device and storage medium
WO2023087935A1 (en) Coreference resolution method, and training method and apparatus for coreference resolution model
Suneera et al. A bert-based question representation for improved question retrieval in community question answering systems
Majumder et al. Event extraction from biomedical text using crf and genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant