
CN109918621B - News text infringement detection method and device based on digital fingerprints and semantic features

Info

Publication number: CN109918621B
Application number: CN201910119330.5A
Authority: CN
Inventors: 杨鹏; 孙麟; 李幼平; 张长江; 郑斌
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2019-02-18
Filing date: 2019-02-18
Publication date: 2023-02-28
Anticipated expiration: 2039-02-18
Also published as: CN109918621A

Abstract

The invention discloses a method and a device for detecting infringement of news texts based on digital fingerprints and semantic features, which can detect in real time whether news published on major news media websites is infringing by measuring text similarity. The method first collects news text sample data from the Internet and constructs infringement samples from the original news texts. It then maps the news texts into a common vector space using a word2vec model and extracts text fingerprint features with an improved locality-sensitive hashing method. Next, it learns text semantic features using triplet loss on a long short-term memory (LSTM) recurrent neural network module. Finally, it judges whether a text is infringing by calculating the similarity of the fused digital fingerprint features and semantic features. Compared with the prior art, word senses are embedded into the fingerprints, so plagiarism is easier to detect, and because the similarity of news texts is measured using both digital fingerprint features and semantic features, the accuracy of news text infringement detection can be effectively improved.

Description

News text infringement detection method and device based on digital fingerprints and semantic features

Technical Field

The invention relates to a method and a device for detecting infringement of a news text based on digital fingerprints and semantic features.

Background

The rapid development of Internet technology has made the Internet the most important channel through which people obtain information and resources. However, the convenience of the Internet and the continuous upgrading of information-sharing technology make it easier for people to acquire data on the one hand, and on the other hand create opportunities for plagiarism, illegal dissemination and similar acts. The core advantage of the Internet is that information can be spread rapidly and widely at nearly zero cost. This undoubtedly creates highly favorable conditions for the prosperity of the cultural media industry, but it also facilitates mass piracy, copyright infringement, and damage to the interests of copyright content producers.

Document infringement detection relies mainly on two basic classes of methods: methods based on word frequency statistics, and methods based on string comparison. Methods based on word frequency statistics underlie many text similarity algorithms and are widely applied in other fields. Their major disadvantage is that they consider only the statistical characteristics of words in context, assume keywords to be linearly independent, and ignore the semantic information of words, which limits their ability to detect text similarity. Methods based on string comparison follow the idea of hash deduplication and have difficulty directly detecting infringement such as reference plagiarism.

Disclosure of Invention

The invention aims to: aiming at the problems and the defects in the prior art, the invention provides a method and a device for detecting the infringement of the news text based on the digital fingerprint and the semantic features.

The technical scheme is as follows: to achieve the above purpose, the method for detecting infringement of news text based on digital fingerprints and semantic features uses an improved Locality-Sensitive Hashing (LSH) method that takes the correlation between words as its input to extract text fingerprint features, then constructs a detection module based on an LSTM (Long Short-Term Memory) network and learns the semantic features of the text using triplet loss, and finally judges whether the news text is infringing by calculating the similarity of the fused digital fingerprint and semantic features. The method extracts features of the news text from both the digital fingerprint perspective and the semantic perspective and distinguishes them from the features of the news texts already in the library, thereby improving detection accuracy. The method mainly comprises four steps, specifically as follows:

(1) Collecting news texts of multiple categories through the Internet, and accumulating a sample data set; the samples in the data set comprise news text original texts and news text infringement samples constructed on the basis of the news text original texts according to plagiarism rules;

(2) Calculating text digital fingerprint features based on an improved LSH method, comprising: calculating a word vector of a news text by using a word2vec model, calculating a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value of a word, and taking a TF-IDF value which is the product of the TF value and the IDF value as the weight of a corresponding word vector in the text for weighting and summing to be used as a digital fingerprint feature of the news text;

(3) Constructing triplet data from the sample data set, taking the triplet data as the input of an LSTM network model, and learning text semantic features using triplet loss; each triplet comprises an Anchor instance, a Positive instance and a Negative instance, wherein the Anchor instance is an original news text, the Positive instance is an infringement sample constructed from that original news text, and the Negative instance is an original news text that reports the same event as the Anchor instance but does not infringe it;

(4) Fusing the digital fingerprint features of the news text to be detected, which are obtained by calculation according to the method in the step (2), with the semantic features of the news text to be detected, which are extracted based on the LSTM network model trained in the step (3), calculating the similarity between the fusion features of the news text to be detected and the fusion features of the news text in the copyright library subjected to copyright authentication, and further judging whether the news text to be detected has infringement behaviors.

In a preferred embodiment, the news text collected from the internet and the constructed infringement sample in step (1) are packaged into a corresponding UCL according to the UCL standard.

In a preferred embodiment, the plagiarism rules according to which the infringement samples are constructed in step (1) comprise one or more of complete copying, add/delete operations, synonym/near-synonym replacement, and adjustment of the text structure.

In a preferred embodiment, the TF value of a word is calculated in said step (2) according to the following formula:

where $f(w, d)$ represents the word frequency of word $w$ in text $d$, and $v$ represents the most frequently occurring word in text $d$.

In a preferred embodiment, the IDF value of a term is calculated in said step (2) according to the following formula:

where $|D|$ represents the total number of texts in the sample data set, and $|\{d \in D : w \in d\}|$ is the number of texts containing word $w$.

In a preferred embodiment, the digital fingerprint features calculated in step (2) are represented as:

$$\mathrm{LSH}(d) = \sum_{w \in d} \left( a_w \times \mathrm{tfidf}_w \right)$$

where $\mathrm{LSH}(d)$ denotes the improved text locality-sensitive hash value of text $d$, used as its digital fingerprint feature, $a_w$ is the word vector of word $w$ in text $d$, and $\mathrm{tfidf}_w$ is the calculated TF-IDF value of word $w$.

In a preferred embodiment, the target loss function of the LSTM network model training in step (3) is:

where $A_i$ is the Anchor instance in a triplet, $P_i$ is the Positive instance of $A_i$, $N_i$ is the Negative instance of $A_i$, $f(\cdot)$ represents the features extracted by the LSTM network, $\lambda$ is a scale-up factor, $\alpha$ is the distance interval, $N$ is the total number of triplets, $\|\cdot\|_2$ represents the Euclidean distance, and $[\cdot]_+$ represents $\max(\cdot, 0)$.

In a preferred embodiment, in the step (4), the digital fingerprint features and the semantic features of the news text to be detected are spliced and fused to obtain a fusion feature vector, and whether infringement exists is judged according to the cosine similarity between the fusion feature vector and the fusion feature vector of the news in the copyright library.

In a preferred embodiment, the news text to be detected in step (4) is news text actively submitted by a user or news text crawled on the internet without copyright certification.

The invention relates to a digital fingerprint and semantic feature-based news text infringement detection device which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the digital fingerprint and semantic feature-based news text infringement detection method when being loaded to the processor.

Advantageous effects: compared with the prior art, the invention has the following advantages:

1. Compared with traditional detection methods, the improved LSH detection method replaces word hash values with word-sense vectors, making infringement such as reference plagiarism easier to detect.

2. The detection method based on LSTM and triplet loss can effectively distinguish merely similar texts from infringing texts.

3. The invention adopts a news text infringement detection method that fuses digital fingerprint features with semantic features, achieving higher accuracy, precision and recall in its detection results.

Drawings

FIG. 1 is a process flow diagram of an embodiment of the invention.

FIG. 2 is a flow chart of an improved LSH method in an embodiment of the present invention.

FIG. 3 is a flowchart of a method for training LSTM and triplet loss according to an embodiment of the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.

As shown in fig. 1, a method for detecting infringement of a news text based on digital fingerprints and semantic features disclosed in the embodiment of the present invention mainly includes the following specific implementation steps:

Step 1: accumulate a sample data set. Without loss of generality, the embodiment first collects news of various categories from the Internet, keeping the amount of data balanced across categories; the news of all categories jointly forms the sample data set D. Since no public plagiarism data set exists for Chinese news text, this embodiment constructs the infringement samples manually and/or by machine. This step can be divided into the following 3 substeps:

Substep 1-1: crawl news texts by category. News texts of the corresponding categories are crawled from Internet websites, keeping the number of news items in each category balanced.

Substep 1-2: package the news into UCLs (Unified Content Labels) as defined by the national standard "Unified Content Label Format Specification" (GB/T 35304-2017). The original HTML (HyperText Markup Language) text is downloaded, key information is extracted from it, and each original news web page is packaged into a corresponding UCL according to the UCL standard. Packaging into UCLs facilitates copyright protection and authentication, and the UCL dual-signature mechanism guards against information tampering.

Substep 1-3: construct an infringement sample library. The original text of each news item is altered using different forms of plagiarism, and a corresponding UCL is constructed. The plagiarism methods are shown in Table 1.

Table 1. Common plagiarism methods
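As a rough illustration of how infringement samples might be derived from an original text under the plagiarism rules the document names (complete copying, add/delete operations, synonym replacement, text structure adjustment), the following Python sketch applies one simple variant of each rule to a list of sentences. The function names, the toy synonym table `SYNONYMS`, the deletion ratio, and the whitespace tokenization are all assumptions made for illustration; the patent does not specify them, and Chinese text would additionally require word segmentation.

```python
import random

# Hypothetical toy synonym table; a real system would use a thesaurus.
SYNONYMS = {"big": "large", "quick": "fast", "said": "stated"}

def full_copy(sentences):
    """Complete copying: the sample is identical to the original."""
    return list(sentences)

def delete_sentences(sentences, rng, ratio=0.3):
    """Add/delete operation (deletion only here): drop a share of the sentences."""
    keep = max(1, round(len(sentences) * (1 - ratio)))
    kept_idx = sorted(rng.sample(range(len(sentences)), keep))
    return [sentences[i] for i in kept_idx]

def replace_synonyms(sentences):
    """Synonym replacement: swap every word found in the synonym table."""
    return [" ".join(SYNONYMS.get(w, w) for w in s.split()) for s in sentences]

def adjust_structure(sentences, rng):
    """Text structure adjustment: reorder the sentences."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def build_infringement_samples(sentences, seed=0):
    """Return one constructed infringement sample per plagiarism rule."""
    rng = random.Random(seed)
    return {
        "full_copy": full_copy(sentences),
        "delete": delete_sentences(sentences, rng),
        "synonym": replace_synonyms(sentences),
        "restructure": adjust_structure(sentences, rng),
    }

original = [
    "the big storm hit the coast",
    "officials said roads are closed",
    "a quick recovery is expected",
    "schools reopen on monday",
]
samples = build_infringement_samples(original)
```

Each entry of `samples` would serve as a Positive instance for the original text in the later triplet construction.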

Step 2: calculate the digital fingerprint features of the text based on the improved LSH method. After word segmentation and stop-word removal are performed on the data set, the correlation between words is used as the input to the improved LSH method, text fingerprint features are extracted, and the text digital fingerprint is constructed. As shown in FIG. 2, this step can be further divided into the following 2 substeps:

Substep 2-1: calculate word vectors based on the word2vec model. In this embodiment, the word2vec model encodes each word through a Huffman tree, and the encoding is used as the input for training a neural network. The objective function of the neural-network-based language model is the log-likelihood function shown in formula (1):

$$L = \sum_{w \in C} \ln p\left(w \mid \mathrm{Context}(w)\right) \tag{1}$$

where $C$ represents the corpus, $w$ is a word appearing in the corpus, and $\mathrm{Context}(w)$ represents the context of $w$, i.e., the set of words adjacent to $w$. In this way each word is mapped to a $K$-dimensional vector $(a_1, a_2, \ldots, a_K)$.
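To make objective (1) more concrete, the sketch below evaluates the log-likelihood on a toy corpus with a CBOW-style model. It is only an illustration under stated assumptions: the probability is a plain softmax over the whole vocabulary rather than the Huffman-tree encoding used in the embodiment, the embeddings are random and untrained, and the names `context`, `log_likelihood`, `emb_in`, `emb_out`, the window size and the dimension `K = 4` are hypothetical.

```python
import math
import random

def context(tokens, i, window=2):
    """Context(w): the words within `window` positions of tokens[i]."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def log_likelihood(corpus, emb_in, emb_out, window=2):
    """L = sum over words w in the corpus of ln p(w | Context(w)).

    p is a full softmax over the vocabulary (an assumption; the embodiment
    encodes words through a Huffman tree instead).
    """
    vocab = list(emb_out)
    total = 0.0
    for tokens in corpus:
        for i, w in enumerate(tokens):
            ctx = context(tokens, i, window)
            if not ctx:
                continue
            dim = len(emb_in[w])
            # Average the input vectors of the context words.
            h = [sum(emb_in[c][k] for c in ctx) / len(ctx) for k in range(dim)]
            scores = {v: sum(h[k] * emb_out[v][k] for k in range(dim)) for v in vocab}
            # Numerically stable log of the softmax normalizer.
            m = max(scores.values())
            log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
            total += scores[w] - log_z
    return total

rng = random.Random(0)
corpus = [["news", "text", "infringement", "detection"],
          ["news", "text", "fingerprint"]]
vocab = sorted({w for doc in corpus for w in doc})
K = 4  # embedding dimension (hypothetical)
emb_in = {w: [rng.uniform(-0.5, 0.5) for _ in range(K)] for w in vocab}
emb_out = {w: [rng.uniform(-0.5, 0.5) for _ in range(K)] for w in vocab}
L = log_likelihood(corpus, emb_in, emb_out)
```

Training would adjust `emb_in` and `emb_out` to increase `L`; the rows of `emb_in` then play the role of the word vectors $(a_1, \ldots, a_K)$.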

Substep 2-2: calculate the text locality-sensitive hash value. First, the TF value of each word is calculated using formula (2):

where $f(w, d)$ represents the word frequency of word $w$ in text $d$, and $v$ represents the most frequently occurring word in the text. The IDF value of the word is calculated using formula (3):

where $|D|$ represents the total number of texts in the text set and $|\{d \in D : w \in d\}|$ is the number of texts containing word $w$; the denominator is constructed so that the case $|\{d \in D : w \in d\}| = 0$ is handled.

Calculating the TF-IDF value of each term using equation (4) based on the TF value and the IDF value of each term:

$$\mathrm{tfidf}_{w,D} = \mathrm{tf}(w, d) \times \mathrm{idf}_{w,D} \tag{4}$$
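Formulas (2) and (3) are not reproduced in this text, so the following Python sketch fills them in with forms that are merely consistent with the surrounding definitions: TF is taken as $f(w,d)/f(v,d)$, with $v$ the most frequent word in $d$, and IDF as $\ln\left(|D| / (1 + |\{d \in D : w \in d\}|)\right)$, where the added 1 keeps the denominator non-zero. Both forms, the natural logarithm, and the function names are assumptions; only the product in formula (4) is taken directly from the document.

```python
import math
from collections import Counter

def tf(word, doc):
    """Assumed form of formula (2): f(w, d) / f(v, d), v = most frequent word in d."""
    counts = Counter(doc)
    return counts[word] / max(counts.values())

def idf(word, docs):
    """Assumed form of formula (3): ln(|D| / (1 + number of texts containing w))."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (1 + containing))

def tfidf(word, doc, docs):
    """Formula (4): the TF-IDF value is the product of the TF and IDF values."""
    return tf(word, doc) * idf(word, docs)

# Toy data set of already segmented, stop-word-filtered texts.
docs = [
    ["storm", "hits", "coast", "storm", "warning"],
    ["market", "closes", "higher"],
    ["coast", "guard", "rescue"],
    ["election", "results", "announced"],
]
weights = {w: tfidf(w, docs[0], docs) for w in set(docs[0])}
```

Under this assumed IDF, a word that occurs in every text receives a slightly negative weight, since $|D| / (1 + |D|) < 1$.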

In the traditional text locality-sensitive hash calculation, each word is hashed and the hash is then multiplied by its TF-IDF weight. Here, the word vectors calculated in substep 2-1 replace the word hash values, so that word senses are embedded into the fingerprint, the correlation of the text locality-sensitive hash values is enhanced, and the locality-sensitive property is maintained. The resulting digital fingerprint feature is given by formula (5), where $d$ is a text, $w$ is a word appearing in the text $d$, $a_w$ is the word vector of the word $w$, and $\mathrm{tfidf}_w$ is the weight of the word $w$ calculated by formula (4).

$$\mathrm{LSH}(d) = \sum_{w \in d} \left( a_w \times \mathrm{tfidf}_w \right) \tag{5}$$
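Formula (5) is a TF-IDF-weighted sum of word vectors, which can be sketched in a few lines of Python. The function name `fingerprint`, the 3-dimensional toy vectors and the weights are hypothetical, and the sum is taken over the distinct words of the text, which is one reading of $w \in d$.

```python
def fingerprint(doc_weights, word_vectors):
    """Formula (5): LSH(d) = sum over words w in d of a_w * tfidf_w."""
    dim = len(next(iter(word_vectors.values())))
    fp = [0.0] * dim
    for word, weight in doc_weights.items():
        vec = word_vectors[word]
        for k in range(dim):
            fp[k] += weight * vec[k]
    return fp

# Hypothetical 3-dimensional word vectors and TF-IDF weights for one text.
word_vectors = {"storm": [1.0, 0.0, 2.0], "coast": [0.0, 1.0, 1.0]}
doc_weights = {"storm": 0.5, "coast": 0.25}
fp = fingerprint(doc_weights, word_vectors)  # [0.5, 0.25, 1.25]
```

Because the summands are word-sense vectors rather than word hashes, replacing a word with a synonym whose vector is close changes the fingerprint only slightly, which is the property the document relies on.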

Step 3: learn text semantic features based on the LSTM and triplet loss. This step can be divided into the following 3 substeps:

Substep 3-1: construct triplet data. Each triplet includes an Anchor instance, a Positive instance, and a Negative instance. In the data set used in this embodiment, the Anchor is an original news sample, the Positive is an infringing sample of the Anchor, and the Negative is a news sample similar to the Anchor but not infringing. Sample similarity is learned by optimizing so that the distance between the Anchor instance and the Positive instance is smaller than the distance between the Anchor instance and the Negative instance. All samples are news text feature matrices built from the word vectors generated in substep 2-1.

From the original text data $D_A$ collected in step 1 and the constructed plagiarism data $D_P$, triplets $(A_i, P_i, N_i)$ are built, where $A_i$ is an Anchor instance, $P_i$ is the Positive instance of $A_i$, and $N_i$ is the Negative instance of $A_i$ ($N_i$ and $A_i$ are two news texts reporting the same event, with neither plagiarizing the other), and $A_i$, $P_i$, $N_i$ satisfy formula (6):

$$d(A_i, P_i) < d(A_i, N_i) < d(A_i, P_i) + \alpha \tag{6}$$

where $d(A_i, P_i)$ represents the distance between $A_i$ and $P_i$, $d(A_i, N_i)$ represents the distance between $A_i$ and $N_i$, and $\alpha$ is the distance interval.
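The selection condition in formula (6) can be checked mechanically once a distance is fixed. The Python sketch below assumes Euclidean distance on flat feature vectors and a hypothetical data layout in which each anchor comes with one positive and a list of candidate negatives; the document does not state which distance $d$ is used at this stage, so both choices are illustrative only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def satisfies_formula_6(anchor, positive, negative, alpha, dist=euclidean):
    """True when d(A, P) < d(A, N) < d(A, P) + alpha."""
    d_ap = dist(anchor, positive)
    d_an = dist(anchor, negative)
    return d_ap < d_an < d_ap + alpha

def build_triplets(anchors, positives, negatives, alpha):
    """Pair each anchor with its positive and every admissible negative.

    positives[i] is the infringing sample of anchors[i]; negatives[i] lists
    non-infringing texts about the same event (hypothetical data layout).
    """
    triplets = []
    for a, p, candidates in zip(anchors, positives, negatives):
        for n in candidates:
            if satisfies_formula_6(a, p, n, alpha):
                triplets.append((a, p, n))
    return triplets

anchors = [[0.0, 0.0]]
positives = [[1.0, 0.0]]
negatives = [[[0.0, 1.5], [0.0, 5.0], [0.5, 0.0]]]
triplets = build_triplets(anchors, positives, negatives, alpha=1.0)
```

In this toy example only the first candidate is kept: the second lies farther than $d(A, P) + \alpha$ and the third is closer to the anchor than the positive.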

In this embodiment, the LSTM network is used to extract low-dimensional features of the input data. The triplet data then takes the form $(f(A_i), f(P_i), f(N_i))$, where $f(\cdot)$ represents the extracted features. From formula (6), the distance requirement that a triplet needs to satisfy is as shown in formula (7):

Substep 3-2: train the LSTM network module. The objective loss function of the network obtained from formula (7) is formula (8):

where $\lambda$ is a scale amplification factor. The network is trained using stochastic gradient descent and the backpropagation algorithm. Once the network model converges, the trained LSTM network is obtained; its input is a text word vector matrix and its output is the normalized text semantic features.
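Formula (8) is not reproduced in this text. As a hedged illustration of the general technique, the sketch below computes a standard triplet hinge loss over already-extracted features, using squared Euclidean distances, the margin $\alpha$ and the $[\cdot]_+ = \max(\cdot, 0)$ operator that the document defines. Where exactly the scale factor $\lambda$ enters the patent's loss cannot be recovered from the text; here the parameter `lam` simply multiplies the distance gap, which is an assumption, as are the function names.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(triplets, alpha, lam=1.0):
    """Mean hinge loss over N triplets of extracted features.

    Each triplet is (f(A_i), f(P_i), f(N_i)). `lam` scales the distance gap;
    this placement of the scale factor is an assumption, not the patent's.
    """
    total = 0.0
    for f_a, f_p, f_n in triplets:
        gap = sq_dist(f_a, f_p) - sq_dist(f_a, f_n)
        total += max(lam * gap + alpha, 0.0)  # [x]_+ = max(x, 0)
    return total / len(triplets)

features = [
    ([0.0, 0.0], [1.0, 0.0], [0.0, 3.0]),  # negative far away: contributes 0
    ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0]),  # negative as close as the positive
]
loss = triplet_loss(features, alpha=0.5)  # (0 + 0.5) / 2 = 0.25
```

A triplet stops contributing to the loss once its negative is farther from the anchor than its positive by at least the margin, which is why the first toy triplet adds nothing.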

Substep 3-3: calculate the semantic features of the text to be detected. Using the LSTM network with the weights trained in substep 3-2, the word vector matrix of the text to be detected is taken as input to obtain the semantic features of the text to be detected.
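The mapping from a word vector matrix to a normalized semantic feature vector can be sketched with a single-layer LSTM forward pass. The code below is a minimal pure-Python illustration, not the patent's network: the weights are random rather than trained, the feature is taken to be the last hidden state, normalization is assumed to be L2 normalization, and the names `init_lstm` and `lstm_features` and all dimensions are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def init_lstm(input_dim, hidden_dim, seed=0):
    """Random (untrained) weights for the gates: input i, forget f, cell g, output o."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {name: {"W": mat(hidden_dim, input_dim),
                   "U": mat(hidden_dim, hidden_dim),
                   "b": [0.0] * hidden_dim}
            for name in "ifgo"}

def lstm_features(word_matrix, params):
    """Run the LSTM over the word vectors and L2-normalize the last hidden state."""
    hidden_dim = len(params["i"]["b"])
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in word_matrix:
        # Pre-activations W x + U h + b for each gate.
        pre = {}
        for name, p in params.items():
            wx, uh = matvec(p["W"], x), matvec(p["U"], h)
            pre[name] = [a + b + bias for a, b, bias in zip(wx, uh, p["b"])]
        i_gate = [sigmoid(z) for z in pre["i"]]
        f_gate = [sigmoid(z) for z in pre["f"]]
        g_cand = [math.tanh(z) for z in pre["g"]]
        o_gate = [sigmoid(z) for z in pre["o"]]
        c = [f * ck + i * g for f, ck, i, g in zip(f_gate, c, i_gate, g_cand)]
        h = [o * math.tanh(ck) for o, ck in zip(o_gate, c)]
    norm = math.sqrt(sum(v * v for v in h)) or 1.0
    return [v / norm for v in h]

params = init_lstm(input_dim=3, hidden_dim=4)
text = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]  # two words, 3-dimensional vectors
semantic = lstm_features(text, params)
```

In practice the weights in `params` would come from the triplet-loss training of substep 3-2, typically via a deep learning framework rather than hand-written loops.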

Step 4: text similarity detection based on the fusion of digital fingerprint and semantic features. The digital fingerprint features calculated in step 2 and the semantic features extracted in step 3 are spliced and fused, and the cosine similarity of the fused digital fingerprint and semantic features is calculated to judge whether the text is infringing. For the feature vectors, the correlation may be measured by any correlation or similarity method; this embodiment is described taking the Pearson Correlation Coefficient (PCC) as an example, and the PCC calculation formula is expressed as formula (9):

$$\mathrm{PCC}(V_X, V_A) = \frac{\sum_{i} \left(V_{X,i} - \bar{V}_X\right)\left(V_{A,i} - \bar{V}_A\right)}{\sqrt{\sum_{i} \left(V_{X,i} - \bar{V}_X\right)^2}\,\sqrt{\sum_{i} \left(V_{A,i} - \bar{V}_A\right)^2}} \tag{9}$$

where $V_X$ and $V_A$ are the fused digital fingerprint and semantic feature vectors of the text X to be detected and of an original text A in the copyright library that has undergone copyright authentication, respectively, $V_{X,i}$ represents the $i$-th feature of $V_X$, and $\bar{V}_X$ represents the average of all features of $V_X$. In a concrete detection scenario, the text X to be detected can come from two sources: first, active infringement avoidance, in which a user actively submits the text for comparison with the news in the copyright library; second, passive defense against infringement, in which a crawler system collects news online and every news item that has not been copyright-authenticated is treated as a text to be detected.
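The fusion and comparison in step 4 can be sketched as follows. Splicing is implemented as vector concatenation, and both cosine similarity and the Pearson correlation coefficient are shown, since the document mentions both. The decision threshold of 0.9, the function names and the toy vectors are hypothetical; the patent does not state a threshold value.

```python
import math

def fuse(fingerprint, semantic):
    """Splice (concatenate) the fingerprint and semantic feature vectors."""
    return list(fingerprint) + list(semantic)

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pearson(u, v):
    """Pearson correlation coefficient: cosine similarity of mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def is_infringing(query, library, threshold=0.9, similarity=pearson):
    """Flag the query if any library vector is at least `threshold` similar."""
    return any(similarity(query, ref) >= threshold for ref in library)

# Fused vectors of copyright-authenticated texts (toy values).
library = [
    fuse([0.5, 0.25, 1.25], [0.1, 0.9]),
    fuse([2.0, -1.0, 0.0], [0.8, 0.2]),
]
# Text to be detected: nearly identical features to the first library text.
query = fuse([0.5, 0.3, 1.2], [0.1, 0.9])
flagged = is_infringing(query, library)
```

Passing `similarity=cosine` switches the decision to cosine similarity without changing the rest of the pipeline.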

Based on the same inventive concept, the embodiment of the present invention further provides a device for detecting infringement of news text based on digital fingerprints and semantic features, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is loaded into the processor, the method for detecting infringement of news text based on digital fingerprints and semantic features is implemented.

Claims

1. A news text infringement detection method based on digital fingerprints and semantic features is characterized by comprising the following steps:

(1) Collecting news texts of a plurality of categories through the Internet, and accumulating a sample data set; the samples in the data set comprise original news texts and infringement samples of the news texts constructed on the basis of the original news texts according to plagiarism rules;

(2) Calculating text digital fingerprint characteristics based on an improved LSH method, comprising the following steps: calculating word vectors of news texts by using a word2vec model, calculating TF values and IDF values of words, taking TF-IDF values which are products of the TF values and the IDF values as weights of corresponding word vectors in the texts, and performing weighted summation to obtain digital fingerprint characteristics of the news texts;

(3) Constructing triplet data according to the sample data set, taking the triplet data as the input of an LSTM network model, and learning text semantic features by utilizing triplet loss; the method comprises the following steps:

(3-1) constructing triplet data, wherein one triplet comprises an Anchor instance, a Positive instance and a Negative instance, the Anchor instance is an original news text, the Positive instance is an infringement sample constructed based on the original news text, and the Negative instance is an original news text that reports the same event as the Anchor instance but does not infringe it;

(3-2) training an LSTM network module; the target loss function for the LSTM network model training is:

wherein $A_i$ is the Anchor instance in a triplet, $P_i$ is the Positive instance of $A_i$, $N_i$ is the Negative instance of $A_i$, $f(\cdot)$ represents the features extracted by the LSTM network, $\lambda$ is a scale-up factor, $\alpha$ is the distance interval, $N$ is the total number of triplets, $\|\cdot\|_2$ denotes the Euclidean distance, and $[\cdot]_+$ represents $\max(\cdot, 0)$;

(3-3) inputting the word vector matrix of the text to be detected into the LSTM network to obtain the semantic features of the text to be detected;

2. The method for detecting infringement of news text based on digital fingerprints and semantic features as claimed in claim 1, wherein in the step (1), the news text collected from the internet and the constructed infringement sample are packaged into corresponding UCL according to UCL standard.

3. The method for detecting infringement of news text based on digital fingerprints and semantic features according to claim 1, wherein the plagiarism rules according to which the infringement samples are constructed in the step (1) comprise one or more of complete copying, add/delete operations, synonym/near-synonym replacement and text structure adjustment.

4. The method for detecting infringement of news text based on digital fingerprint and semantic features according to claim 1, wherein in the step (2), the TF value of a word is calculated according to the following formula:

5. The method for detecting infringement of news text based on digital fingerprint and semantic features according to claim 1, wherein the IDF value of a word is calculated in step (2) according to the following formula:

6. The method for detecting infringement of news text based on digital fingerprints and semantic features according to claim 1, wherein the digital fingerprint features calculated in step (2) are represented as:

7. The method for detecting infringement of news text based on digital fingerprints and semantic features as claimed in claim 1, wherein in the step (4), the digital fingerprint features and the semantic features of the news text to be detected are spliced and fused to obtain a fusion feature vector, and whether infringement exists is judged according to cosine similarity between the fusion feature vector and the fusion feature vector of news in a copyright library.

8. The method for detecting infringement of news texts based on digital fingerprints and semantic features as claimed in claim 1, wherein the news texts to be detected in step (4) are actively submitted by users or are crawled on the internet without copyright authentication.

9. A digital fingerprint and semantic feature based infringement detection apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program when loaded into the processor implements a digital fingerprint and semantic feature based infringement detection method according to any of claims 1-8.