CN109918621B - News text infringement detection method and device based on digital fingerprints and semantic features - Google Patents

Info

Publication number
CN109918621B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910119330.5A
Other languages
Chinese (zh)
Other versions
CN109918621A (en)
Inventor
杨鹏
孙麟
李幼平
张长江
郑斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-02-18
Filing date
2019-02-18
Publication date
2023-02-28
Application filed by Southeast University
Priority to CN201910119330.5A
Publication of CN109918621A
Application granted
Publication of CN109918621B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a device for detecting infringement of news texts based on digital fingerprints and semantic features, which can detect in real time whether news published on major news media websites is infringing by measuring text similarity. The method first collects news text sample data from the Internet and constructs infringement samples from the original news texts. It then maps the news texts into a unified vector space using a word2vec model and extracts text fingerprint features with an improved locality-sensitive hashing method. Next, it learns text semantic features with a long short-term memory (LSTM) recurrent neural network module trained with triplet loss. Finally, it judges whether a text is infringing by calculating similarity over the fused digital fingerprint and semantic features. Compared with the prior art, embedding word senses into the fingerprints makes plagiarism easier to detect, and measuring news text similarity with both digital fingerprint features and semantic features effectively improves the accuracy of news text infringement detection.

Description

News text infringement detection method and device based on digital fingerprints and semantic features
Technical Field
The invention relates to a method and a device for detecting infringement of a news text based on digital fingerprints and semantic features.
Background
The rapid development of Internet technology has made the Internet the most important channel through which people obtain information and resources. However, the convenience of the Internet and the continuous upgrading of information-sharing technology not only make it easier for people to acquire data but also create opportunities for plagiarism, illegal dissemination and similar behavior. The core advantage of the Internet is that information can be spread rapidly and widely at nearly zero cost. This undoubtedly creates highly favorable conditions for the prosperity of the cultural media industry, but it also facilitates mass piracy, copyright infringement and damage to the interests of copyright content producers.
Document infringement detection mainly relies on two basic classes of methods: methods based on word frequency statistics and methods based on string comparison. Methods based on word frequency statistics underlie many text similarity algorithms and are widely applied in other fields. Their major disadvantage is that they consider only the statistical characteristics of words in context, assume that keywords are linearly independent, and ignore the semantic information of words, which limits their ability to detect text similarity. Methods built on string comparison and hash deduplication, in turn, have difficulty directly detecting infringement such as reference plagiarism.
Disclosure of Invention
Purpose of the invention: in view of the problems and deficiencies of the prior art, the invention provides a method and a device for detecting infringement of news texts based on digital fingerprints and semantic features.
Technical scheme: to achieve the above purpose, the news text infringement detection method based on digital fingerprints and semantic features uses an improved Locality-Sensitive Hashing (LSH) method that takes the correlation between words as its input to extract text fingerprint features. It then builds a detection module based on an LSTM (Long Short-Term Memory) network and learns the semantic features of the text using triplet loss. Finally, it judges whether a news text is infringing by calculating similarity over the fused digital fingerprint and semantic features. The method extracts features of a news text comprehensively from both the digital fingerprint and the semantic perspective and distinguishes them from the features of news texts already in the library, thereby improving detection accuracy. The method mainly comprises four steps, as follows:
(1) Collecting news texts of multiple categories through the Internet and accumulating a sample data set; the samples in the data set comprise original news texts and infringement samples constructed from the original news texts according to plagiarism rules;
(2) Calculating text digital fingerprint features based on the improved LSH method, comprising: calculating word vectors of a news text using a word2vec model; calculating the TF (Term Frequency) value and IDF (Inverse Document Frequency) value of each word; and using the TF-IDF value, which is the product of the TF value and the IDF value, as the weight of the corresponding word vector in a weighted sum that serves as the digital fingerprint feature of the news text;
(3) Constructing triplet data from the sample data set, taking the triplet data as the input of an LSTM network model, and learning text semantic features using triplet loss; each triplet comprises an Anchor instance, a Positive instance and a Negative instance, wherein the Anchor instance is an original news text, the Positive instance is an infringement sample constructed from that original news text, and the Negative instance is an original news text that reports the same event as the Anchor instance but does not infringe it;
(4) Fusing the digital fingerprint features of the news text to be detected, calculated according to the method in step (2), with the semantic features of the news text to be detected, extracted by the LSTM network model trained in step (3); calculating the similarity between the fused features of the news text to be detected and the fused features of the copyright-authenticated news texts in the copyright library; and thereby judging whether the news text to be detected is infringing.
In a preferred embodiment, the news text collected from the internet and the constructed infringement sample in step (1) are packaged into a corresponding UCL according to the UCL standard.
In a preferred embodiment, the plagiarism rules according to which the infringement samples are constructed in step (1) comprise one or more of complete copying, insertion and deletion, synonym/near-synonym replacement, and adjustment of the text structure.
In a preferred embodiment, the TF value of a word is calculated in said step (2) according to the following formula:
Figure BDA0001971312260000021
where f (w, d) represents the word frequency of word w in text d, and v represents the most frequently occurring word in text d.
In a preferred embodiment, the IDF value of a term is calculated in said step (2) according to the following formula:
Figure BDA0001971312260000022
where |D| represents the total number of texts in the sample data set, and |{d ∈ D : w ∈ d}| is the number of texts containing word w.
In a preferred embodiment, the digital fingerprint features calculated in step (2) are represented as:
$$LSH(d) = \sum_{w \in d} (a_w \times tfidf_w)$$
where LSH(d) denotes the improved text locality-sensitive hash value of text d, used as its digital fingerprint feature; a_w denotes the word vector of word w in text d; and tfidf_w is the calculated TF-IDF value of word w.
In a preferred embodiment, the target loss function of the LSTM network model training in step (3) is:
Figure BDA0001971312260000032
where A_i is the Anchor instance in a triplet, P_i is the Positive instance of A_i, N_i is the Negative instance of A_i, f(·) represents the features extracted by the LSTM network, λ is a scale amplification factor, α is the distance margin, N is the total number of triplets, ||·||_2 denotes the Euclidean distance, and [·]_+ represents max(·, 0).
In a preferred embodiment, in the step (4), the digital fingerprint features and the semantic features of the news text to be detected are spliced and fused to obtain a fusion feature vector, and whether infringement exists is judged according to the cosine similarity between the fusion feature vector and the fusion feature vector of the news in the copyright library.
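The splice-and-compare step described here can be sketched as follows. This is a minimal illustration, assuming both features are plain lists of floats; the helper names `fuse` and `cosine_similarity` and the toy vectors are hypothetical and do not come from the patent.

```python
import math

def fuse(fingerprint, semantic):
    """Concatenate (splice) the fingerprint and semantic feature vectors."""
    return list(fingerprint) + list(semantic)

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, for illustration only).
candidate = fuse([0.5, 0.5], [0.1, 0.9])
reference = fuse([0.5, 0.5], [0.1, 0.9])
score = cosine_similarity(candidate, reference)
```

Because concatenation gives more influence to whichever part has the larger magnitude, an implementation may want to bring the two parts to a comparable scale before fusing; the text does not say whether this is done.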
In a preferred embodiment, the news text to be detected in step (4) is news text actively submitted by a user or news text crawled on the internet without copyright certification.
The invention relates to a digital fingerprint and semantic feature-based news text infringement detection device which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the digital fingerprint and semantic feature-based news text infringement detection method when being loaded to the processor.
Beneficial effects: compared with the prior art, the invention has the following advantages:
1. Compared with traditional detection methods, the improved LSH detection method replaces word hash values with word sense vectors, making infringement such as reference plagiarism easier to detect.
2. The detection method based on LSTM and triplet loss can effectively distinguish merely similar texts from infringing texts.
3. The invention adopts a news text infringement detection method that fuses digital fingerprint features and semantic features, and achieves higher accuracy, precision and recall in its detection results.
Drawings
FIG. 1 is a process flow diagram of an embodiment of the invention.
Fig. 2 is a flow chart of an improved LSH method in an embodiment of the present invention.
FIG. 3 is a flowchart of a method for training LSTM and triplet loss according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, a method for detecting infringement of a news text based on digital fingerprints and semantic features disclosed in the embodiment of the present invention mainly includes the following specific implementation steps:
Step 1: accumulate a sample data set. Without loss of generality, this embodiment first collects news of various categories from the Internet, keeps the amount of data in each category balanced, and combines the news of all categories into a sample data set D. Since no public plagiarism data set exists for Chinese news texts, the infringement samples in this embodiment are constructed manually and/or by machine. This step is divided into the following 3 substeps:
Substep 1-1: crawl news texts by category. News texts of the corresponding categories are crawled from Internet websites, keeping the number of news items in each category balanced.
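One simple way to keep categories balanced is to cap the number of articles kept per category. The sketch below assumes each crawled article is a dict with a `category` key; that record layout, the function name and the cap value are illustrative assumptions, not details given in the text.

```python
import random
from collections import defaultdict

def balance_by_category(articles, per_category, seed=0):
    """Keep at most `per_category` randomly chosen articles from each category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for article in articles:
        by_category[article["category"]].append(article)
    balanced = []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        balanced.extend(items[:per_category])
    return balanced

# Toy input: 5 sports articles and 3 finance articles (made-up data).
articles = (
    [{"category": "sports", "text": f"s{i}"} for i in range(5)]
    + [{"category": "finance", "text": f"f{i}"} for i in range(3)]
)
sample = balance_by_category(articles, per_category=3)
```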
Substep 1-2: package the news into UCLs (Unified Content Labels) as defined by the national standard "Unified Content Label Format Specification" (GB/T 35304-2017). The original HTML (HyperText Markup Language) text is downloaded, key information is extracted from it, and the original news web page is packaged into a corresponding UCL according to the UCL standard. Packaging into UCLs facilitates copyright protection and authentication, and the UCL dual-signature mechanism prevents information tampering.
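The extract-then-package flow can be illustrated with the standard-library HTML parser. Note that the record below is a hypothetical stand-in and does not follow the actual UCL field layout of GB/T 35304-2017, which is not described in this text; the SHA-256 digest merely hints at how content integrity could be checked and is not the UCL dual-signature mechanism.

```python
import hashlib
from html.parser import HTMLParser

class NewsExtractor(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self._tag = None
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data

def package_record(html, url):
    """Build a hypothetical label record (NOT the real UCL format)."""
    parser = NewsExtractor()
    parser.feed(html)
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {
        "title": parser.title.strip(),
        "url": url,
        "content": body,
        "content_sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }

page = "<html><head><title>Demo</title></head><body><p>First.</p><p>Second.</p></body></html>"
record = package_record(page, "https://example.com/news/1")  # placeholder URL
```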
Substep 1-3: construct an infringement sample library. The original text of the news content is altered through different forms of plagiarism, and a corresponding UCL is constructed for each altered text. The plagiarism methods are shown in Table 1.
Table 1. Common plagiarism methods
Figure BDA0001971312260000041
Figure BDA0001971312260000051
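Machine construction of infringement samples could look roughly like the sketch below, which implements one toy transformation per plagiarism rule named in the summary (complete copying, adding and deleting, synonym replacement, structure adjustment). The function names, the sentence-level granularity and the tiny synonym table are all assumptions made for illustration; the contents of Table 1 are not reproduced in this text.

```python
import random

# Hypothetical toy synonym table; a real system would need a proper thesaurus.
SYNONYMS = {"big": "large", "quick": "fast", "said": "stated"}

def copy_exact(sentences):
    """Complete copying: return the text unchanged."""
    return list(sentences)

def delete_sentence(sentences, rng):
    """Deletion: drop one randomly chosen sentence (if more than one exists)."""
    if len(sentences) <= 1:
        return list(sentences)
    idx = rng.randrange(len(sentences))
    return sentences[:idx] + sentences[idx + 1:]

def insert_sentence(sentences, extra, rng):
    """Addition: insert an extra sentence at a random position."""
    idx = rng.randrange(len(sentences) + 1)
    return sentences[:idx] + [extra] + sentences[idx:]

def replace_synonyms(sentences):
    """Synonym replacement: swap words found in the synonym table."""
    return [" ".join(SYNONYMS.get(tok, tok) for tok in s.split()) for s in sentences]

def reorder(sentences, rng):
    """Structure adjustment: shuffle the sentence order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
original = ["a big storm hit the coast", "officials said roads are closed", "repairs begin soon"]
variants = [
    copy_exact(original),
    delete_sentence(original, rng),
    insert_sentence(original, "more updates will follow", rng),
    replace_synonyms(original),
    reorder(original, rng),
]
```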
Step 2: calculate the digital fingerprint features of the text based on the improved LSH method. After word segmentation and stop-word removal are performed on the data set, the correlation between words is used as the input of the improved LSH method, text fingerprint features are extracted, and the text digital fingerprint is constructed. As shown in fig. 2, this step is divided into the following 2 substeps:
Substep 2-1: calculate word vectors based on the word2vec model. In this embodiment, the word2vec model encodes each word through a Huffman tree, and the result is used as the input of a neural network for training. The objective function of the neural-network-based language model is the log-likelihood function shown in formula (1):
$$L = \sum_{w \in C} \ln p(w \mid Context(w)) \quad (1)$$
where C represents the corpus, w is a word appearing in the corpus, and Context(w) represents the context of w, i.e., the set of words adjacent to w. In this way, words can be mapped to K-dimensional vectors (a_1, a_2, …, a_K).
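To make Context(w) and the log-likelihood objective concrete, the sketch below enumerates (word, context) pairs with a fixed window and sums ln p(w | Context(w)) under a placeholder model. It does not train word2vec; the window size and the uniform probability model are assumptions chosen only so that the example runs.

```python
import math

def context_pairs(tokens, window=2):
    """Pair each word with its neighbours within `window` positions."""
    pairs = []
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((word, left + right))
    return pairs

def log_likelihood(pairs, prob):
    """Sum of ln p(w | Context(w)) over all pairs, as in formula (1)."""
    return sum(math.log(prob(word, ctx)) for word, ctx in pairs)

tokens = ["news", "text", "infringement", "detection"]
pairs = context_pairs(tokens, window=1)

# Placeholder model: every word equally likely regardless of its context.
vocab_size = len(set(tokens))
uniform = lambda word, ctx: 1.0 / vocab_size
objective = log_likelihood(pairs, uniform)
```

Training would adjust the model parameters, and with them the word vectors, to increase this objective.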
Substep 2-2: calculate the text locality-sensitive hash value. First, the TF value of each word is calculated using formula (2):
Figure BDA0001971312260000052
where f(w, d) represents the word frequency of word w in text d, and v represents the most frequently occurring word in the text. The IDF value of each word is then calculated using formula (3):
Figure BDA0001971312260000053
where |D| represents the total number of texts in the text set, |{d ∈ D : w ∈ d}| is the number of texts containing word w, and the denominator is constructed so that the case where |{d ∈ D : w ∈ d}| is 0 can be handled.
Calculating the TF-IDF value of each term using equation (4) based on the TF value and the IDF value of each term:
$$tfidf_{w,D} = tf(w, d) \times idf_{w,D} \quad (4)$$
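A minimal sketch of the TF, IDF and TF-IDF computation follows. Formulas (2) and (3) appear only as images in the source, so their exact forms are assumed here: TF is taken as f(w, d) divided by the count of the most frequent word v, and IDF as the natural log of |D| over one plus the number of texts containing w, the added one being a guess at how the denominator handles a count of zero.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Assumed form of formula (2): f(w, d) / f(v, d), v = most frequent word in d."""
    counts = Counter(doc_tokens)
    return counts[word] / max(counts.values())

def idf(word, corpus):
    """Assumed form of formula (3): ln(|D| / (1 + number of texts containing w))."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (1 + containing))

def tfidf(word, doc_tokens, corpus):
    """Formula (4): product of the TF value and the IDF value."""
    return tf(word, doc_tokens) * idf(word, corpus)

# Toy corpus of already segmented, stop-word-filtered texts (made-up data).
corpus = [
    ["news", "text", "news"],
    ["text", "fingerprint"],
    ["semantic", "feature"],
    ["hash"],
]
weight = tfidf("news", corpus[0], corpus)
```

Under this assumed IDF, a word that occurs in every text receives a slightly negative weight, which an implementation might clamp at zero.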
In the traditional text locality-sensitive hash calculation, each word is hashed and the hash is then multiplied by the word's TF-IDF weight. Here, the word vectors calculated in substep 2-1 replace the word hash values, so that word senses are embedded in the fingerprint, the correlation between text locality-sensitive hash values is enhanced, and the locality-sensitive property is retained. The resulting digital fingerprint feature is given by formula (5), where d is a text, w is a word appearing in text d, a_w is the word vector of word w, and tfidf_w is the weight of word w calculated by formula (4).
$$LSH(d) = \sum_{w \in d} (a_w \times tfidf_w) \quad (5)$$
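Formula (5) is a weighted sum of word vectors, which can be sketched as below. The sum is taken over distinct words of the text, on the assumption that word frequency is already reflected in the TF-IDF weight; the toy vectors and weights are invented for illustration.

```python
def fingerprint(doc_tokens, word_vectors, weights):
    """Weighted sum of word vectors over the distinct words of a text (formula (5))."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in set(doc_tokens):
        if word not in word_vectors:
            continue  # skip words without a vector (assumed handling)
        weight = weights.get(word, 0.0)
        for k, component in enumerate(word_vectors[word]):
            total[k] += component * weight
    return total

# Toy 2-dimensional word vectors and TF-IDF weights (assumed values).
word_vectors = {"news": [1.0, 0.0], "text": [0.0, 2.0]}
weights = {"news": 0.5, "text": 0.25}
fp = fingerprint(["news", "text", "news"], word_vectors, weights)
```

Unlike a conventional hash-based fingerprint, texts that use different but related words end up with nearby fingerprints, because related words have nearby word vectors.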
Step 3: learn text semantic features based on LSTM and triplet loss. This step is divided into the following 3 substeps:
Substep 3-1: construct triplet data. Each triplet includes an Anchor instance, a Positive instance and a Negative instance. In the data set used in this embodiment, the Anchor is an original news sample, the Positive is an infringing sample of the Anchor, and the Negative is a news sample that is similar to the Anchor but does not infringe it. Similarity calculation between samples is achieved by optimizing the network so that the distance between the Anchor instance and the Positive instance is smaller than the distance between the Anchor instance and the Negative instance. All samples are news text feature matrices built from the word vectors generated in substep 2-1.
From the original text data D_A collected in step 1 and the constructed plagiarism data D_P, triplets (A_i, P_i, N_i) are built, where A_i is the Anchor instance, P_i is the Positive instance of A_i, and N_i is the Negative instance of A_i (N_i and A_i are two news reports on the same event, but neither plagiarizes the other), and A_i, P_i, N_i satisfy formula (6):
$$d(A_i, P_i) < d(A_i, N_i) < d(A_i, P_i) + \alpha \quad (6)$$
where d(A_i, P_i) represents the distance between A_i and P_i, d(A_i, N_i) represents the distance between A_i and N_i, and α is the distance margin.
In this embodiment, the LSTM network is used to extract low-dimensional features of the input data, so that each triplet takes the form (f(A_i), f(P_i), f(N_i)), where f(·) represents the extracted features. From formula (6), the distance requirement that the triplet needs to satisfy is as shown in formula (7):
Figure BDA0001971312260000061
Substep 3-2: train the LSTM network module. The target loss function of the network, obtained from formula (7), is formula (8):
Figure BDA0001971312260000062
where λ is a scale amplification factor. The network is trained using stochastic gradient descent and the back-propagation algorithm. Once the network model has converged, the trained LSTM network is obtained; its input is a text word vector matrix and its output is the normalized text semantic features.
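Since formula (8) appears only as an image in the source, the following is a sketch of the commonly used hinge-style triplet loss rather than the patent's exact expression. It assumes squared Euclidean distances and an average over the triplets, and it omits the scale factor λ because its placement cannot be recovered from the text. The features are taken as already extracted by f(·).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(triplets, alpha):
    """Mean of [ d(A, P)^2 - d(A, N)^2 + alpha ]_+ over (A, P, N) feature triplets."""
    losses = [
        max(squared_distance(a, p) - squared_distance(a, n) + alpha, 0.0)
        for a, p, n in triplets
    ]
    return sum(losses) / len(losses)

# Two toy triplets of extracted features (assumed values).
triplets = [
    ([0.0, 0.0], [1.0, 0.0], [3.0, 0.0]),  # negative far away: contributes 0
    ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]),  # negative as close as positive: contributes alpha
]
loss = triplet_loss(triplets, alpha=1.0)
```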
Substep 3-3: calculate the semantic features of the text to be detected. Using the LSTM network with the weights trained in substep 3-2, the word vector matrix of the text to be detected is taken as input to obtain the semantic features of the text to be detected.
Step 4: detect text similarity based on the fusion of digital fingerprint and semantic features. The digital fingerprint features calculated in step 2 and the semantic features extracted in step 3 are concatenated and fused, and the cosine similarity of the fused digital fingerprint and semantic features is calculated to judge whether the text is infringing. More generally, the correlation between feature vectors may be measured by any correlation or similarity method; this embodiment is described using the Pearson Correlation Coefficient (PCC) as an example, calculated as in formula (9):
$$PCC(V_X, V_A) = \frac{\sum_{i} (V_{X,i} - \bar{V}_X)(V_{A,i} - \bar{V}_A)}{\sqrt{\sum_{i} (V_{X,i} - \bar{V}_X)^2}\,\sqrt{\sum_{i} (V_{A,i} - \bar{V}_A)^2}} \quad (9)$$
where V_X and V_A respectively represent the fused digital fingerprint and semantic feature vectors of the text X to be detected and of the copyright-authenticated original text A in the copyright library, V_{X,i} represents the i-th feature of V_X, and \bar{V}_X represents the average of all features of V_X. In a specific detection scenario, the text X to be detected can come from two sources. The first is active avoidance of infringement: the text is actively submitted by a user to be compared with the news in the copyright library. The second is passive defense against infringement: texts are collected online by a crawler system, and all news that has not been authenticated is treated as text to be detected.
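A sketch of the comparison against the copyright library using the Pearson Correlation Coefficient is given below. The decision threshold of 0.9 is a hypothetical value, since the text does not state one, and the vectors are toy data standing in for fused feature vectors.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / (spread_x * spread_y)

def is_infringing(candidate, library, threshold=0.9):
    """Flag the candidate if it correlates strongly with any library entry."""
    return any(pearson(candidate, ref) >= threshold for ref in library)

# Toy fused feature vectors (assumed values).
candidate = [1.0, 2.0, 3.0]
library = [[3.0, 2.0, 1.0], [1.0, 2.0, 3.5]]
flagged = is_infringing(candidate, library)
```

The same function serves both sources of text described above: a user-submitted text is checked once on submission, while crawled texts are checked in bulk.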
Based on the same inventive concept, the embodiment of the present invention further provides a device for detecting infringement of news text based on digital fingerprints and semantic features, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is loaded into the processor, the method for detecting infringement of news text based on digital fingerprints and semantic features is implemented.

Claims (9)

1. A news text infringement detection method based on digital fingerprints and semantic features is characterized by comprising the following steps:
(1) Collecting news texts of a plurality of categories through the Internet, and accumulating a sample data set; the samples in the data set comprise original news texts and infringement samples of the news texts constructed on the basis of the original news texts according to plagiarism rules;
(2) Calculating text digital fingerprint characteristics based on an improved LSH method, comprising the following steps: calculating word vectors of news texts by using a word2vec model, calculating TF values and IDF values of words, taking TF-IDF values which are products of the TF values and the IDF values as weights of corresponding word vectors in the texts, and performing weighted summation to obtain digital fingerprint characteristics of the news texts;
(3) Constructing triplet data from the sample data set, taking the triplet data as the input of an LSTM network model, and learning text semantic features using triplet loss; comprising the following steps:
(3-1) constructing triplet data, wherein each triplet comprises an Anchor instance, a Positive instance and a Negative instance, the Anchor instance is an original news text, the Positive instance is an infringement sample constructed from that original news text, and the Negative instance is an original news text that reports the same event as the Anchor instance but does not infringe it;
(3-2) training an LSTM network module; the target loss function for the LSTM network model training is:
Figure FDA0003938245260000011
wherein A_i is the Anchor instance in a triplet, P_i is the Positive instance of A_i, N_i is the Negative instance of A_i, f(·) represents the features extracted by the LSTM network, λ is a scale amplification factor, α is the distance margin, N is the total number of triplets, ||·||_2 denotes the Euclidean distance, and [·]_+ represents max(·, 0);
(3-3) inputting the word vector matrix of the text to be detected into the LSTM network to obtain the semantic features of the text to be detected;
(4) Fusing the digital fingerprint features of the news text to be detected, which are obtained by calculation according to the method in the step (2), with the semantic features of the news text to be detected, which are extracted based on the LSTM network model trained in the step (3), calculating the similarity between the fusion features of the news text to be detected and the fusion features of the news text in the copyright library subjected to copyright authentication, and further judging whether the news text to be detected has infringement behaviors.
2. The method for detecting infringement of news text based on digital fingerprints and semantic features as claimed in claim 1, wherein in the step (1), the news text collected from the internet and the constructed infringement sample are packaged into corresponding UCL according to UCL standard.
3. The method for detecting infringement of news text based on digital fingerprint and semantic features according to claim 1, wherein the plagiarism rules according to which the infringement samples are constructed in the step (1) comprise one or more of complete copying, add/delete operations, synonym/near-synonym replacement and text structure adjustment.
4. The method for detecting infringement of news text based on digital fingerprint and semantic features according to claim 1, wherein in the step (2), the TF value of a word is calculated according to the following formula:
Figure FDA0003938245260000021
where f (w, d) represents the word frequency of word w in text d, and v represents the most frequently occurring word in text d.
5. The method for detecting infringement of news text based on digital fingerprint and semantic features according to claim 1, wherein the IDF value of a word is calculated in step (2) according to the following formula:
Figure FDA0003938245260000022
where |D| represents the total number of texts in the sample data set, and |{d ∈ D : w ∈ d}| is the number of texts containing word w.
6. A method for detecting infringement of news text based on digital fingerprint and semantic characteristics according to claim 1, wherein the digital fingerprint characteristics calculated in step (2) are represented as:
$$LSH(d) = \sum_{w \in d} (a_w \times tfidf_w)$$
where LSH(d) denotes the improved text locality-sensitive hash value of text d, used as its digital fingerprint feature; a_w denotes the word vector of word w in text d; and tfidf_w is the calculated TF-IDF value of word w.
7. The method for detecting infringement of news text based on digital fingerprints and semantic features as claimed in claim 1, wherein in the step (4), the digital fingerprint features and the semantic features of the news text to be detected are spliced and fused to obtain a fusion feature vector, and whether infringement exists is judged according to cosine similarity between the fusion feature vector and the fusion feature vector of news in a copyright library.
8. The method for detecting infringement of news texts based on digital fingerprints and semantic features as claimed in claim 1, wherein the news texts to be detected in step (4) are actively submitted by users or are crawled on the internet without copyright authentication.
9. A digital fingerprint and semantic feature based infringement detection apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program when loaded into the processor implements a digital fingerprint and semantic feature based infringement detection method according to any of claims 1-8.
CN201910119330.5A 2019-02-18 2019-02-18 News text infringement detection method and device based on digital fingerprints and semantic features Active CN109918621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910119330.5A CN109918621B (en) 2019-02-18 2019-02-18 News text infringement detection method and device based on digital fingerprints and semantic features

Publications (2)

Publication Number Publication Date
CN109918621A CN109918621A (en) 2019-06-21
CN109918621B true CN109918621B (en) 2023-02-28

Family

ID=66961674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910119330.5A Active CN109918621B (en) 2019-02-18 2019-02-18 News text infringement detection method and device based on digital fingerprints and semantic features

Country Status (1)

Country Link
CN (1) CN109918621B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant