CN107862045B

CN107862045B - Cross-language plagiarism detection method based on multiple features

Info

Publication number: CN107862045B
Application number: CN201711084337.5A
Authority: CN
Inventors: 刘刚; 胡昱临; 李光曦
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2017-11-07
Filing date: 2017-11-07
Publication date: 2022-01-14
Anticipated expiration: 2037-11-07
Also published as: CN107862045A

Abstract

The invention provides a multi-feature-based cross-language plagiarism detection method, comprising: (1) constructing a corpus; (2) constructing translation features, namely constructing translation features according to the Europeanization phenomenon and translationese problems common in translated articles, and cleaning the features by feature selection so that effective features are retained and ineffective or weakly discriminative features are filtered out; (3) feature selection, namely selecting a plurality of effective features to train a classifier that distinguishes whether one or more Chinese articles involve cross-language plagiarism; (4) plagiarism detection based on feature correspondence, namely mapping the Chinese features precisely to their English counterparts, filtering candidates according to translation features and structural features to generate plagiarism results, and finally confirming the plagiarism results through WordNet. The invention can address cross-language plagiarism using a plurality of features mined from translated text.

Description

Cross-language plagiarism detection method based on multiple features

Technical Field

The invention relates to a method for detecting whether an article has plagiarism.

Background

(1) The Europeanization phenomenon and the translationese problem in English-Chinese translation

Translation between English and Chinese brings subtle, barely perceptible changes to both languages, including pronunciation, vocabulary, grammar, and expression. Although the influence is mutual, the influence of English on Chinese is far greater than that of Chinese on English. Cross-language plagiarism detection emerged as monolingual plagiarism detection became increasingly unable to cope with the academic misconduct it encounters, and monolingual plagiarism detection techniques are not applicable to the cross-language case. Currently, the two most popular approaches to cross-language plagiarism detection are cross-language information retrieval (CLIR) and cross-language similarity detection (CLSD).

In CLIR, the two languages can be put into correspondence by means of a dictionary, a corpus, or machine translation. Research in this area must address problems such as word ambiguity and polysemy, ranking of output results, segmentation of query terms, and dependence on multilingual resources. CLIR has become popular in recent years; 7 of the 10 methods at the 2009 Cross-Language Evaluation Forum (CLEF) used machine translation directly to convert the two languages into one.

CLSD refers to cross-language similarity detection technology, which has many similarities to CLIR. In CLSD, similarities between articles in different languages are compared. There are many commonly used CLSD algorithms, mainly including language grammar-based, machine translation-based, professional dictionary-based, parallel or comparable corpus-based, and semantic web-based algorithms.

Outside China, there are many articles on cross-language plagiarism for language pairs such as English-Arabic, English-German, and English-Czech. Because of the particularities of Chinese, these articles do not address English-Chinese plagiarism detection, but work on other language pairs is of great reference value for the English-Chinese case.

(2) Use of feature engineering. Feature engineering is the process of using knowledge of the data domain to create features that enable machine learning algorithms to achieve their best performance. "Data and features determine the upper limit of machine learning, while models and algorithms only approximate this upper limit." It follows that feature engineering plays an extremely important role in machine learning. In practical applications, feature engineering is the key to successful machine learning.

For a fixed amount of data, training with too many features makes the model overly complex and prone to overfitting, so feature selection must be performed once the features have been constructed. Feature selection aims to choose, from a plurality of features, the subset with the greatest statistical significance. This screens out invalid features on the one hand, and on the other hand reduces the dimension of the feature space and the complexity of the model.

Disclosure of Invention

The invention aims to provide a multi-feature-based cross-language plagiarism detection method that can address cross-language plagiarism using a plurality of features mined from translated text.

The purpose of the invention is realized as follows:

(1) First, a corpus is constructed

The corpus is divided into a Chinese training set and a Chinese test set and contains two categories of text. The first category consists of Chinese articles with cross-language plagiarism, obtained by automatically translating English documents; the second category consists of original Chinese articles, obtained by downloading authoritative Chinese papers;

the first category is constructed as follows: a crawler collects a large number of English articles, and a program translates them automatically in batches to obtain the plagiarized documents. The batch of numbered PDF articles is then processed so that an article numbered n yields m plain text files with the file name n.m, where m is the number of paragraphs of the article. The processing mainly comprises the following three steps,

1) converting the PDF format document into an XML format document in which the text is marked up;

2) converting the text information according to the XML tags: each paragraph is enclosed between a pair of paragraph tags; the document is read in sequence, and each time the opening paragraph tag is read, a special mark is added in front of it, while all other tags and the content between them are removed, so that what remains is a document with a special mark in front of each paragraph;

3) reading the marked document with a program, and each time a special mark is read, writing the content that follows it into a plain text document and removing the special mark;

(2) construction of translation features

Translation features are constructed according to the Europeanization phenomenon and translationese problems common in translated articles, and the features are cleaned by feature selection so that effective features are retained and invalid or weakly discriminative features are filtered out;

(3) feature selection

Selecting a plurality of effective features from the features to train a classifier so as to distinguish whether one or a plurality of Chinese articles have the problem of cross-language plagiarism;

(4) plagiarism detection based on feature correspondence

The Chinese features are mapped precisely to their English counterparts, candidates are filtered according to translation features and structural features to generate plagiarism results, and the plagiarism results are finally confirmed through WordNet;

the detection mainly comprises four stages: the first stage, plagiarism candidate set preprocessing, in which the Chinese and English plagiarism candidate sets are divided into paragraphs and tagged with parts of speech; the second stage, primary filtering, in which precise feature correspondence is established according to the translation features and a paragraph distance calculation algorithm is applied; the third stage, secondary filtering, in which the results are filtered again according to structural features; and the fourth stage, final result confirmation, in which WordNet is used to confirm the plagiarism results and obtain the final plagiarism result.

The difference in language between articles is the main obstacle to detecting cross-language plagiarism. A common approach is to unify the representation of the two languages, but the unification process is itself inaccurate. Some unified forms work poorly: for example, translating both texts into the same language and then comparing them places high demands on translation quality, and different translations affect the outcome considerably, so the plagiarism detection result is uncertain. There is also the word alignment problem. During detection, word alignment across languages is achieved through a thesaurus or a parallel corpus. Such a thesaurus or parallel corpus is large and requires multiple probabilistic dictionaries, which is costly in labor and resources and difficult to extend to more languages.

The invention takes the particularities of Chinese into account and detects whether an article involves plagiarism by searching for writing styles that do not conform to Chinese usage, thereby narrowing the scope of plagiarism detection. The problems of poor translation quality and inaccurate disambiguation are thus avoided, and cross-language plagiarism is detected by a distinct method. The method is significant for detecting academic misconduct in Chinese, and helps to improve academic standards and the level of scientific research.

The invention first provides a method for automatically dividing paragraphs. It then applies two rounds of filtering, based on translation feature correspondence and on structural feature correspondence, to remove paragraphs that do not meet the conditions from the plagiarism candidates and retain those that do, and finally confirms the plagiarism results using WordNet. The experimental results are then verified: comparing the similarity computation with and without the two rounds of filtering shows that the filtering is effective.

Analysis of results

After segmentation, each article is divided into paragraphs. For a particular test set, each paragraph of each article is assigned a fixed number, and this number is used as the document name. For example, C307 denotes paragraph 07 of article 3 in the Chinese candidate set. Similarly, the English test set is denoted by E. For statistical convenience, the following experiment takes 1000 paragraphs from the first 142 articles as the test set.

For each Chinese paragraph, all English paragraphs that do not meet the conditions are filtered out. Of the 1000 paragraphs compared, 749 retain exactly one suspicious paragraph after the two rounds of filtering; 736 of these match the plagiarized source paragraph exactly, and only 13 are matched incorrectly. Of the remaining 251 paragraphs, 24 have no suspicious paragraph left after the two rounds of filtering, and 227 retain multiple suspicious paragraphs. For these 227 paragraphs, the specific plagiarized paragraph must be selected, and this final confirmation is carried out with the WordNet dictionary. Statistics show that 220 of the 227 paragraphs are matched to the correct plagiarism result and only 7 are selected incorrectly; the errors arise when WordNet computes similarity and the correct paragraph does not receive the maximum similarity. Even so, the plagiarized paragraph is among the suspicious paragraphs that remain after filtering, which indirectly demonstrates the effectiveness of the filtering.

Through comparison and verification of 1000 paragraphs, the accuracy achieved by the experiment is 74%. In the confirmation of the final result, the present invention is more effective than the method without the two rounds of filtering.
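The counts reported above can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the variable names are illustrative. It also prints the share of paragraphs resolved correctly by filtering alone (736 of 1000), which rounds to the stated 74%. Whether that is how the stated accuracy was derived is an assumption, since the text does not say.

```python
# Paragraph counts reported in the results analysis above.
total = 1000
single_candidate = 749   # exactly one suspicious paragraph left after both filters
single_correct = 736
single_wrong = 13
no_candidate = 24        # no suspicious paragraph left after both filters
multi_candidate = 227    # several suspicious paragraphs left
multi_correct = 220      # resolved correctly by the WordNet step
multi_wrong = 7

# The reported groups partition the 1000 test paragraphs.
assert single_correct + single_wrong == single_candidate
assert single_candidate + no_candidate + multi_candidate == total
assert multi_correct + multi_wrong == multi_candidate

filter_only_rate = single_correct / total
wordnet_rate = multi_correct / multi_candidate
print(f"resolved correctly by filtering alone: {filter_only_rate:.1%}")
print(f"resolved correctly by the WordNet step: {wordnet_rate:.1%}")
```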

Drawings

Fig. 1 is the technical route of cross-language plagiarism detection and classification.

Fig. 2 is a roadmap of the cross-language plagiarism detection method.

Fig. 3 is an overall framework diagram of cross-language plagiarism detection.

Fig. 4 is a diagram of the cross-language plagiarism detection module.

Fig. 5 is a detailed flow diagram of automatic paragraph segmentation.

Detailed Description

For cross-language plagiarism, it should first be determined whether an article involves cross-language plagiarism, and only then which paragraphs or parts of the article are plagiarized. To this end, the invention discovers and selects effective translation features from Chinese articles with cross-language plagiarism, assigns them different feature weights, and constructs a cross-language plagiarism classification model that classifies a given Chinese article as possibly plagiarized or not plagiarized.

The invention aims to address cross-language plagiarism using a plurality of features mined from translated text, by constructing a multi-feature-based cross-language plagiarism detection technique. The invention first analyzes and summarizes the state of research on monolingual and bilingual plagiarism detection, and provides a multi-feature-based cross-language plagiarism detection model comprising multi-feature-based cross-language plagiarism classification and cross-language plagiarism detection based on feature correspondence. In this process, the chi-square test algorithm is applied to feature selection; building on the chi-square test, the number of occurrences of a feature in the text and the stability of the feature within a category are also taken into account, and the weight of each feature is then calculated with a new feature weighting method. The selected features are associated with their English forms of expression, and an algorithm for calculating the feature distance between paragraphs is provided to compare corresponding Chinese and English paragraphs. Structural feature correspondence compares the structures of Chinese and English paragraphs, keeping paragraphs with similar structures and filtering out paragraphs that differ greatly. Finally, the detection result is verified and confirmed by a WordNet-based method, achieving the goal of cross-language detection.

The technical means of the invention mainly comprises:

1. A corpus is constructed and divided into a Chinese training set and a Chinese test set.

Because the corpus serves as a training set for supervised classification of articles, it must contain two categories. The first consists of Chinese articles with cross-language plagiarism, which can be obtained by automatically translating English documents. The second consists of original Chinese articles, which can be obtained by downloading authoritative Chinese papers.

Of the two, the second category is relatively easy to obtain. A corpus of the first category hardly exists in China or abroad, so the invention first uses a crawler to collect a large number of English articles and obtains the plagiarized Chinese documents by automatic batch translation.

The processing handles batches of numbered articles in PDF format: an article numbered n yields m plain text files with the file name n.m, where m is the number of paragraphs of the article.

The processing comprises the following three key steps.

(1) The document in PDF format is converted into an XML format document in which the text is marked up. XML stands for Extensible Markup Language and is used primarily to store data and structure. HTML, by comparison, is a loose Web language with lax syntax. The document is therefore converted into XML, whose tags mark each paragraph clearly and provide a good basis for paragraph division. In addition, after a PDF is converted into XML, the headers and footers in the article are removed automatically, which saves the trouble of removing them manually.

(2) The text information is converted according to the XML tags. Each paragraph is enclosed between a pair of paragraph tags. The document is read in sequence; each time the opening paragraph tag is read, a special mark is added in front of it, and all other tags and the content between them are removed. What remains is a document with a special mark in front of each paragraph.

(3) A program reads the document to which the special marks have been added. Each time a special mark is read, the content that follows it is written into a plain text document and the special mark is removed. Each such plain text document corresponds to one paragraph of the article.
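Steps (2) and (3) can be sketched as follows, assuming step (1) has already produced the XML text. The paragraph tag name `<p>`, the mark string `@@PARA@@`, and the function names are all hypothetical, since the text names none of them. Reading the file name n.m as "article n, paragraph k" for k = 1..m is likewise an assumption.

```python
import re
from pathlib import Path

PARA_OPEN = "<p>"     # hypothetical paragraph tag; the source does not name it
PARA_CLOSE = "</p>"
MARK = "@@PARA@@"     # hypothetical special mark


def mark_paragraphs(xml_text: str) -> str:
    """Step (2): keep paragraph contents, prefix each with MARK, drop everything else."""
    pattern = re.escape(PARA_OPEN) + r"(.*?)" + re.escape(PARA_CLOSE)
    bodies = re.findall(pattern, xml_text, flags=re.DOTALL)
    # Remove any nested tags left inside a paragraph body.
    cleaned = [re.sub(r"<[^>]+>", "", body).strip() for body in bodies]
    return "".join(MARK + text for text in cleaned if text)


def split_marked(marked: str) -> list[str]:
    """Step (3): cut the marked document at every MARK and drop the mark itself."""
    return [part for part in marked.split(MARK) if part]


def write_paragraphs(article_no: int, paragraphs: list[str], out_dir: Path) -> list[Path]:
    """Write paragraph k of article n to a plain text file named n.k."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for k, text in enumerate(paragraphs, start=1):
        path = out_dir / f"{article_no}.{k}"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


sample = "<doc><header>Journal</header><p>First <b>para</b>.</p><table>t</table><p>Second.</p></doc>"
print(split_marked(mark_paragraphs(sample)))
```

A regular expression is used here only to keep the sketch short; a real converter's output may need a proper XML parser and attribute-aware tag matching.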

2. Construction of translation features. Features are essential in machine learning, and the accuracy of a learner depends greatly on how well the features are constructed. In this method, translation features are constructed according to the Europeanization phenomenon and translationese problems common in translated articles, and the features are cleaned by feature selection so that effective features are retained and invalid or weakly discriminative features are filtered out. Feature selection is performed with the chi-square test (CHI) method.

3. Feature selection. Not all constructed features are effective; a plurality of effective features is selected from them to train a classifier that distinguishes whether one or more Chinese articles involve cross-language plagiarism.

4. Plagiarism detection based on feature correspondence. The Chinese features are mapped precisely to their English counterparts, candidates are filtered according to translation features and structural features to generate plagiarism results, and the plagiarism results are finally confirmed through WordNet. The Chinese and English features are put into correspondence, and paragraphs whose features do not correspond are filtered out. After the distance between paragraphs has been calculated on the corresponding translation features, each Chinese paragraph in the plagiarism candidate set can filter out a large number of paragraphs that do not meet the correspondence conditions, leaving a few suspicious English plagiarism paragraphs, so the range of plagiarism results is greatly reduced. Structural features are then extracted to filter the plagiarism results a second time. Five structural features are selected for further screening and filtering of the plagiarism candidate set: the length of the sentence, the length of the nouns in the sentence, the length of the verbs in the sentence, the length of the adjectives in the sentence, and the length of the adverbs in the sentence.

The detection is mainly divided into four stages. The first stage is plagiarism candidate set preprocessing: the Chinese and English plagiarism candidate sets are divided into paragraphs and tagged with parts of speech, which facilitates accurate location of plagiarized positions and later processing. The second stage is primary filtering: precise feature correspondence is established according to the translation features, and the paragraph distance calculation algorithm is applied. The third stage is secondary filtering: the plagiarism results are filtered again according to structural features, implementing a filtering algorithm based on structural features. The fourth stage is final result confirmation: WordNet is used to confirm the plagiarism results and obtain the final plagiarism result.
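The fourth stage amounts to picking, among the candidates that survive both filters, the one with the highest similarity to the Chinese paragraph. The sketch below shows only that selection step. It is not the WordNet computation: a plain Jaccard word overlap is swapped in as a stand-in similarity so the example runs without external resources, and it assumes the Chinese paragraph's content words have already been mapped to English, which the text does not describe. All names are hypothetical.

```python
def jaccard(a_words, b_words):
    """Stand-in similarity: share of distinct words the two word lists have in common."""
    a, b = set(a_words), set(b_words)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def confirm(query_words, candidates, similarity=jaccard):
    """Return the name of the candidate most similar to the query, or None if none remain."""
    if not candidates:
        return None
    return max(candidates, key=lambda name: similarity(query_words, candidates[name]))


# Hypothetical data: English glosses of one Chinese paragraph and two surviving candidates.
query = ["model", "train", "feature", "classifier"]
remaining = {
    "E101": ["feature", "classifier", "train", "data"],
    "E102": ["river", "bridge", "train"],
}
print(confirm(query, remaining))
```

Passing the similarity as a parameter keeps the selection logic unchanged if a WordNet-based measure is plugged in instead.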

The main processes of the present invention will be described in more detail with reference to the accompanying drawings.

First part, text preprocessing

Input: the text information to be analyzed

Output: the word set after word segmentation

The sentence is taken as the unit and serves as the basis for term extraction. At this stage, a data set is extracted from the text through sentence splitting, word segmentation, and stop word filtering.

Second part, vocabulary filtering (stop word filtering)

After Chinese word segmentation of a text, individual word sequences are obtained. It is not difficult to see that nouns and verbs account for most of the semantic content of a sentence, while words in other modifying roles have little influence on its meaning, so the nouns and verbs that affect the meaning of the sentence need to be retained; these are called retained words. However, the segmented sequences also contain words that appear in the text with high frequency but have little actual influence on text analysis. These words are mainly modal particles, prepositions, adverbs, and the like; they have no definite meaning of their own and only play a role as part of a sentence. Such words are called stop words. If the irrelevant words in the domain text can be filtered out completely, the storage space of the system is greatly reduced, as are the workload and computation of later stages in the verification system. The design of the preprocessing part therefore requires not only selecting and suitably improving the word segmentation and tagging algorithms, but also properly designing the word filtering process. The process is shown in Fig. 3.

Precondition: the word segmentation operation has been performed

Input: the vocabulary set to be filtered

Output: the text vocabulary set after vocabulary filtering

Step 1: After Chinese word segmentation, a vocabulary set to be filtered is obtained.

Step 2: Load the stop word text, read a word from the set, and search for it in the stop word text. If the word is found, it is filtered out; otherwise, it is retained.
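Steps 1 and 2 above can be illustrated with a minimal sketch. It assumes word segmentation has already produced a list of words and that the stop word text has been loaded into a list; the function name and the sample words are illustrative only.

```python
def filter_stop_words(words, stop_words):
    """Return the words that are not in the stop word list, preserving their order."""
    stop = set(stop_words)  # set lookup keeps each membership test cheap
    return [w for w in words if w not in stop]


# Hypothetical segmented sentence and stop word list.
segmented = ["我们", "的", "方法", "在", "实验", "中", "提高", "了", "准确率"]
stop_list = ["的", "在", "中", "了"]
print(filter_stop_words(segmented, stop_list))
```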

Third part, feature construction

The present invention summarizes certain types of text features, and the forms these features take, that are commonly found in plagiarized text and are not commonly found in non-plagiarized text.

Precondition: the vocabulary filtering operation is complete

Output: Europeanized features

Step 3: Select the Europeanized features.

Step 4: Return the selected feature set.

Fourth part, feature selection

Precondition: feature construction is complete.

Input: feature set

Output: selected feature set

Step 5: Perform the chi-square test on the features.

Step 6: Score each feature by the test to find an appropriate weight for each feature, and sort the features by score.
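Steps 5 and 6 can be sketched with the standard chi-square statistic for a feature and a class, computed from a 2x2 table of document counts. This is the plain CHI score only; the modified weighting described elsewhere in this document (which also considers feature frequency and stability within a category) is not specified in enough detail to reproduce. The feature names and sample documents are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a feature for a class.

    a: plagiarized docs containing the feature   b: original docs containing it
    c: plagiarized docs lacking the feature      d: original docs lacking it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_features(doc_features, labels):
    """Score every feature and return (feature, score) pairs, highest score first.

    doc_features: one set of feature names per document
    labels: 1 for a plagiarized (translated) document, 0 for an original one
    """
    scores = {}
    for f in set().union(*doc_features):
        a = sum(1 for feats, y in zip(doc_features, labels) if f in feats and y == 1)
        b = sum(1 for feats, y in zip(doc_features, labels) if f in feats and y == 0)
        c = sum(1 for feats, y in zip(doc_features, labels) if f not in feats and y == 1)
        d = sum(1 for feats, y in zip(doc_features, labels) if f not in feats and y == 0)
        scores[f] = chi_square(a, b, c, d)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical features: "passive" separates the classes, "long_modifier" does not.
docs = [{"passive", "long_modifier"}, {"passive"}, {"long_modifier"}, set()]
labels = [1, 1, 0, 0]
print(rank_features(docs, labels))
```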

Fifth part, SVM model construction

The SVM is currently one of the most widely used and most effective classifiers. It is based on the principle of structural risk minimization and has strong generalization ability; because training solves a convex quadratic programming problem, any local optimum is also the global optimum.

Precondition: feature selection is complete.

Input: the selected Europeanized features.

Output: an SVM model based on the Europeanized features.

Step 7: Select the parameter C and replace the inner product with the RBF kernel to obtain the dual problem of the SVM.

Step 8: Update the variables that violate the constraints using the SMO algorithm until all variables satisfy the KKT conditions.
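The two ingredients named in Steps 7 and 8 can be sketched in isolation: the RBF kernel that replaces the inner product, and the KKT check that decides whether a dual variable still needs updating. This is not a full SMO solver; the function names, the kernel width `gamma`, and the tolerance are assumptions made for illustration.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2), used in place of the inner product."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_x, support_y, alphas, b, gamma=0.5):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf_kernel(sx, x, gamma)
               for a, y, sx in zip(alphas, support_y, support_x)) + b


def violates_kkt(alpha, y, f_x, C, tol=1e-3):
    """True if (alpha, y, f(x)) breaks the soft-margin KKT conditions.

    alpha == 0      requires y * f(x) >= 1
    0 < alpha < C   requires y * f(x) == 1
    alpha == C      requires y * f(x) <= 1
    """
    margin = y * f_x
    if alpha < tol:
        return margin < 1 - tol
    if alpha > C - tol:
        return margin > 1 + tol
    return abs(margin - 1) > tol


print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))
print(violates_kkt(alpha=0.0, y=1, f_x=0.5, C=1.0))
```

An SMO loop would repeatedly pick a pair of variables for which `violates_kkt` is true, solve the two-variable subproblem analytically, and stop once no violators remain.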

Sixth part, automatic paragraph segmentation

Because the articles are stored in PDF format, it is necessary to consider how to convert PDF accurately into segmented TXT plain text. Many third-party programs and open source libraries for converting PDF into TXT are available online, but they do not work well: the results contain many content and format errors, which seriously affect the quality of the corpus and lead to inaccurate results later. The invention therefore takes a particular approach to achieve this goal.

Precondition: SVM construction is complete

Input: text

Output: divided paragraphs

Step 9: Convert the PDF format document into an XML format document in which the text is marked up.

Step 10: Convert the text information according to the XML tags.

Step 11: Read the document to which the special marks have been added with a program.

Seventh part, English part-of-speech tagging

Precondition: paragraph segmentation is complete

Input: divided paragraphs

Output: part-of-speech tagging results

Step 12: Filtering of tables and pictures: tables and pictures carry their own special marks in the XML document, which differ from the marks of paragraphs, and are therefore not read during segmentation.

Step 13: Filtering of titles and merging of paragraphs: on the one hand, the mark of a title sometimes coincides with the mark of a paragraph, so titles must be identified by their own mark and filtered out. On the other hand, when PDF is converted into XML, a paragraph may be truncated and split into two or more segments, which must then be merged.

Eighth part, primary filtering of plagiarism results based on translation features

Precondition: part-of-speech tagging is complete

Input: tagged text

Output: text after primary filtering

Step 14: Align the Chinese and English text according to the features.

Step 15: Calculate the paragraph distances of the text.
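The text does not give the paragraph distance formula, so the following is only one possible reading of Steps 14 and 15. It assumes each paragraph is summarized as counts of translation features, with each Chinese feature already mapped to its English form of expression, and it uses the sum of absolute count differences as the distance. The feature names, document names, and threshold are hypothetical.

```python
def feature_distance(zh_counts, en_counts):
    """Sum of absolute differences between aligned feature counts (assumed measure)."""
    keys = set(zh_counts) | set(en_counts)
    return sum(abs(zh_counts.get(k, 0) - en_counts.get(k, 0)) for k in keys)


def primary_filter(zh_counts, candidates, max_distance):
    """Keep the English candidates whose feature distance is within max_distance."""
    return [name for name, en_counts in candidates.items()
            if feature_distance(zh_counts, en_counts) <= max_distance]


# Hypothetical feature counts for one Chinese paragraph and three English candidates.
zh = {"passive": 2, "long_modifier": 1}
english = {
    "E101": {"passive": 2, "long_modifier": 1},
    "E102": {"passive": 0, "long_modifier": 4},
    "E103": {"passive": 1, "long_modifier": 1},
}
print(primary_filter(zh, english, max_distance=1))
```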

Ninth part, secondary filtering of plagiarism results

Step 16: Given a Chinese plagiarism paragraph and the primarily filtered plagiarism results from the previous part, the structural features are compared one by one; if the difference in any feature exceeds a specific threshold, that candidate paragraph is removed from the plagiarism results, and the paragraphs that remain are the plagiarism results after secondary filtering.
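Step 16 can be sketched as a threshold test on the five structural features. Several points are assumptions: the "length" of nouns, verbs, adjectives, and adverbs is read here as their count in the tagged text, the coarse tag names are hypothetical, and the comparison uses a relative difference, none of which the text specifies.

```python
TAGS = ("NOUN", "VERB", "ADJ", "ADV")  # hypothetical coarse part-of-speech tags


def structural_features(tagged):
    """Return [token count, noun count, verb count, adjective count, adverb count]."""
    counts = {t: 0 for t in TAGS}
    for _, tag in tagged:
        if tag in counts:
            counts[tag] += 1
    return [len(tagged)] + [counts[t] for t in TAGS]


def secondary_filter(zh_feats, candidates, threshold):
    """Drop every candidate whose relative difference on any feature exceeds threshold."""
    kept = []
    for name, en_feats in candidates.items():
        diffs = [abs(a - b) / max(a, b, 1) for a, b in zip(zh_feats, en_feats)]
        if max(diffs) <= threshold:
            kept.append(name)
    return kept


tagged = [("method", "NOUN"), ("works", "VERB"), ("very", "ADV"), ("well", "ADV")]
print(structural_features(tagged))

# Hypothetical feature vectors for one Chinese paragraph and two English candidates.
zh_feats = [10, 3, 2, 1, 1]
candidates = {"E101": [11, 3, 2, 1, 1], "E102": [30, 9, 6, 3, 2]}
print(secondary_filter(zh_feats, candidates, threshold=0.3))
```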

Claims

1. A cross-language plagiarism detection method based on multiple characteristics is characterized in that:

(1) constructing a corpus;

(2) construction of translation features

(3) feature selection

(4) plagiarism detection based on feature correspondence

the plagiarism detection based on feature correspondence is divided into four stages: the first stage, plagiarism candidate set preprocessing, in which the Chinese and English plagiarism candidate sets are divided into paragraphs and tagged with parts of speech; the second stage, primary filtering, in which precise feature correspondence is established according to the translation features and a paragraph distance calculation algorithm is applied; the third stage, secondary filtering, in which the results are filtered again according to structural features; the fourth stage, final result confirmation, in which WordNet is used to confirm the plagiarism results and obtain the final plagiarism result; five structural features are selected for further screening and filtering of the plagiarism candidate set: the length of the sentence, the length of the nouns in the sentence, the length of the verbs in the sentence, the length of the adjectives in the sentence, and the length of the adverbs in the sentence.

2. The multi-feature-based cross-language plagiarism detection method according to claim 1, wherein constructing the corpus specifically comprises:

the first corpus is constructed as follows: a crawler collects a large number of English articles, and a program translates them automatically in batches to obtain the plagiarized documents; the batch of numbered PDF articles is processed so that an article numbered n yields m plain text files with the file name n.m, where m is the number of paragraphs of the article; the processing comprises the following three steps,

3) the program is used for reading the document added with the special mark, writing the content behind the special mark into a plain text document every time the special mark is read, and removing the special mark.