CN107885706A - A system for data similarity detection

Info

Publication number: CN107885706A
Application number: CN201711077910.XA
Authority: CN
Inventors: 崔垒
Original assignee: Foshan Zhangyang Technology Co Ltd
Current assignee: Foshan Zhangyang Technology Co Ltd
Priority date: 2017-11-06
Filing date: 2017-11-06
Publication date: 2018-04-06

Abstract

The invention provides a system for data similarity detection, comprising an input data module, a data register module, a database module, a screening module, a pretreatment module, a pre-detection module, a result output module, and a computing module. Through the cooperation of the functions of these modules, the system curbs the increasingly serious plagiarism of papers at colleges and universities, raises undergraduates' attention to their theses, improves the novelty and initiative of graduates' theses, and prevents excessive quotation of other people's documents.

Description

A system for data similarity detection

Technical field

The present invention relates to the field of data detection, and more particularly to a system for data similarity detection.

Background art

Network information technology, communication technology, digital libraries, and digital distributed media have provided the necessary conditions for information resources to be shared and disseminated efficiently. These emerging network resources have become the medium for storing information resources in contemporary society. Because online information resources are mostly stored on open, shared platforms, people can obtain and disseminate this information more conveniently, quickly, and effectively. At the same time, however, this provides an ideal hotbed for plagiarists, making plagiarism simpler and more convenient. News reports of alleged plagiarism and appropriation of the fruits of other people's labor are now commonplace. Such conduct violates moral norms and may even constitute a crime; in an era of strict intellectual property protection, it has provoked condemnation and reflection. Plagiarism has a significant negative impact on society: like a moth eating away at it, plagiarism hinders the healthy development of scientific knowledge and technical research in China, and it has drawn the concern and attention of society. The serious consequences of plagiarism are self-evident. First, plagiarism infringes the author's moral rights in the copyright. Moral rights are the exclusive rights that the law grants an author in an original work, and they mainly protect the author's right of publication and right of revision. They affirm and protect the fruits of researchers' labor, thereby encouraging more people to pursue innovative research in knowledge and technology and promoting scientific and technological progress. Second, plagiarism also infringes the author's property rights. A work is the crystallization of wisdom produced by the author's hard work, and it naturally earns a corresponding reward from society; this reward is also a recognition of the value of the author's contribution and a force that drives society forward.

As for plagiarism, while exposing and condemning it, we should also strengthen prevention through sound mechanisms, so that digital products that have already been plagiarized are not published in digital media.

At the same time, paper plagiarism at colleges and universities is becoming increasingly serious, and undergraduates do not pay enough attention to their theses, which seriously affects the initiative, innovativeness, and authenticity of graduates' work. A data similarity detection system is therefore urgently needed to improve graduates' capacity for innovation in their theses and to prevent excessive quotation of and borrowing from other people's documents.

Summary of the invention

The object of the present invention is to provide a system for data similarity detection that curbs the increasingly serious plagiarism of papers at colleges and universities, raises undergraduates' attention to their theses, improves the novelty and initiative of graduates' theses, and prevents excessive quotation of other people's documents.

A system for data similarity detection comprises an input data module, a data register module, a database module, a screening module, a pretreatment module, a pre-detection module, a result output module, and a computing module;

The input data module is connected with the data register module;

The data register module is connected with the database;

The database module is connected with the screening module;

The screening module is connected with the pretreatment module; the screening module is connected with the computing module;

The pretreatment module is connected with the pre-detection module;

The computing module is connected with the output module;

The computing module includes a text pretreatment module, a Chinese word segmentation module, an improved similarity calculation module, and text similarity computing;

The text pretreatment module is connected with the Chinese word segmentation module;

The Chinese word segmentation module is connected with the improved similarity calculation module;

The improved similarity calculation module is connected with the text similarity computing.

Specifically, the database module is a paper database module.

Specifically, the input data module and the data register module are operated by an administrator.

Specifically, the screening module is used to obtain the papers related to the paper to be checked for duplication, filtering out the majority of papers.

Specifically, the relevant information of the paper to be checked for duplication includes the subject, specialty, thesis title, paper keywords, and similar content.

Specifically, the computing module is used to calculate the sentence similarity of papers.

Specifically, the text pretreatment module splits the text into sentences, and the resulting sentences are arranged and saved in order of their position in the text.

As can be seen from the above technical solution, the system for data similarity detection provided by the invention comprises an input data module, a data register module, a database module, a screening module, a pretreatment module, a pre-detection module, a result output module, and a computing module. Through the cooperation of the functions of these modules, it curbs the increasingly serious plagiarism of papers at colleges and universities, raises undergraduates' attention to their theses, improves the novelty and initiative of graduates' theses, and prevents excessive quotation of other people's documents.

Brief description of the drawings

Some specific embodiments of the present invention are described in detail below, by way of example and not by way of limitation, with reference to the accompanying drawings. Identical reference numerals in the drawings denote the same or similar components or parts. Those skilled in the art should understand that these drawings are not necessarily drawn to scale.

Fig. 1 is a structural diagram of the data similarity detection system of an embodiment of the present application.

Fig. 2 is a structural diagram of the computing module of the data similarity detection system of an embodiment of the present application.

Reference numerals: 1, input data module; 2, data register module; 3, database module; 4, screening module; 5, pretreatment module; 6, pre-detection module; 7, result output module; 8, computing module; 81, text pretreatment module; 82, Chinese word segmentation module; 83, improved similarity calculation module; 84, text similarity computing.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1 and Fig. 2, a system for data similarity detection comprises an input data module 1, a data register module 2, a database module 3, a screening module 4, a pretreatment module 5, a pre-detection module 6, a result output module 7, and a computing module 8;

The input data module is connected with the data register module 2;

The data register module 2 is connected with the database 3;

The database module 3 is connected with the screening module 4;

The screening module 4 is connected with the pretreatment module 5; the screening module 4 is connected with the computing module 8;

The pretreatment module 5 is connected with the pre-detection module 6;

The computing module 8 is connected with the output module 7;

The computing module 8 includes a text pretreatment module 81, a Chinese word segmentation module 82, an improved similarity calculation module 83, and text similarity computing 84;

The text pretreatment module 81 is connected with the Chinese word segmentation module 82;

The Chinese word segmentation module 82 is connected with the improved similarity calculation module 83;

The improved similarity calculation module 83 is connected with the text similarity computing 84.
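
The top-level connections listed above can be pictured as a small graph. The following is only an illustrative sketch: the source says the modules are "connected" without stating a direction, so the direction of each edge, the name `CONNECTIONS`, and the helper `reachable` are assumptions made for the example.

```python
# Hypothetical adjacency map of the module connections listed above.
# Edge direction (data flowing from input toward output) is an assumption.
CONNECTIONS = {
    "input data module 1": ["data register module 2"],
    "data register module 2": ["database module 3"],
    "database module 3": ["screening module 4"],
    "screening module 4": ["pretreatment module 5", "computing module 8"],
    "pretreatment module 5": ["pre-detection module 6"],
    "computing module 8": ["result output module 7"],
}


def reachable(start, graph=CONNECTIONS):
    """Return every module reachable from `start`, in depth-first order."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            # Push neighbours in reverse so they are visited in listed order.
            stack.extend(reversed(graph.get(node, [])))
    return seen


if __name__ == "__main__":
    for name in reachable("input data module 1"):
        print(name)
```

Under the assumed direction, all eight modules are reachable from the input data module, and the result output module is reached only through the computing module.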

Only an administrator has permission to use the input data module 1 and the data register module 2 to import papers into the database and build the local paper database.

The data in the database module 3 consists of each paper's subject, specialty, thesis title, paper keywords, original text, and the text obtained after the paper is split into sentences.

The main purpose of the screening module 4 is to filter out the majority of the paper documents in the back-end data and obtain the papers related to the document being checked for duplication, thereby improving the running efficiency of the system. The core of this module is described as follows. Step 1: First, the subject, specialty, thesis title, and paper keywords of the paper to be checked are extracted and assigned different weights α, β, γ, and δ respectively; the papers in the database with the same subject are then selected according to subject. Step 2: The ICTCLAS Chinese word segmentation system is used to segment the above four character strings and extract feature words; similarity is then calculated against the subject, specialty, thesis title, and paper keywords of each paper selected in Step 1, giving the similarities SimA, SimB, SimC, and SimD respectively.

Step 3: The four similarities above are weighted to obtain the total similarity, calculated as follows: Sim(A, B, C, D) = α × SimA + β × SimB + γ × SimC + δ × SimD. When the calculated total similarity Sim(A, B, C, D) is greater than the given threshold of 0.7, the two papers are considered possibly similar; the text of the paper is then split into sentences, sentence similarity is calculated, and similar sentences are recorded and saved. Here the subject weight α = 0.2, the specialty weight β = 0.1, the thesis title weight γ = 0.4, and the keyword weight δ = 0.3.
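
The weighted sum in Step 3 can be sketched as follows, using the weights and the 0.7 threshold stated above. The per-field similarities SimA to SimD are taken as inputs, because the source does not say how each one is computed; the names `WEIGHTS`, `total_similarity`, and `is_candidate` are hypothetical.

```python
# Weights and threshold as stated in the description.
WEIGHTS = {"subject": 0.2, "specialty": 0.1, "title": 0.4, "keywords": 0.3}
THRESHOLD = 0.7


def total_similarity(sims, weights=WEIGHTS):
    """Sim(A, B, C, D) = alpha*SimA + beta*SimB + gamma*SimC + delta*SimD."""
    return sum(weights[field] * sims[field] for field in weights)


def is_candidate(sims, threshold=THRESHOLD):
    """A stored paper is passed on to sentence-level comparison only when
    its total similarity is strictly greater than the threshold."""
    return total_similarity(sims) > threshold


if __name__ == "__main__":
    # Illustrative per-field similarities (made-up values).
    sims = {"subject": 1.0, "specialty": 0.5, "title": 0.8, "keywords": 0.6}
    print(total_similarity(sims), is_candidate(sims))
```

With these example values the total is 0.2 + 0.05 + 0.32 + 0.18 = 0.75, so the paper would go on to sentence-level comparison. Because the four weights sum to 1, the total stays in [0, 1] whenever each field similarity does.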

The text pretreatment module 81 mainly splits the text into sentences; the resulting sentences are arranged and saved in order of their position in the text for use in the subsequent processing steps.
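
Sentence splitting that preserves position can be sketched as below. The set of sentence-ending punctuation marks is an assumption, since the source does not list the delimiters it uses, and `split_sentences` is a hypothetical name.

```python
import re

# Assumed sentence terminators (full-width and ASCII); not given in the source.
_TERMINATORS = "。！？；!?;"
_SENTENCE = re.compile(r"[^{0}]+[{0}]?".format(re.escape(_TERMINATORS)))


def split_sentences(text):
    """Split `text` into sentences and return (position, sentence) pairs,
    where position is the sentence's order of appearance in the text."""
    sentences = [s.strip() for s in _SENTENCE.findall(text)]
    sentences = [s for s in sentences if s]  # drop whitespace-only pieces
    return list(enumerate(sentences))


if __name__ == "__main__":
    for position, sentence in split_sentences("今天天气很好。你吃饭了吗？我们走吧"):
        print(position, sentence)
```

Storing the position alongside each sentence keeps the original order available to the later similarity steps.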

In the Chinese word segmentation module 82, because the similarity being calculated is for Chinese documents, non-Chinese characters that occur in the sentences, such as special symbols, English letters, and digits, can optionally be filtered out. The sentences are then segmented using the ICTCLAS Chinese word segmentation system, and stop words are removed after segmentation.
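
The filtering before segmentation and the stop-word removal after it can be sketched as follows. Segmentation itself is not shown, because the source names ICTCLAS without describing how it is called; the token list is therefore assumed to be already segmented. The character range, the tiny stop-word set, and the function names are all assumptions for illustration.

```python
import re

# Assumption: "Chinese characters" means the basic CJK Unified Ideographs block.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"的", "了", "是"}


def keep_chinese(sentence):
    """Remove symbols, English letters, digits, and other non-Chinese characters."""
    return _NON_CHINESE.sub("", sentence)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens and stop words from an already segmented sentence."""
    return [t for t in tokens if t and t not in stop_words]


if __name__ == "__main__":
    print(keep_chinese("数据Data相似度2017检测!"))
    print(remove_stop_words(["数据", "的", "相似度"]))
```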

The improved similarity calculation module 83 calculates the similarity between sentences using a method that combines a semantic dependency tree with an improved edit distance.
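
The source does not spell out the improved edit distance or how the semantic dependency tree is combined with it. As a point of reference only, the sketch below uses the plain (unimproved) Levenshtein distance over word tokens and normalizes it into a similarity score; it is not the patent's algorithm, and the normalization and function names are assumptions.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two token sequences
    (unit cost for insertion, deletion, and substitution)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def sentence_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer sentence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    s1 = ["数据", "相似度", "检测"]
    s2 = ["数据", "相似度", "计算"]
    print(sentence_similarity(s1, s2))
```

Working on word tokens rather than characters means a replaced word counts as a single edit; a weighted variant could charge less for substitutions between semantically close words, which may be the kind of refinement the "improved" algorithm refers to.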

The text similarity computing 84 divides the document into sentences and then uses the algorithm based on the semantic dependency tree and the improved edit distance to obtain the similarity of sentences; when the similarity of two sentences is greater than a threshold, the two sentences are considered similar. The total number of similar sentences is then counted as a percentage of the total number of sentences into which the text being checked is divided, giving the similarity of the text, and the check result is output in the form of an Excel spreadsheet. The overall similarity of the document to be checked is calculated as follows: SimAll = n/m × 100%, where SimAll is the text similarity of the document to be checked; n is the total number of similar sentences counted after sentence similarity calculation; and m is the total number of sentences obtained after the document is split into sentences. Because n is always less than or equal to m, the result lies in the interval [0%, 100%]: when the number of similar sentences equals the total number of sentences (n = m), the calculated document similarity is 100%; when the number of similar sentences is 0, the calculated document similarity is 0%.
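
The overall formula SimAll = n/m × 100% can be written directly. The function name `document_similarity` and the input checks are assumptions added for the sketch; the formula and its bounds come from the description above.

```python
def document_similarity(similar_count, total_sentences):
    """SimAll = n / m * 100, where n is the number of similar sentences
    and m is the total number of sentences in the checked document."""
    if total_sentences <= 0:
        raise ValueError("the document must contain at least one sentence")
    if not 0 <= similar_count <= total_sentences:
        raise ValueError("n must satisfy 0 <= n <= m")
    return similar_count / total_sentences * 100.0


if __name__ == "__main__":
    print(document_similarity(5, 20))  # 5 similar sentences out of 20
```

For example, 5 similar sentences out of 20 gives 25%, and the two boundary cases n = 0 and n = m give 0% and 100% as stated.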

Further, the database module 3 is a paper database module.

Further, the input data module 1 and the data register module 2 are operated by an administrator.

Further, the screening module 4 is used to obtain the papers related to the paper to be checked for duplication, filtering out the majority of papers.

Further, the relevant information of the paper to be checked for duplication includes the subject, specialty, thesis title, paper keywords, and similar content.

Further, the computing module 8 is used to calculate the sentence similarity of papers.

Further, the text pretreatment module 81 splits the text into sentences, and the resulting sentences are arranged and saved in order of their position in the text.

Those skilled in the art will appreciate that, although several exemplary embodiments of the present invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can be directly determined or derived from the present disclosure without departing from the spirit and scope of the invention. The scope of the present invention should therefore be understood and deemed to cover all such other variations or modifications.

Claims

1. A system for data similarity detection, comprising: an input data module (1), a data register module (2), a database module (3), a screening module (4), a pretreatment module (5), a pre-detection module (6), a result output module (7), and a computing module (8);

The input data module is connected with the data register module (2);

The data register module (2) is connected with the database (3);

The database module (3) is connected with the screening module (4);

The screening module (4) is connected with the pretreatment module (5); the screening module (4) is connected with the computing module (8);

The pretreatment module (5) is connected with the pre-detection module (6);

The computing module (8) is connected with the output module (7);

The computing module (8) includes a text pretreatment module (81), a Chinese word segmentation module (82), an improved similarity calculation module (83), and text similarity computing (84);

The text pretreatment module (81) is connected with the Chinese word segmentation module (82);

The Chinese word segmentation module (82) is connected with the improved similarity calculation module (83);

The improved similarity calculation module (83) is connected with the text similarity computing (84).

2. The system for data similarity detection according to claim 1, characterized in that the database module (3) is a paper database module.

3. The system for data similarity detection according to claim 1, characterized in that the input data module (1) and the data register module (2) are operated by an administrator.

4. The system for data similarity detection according to claim 1, characterized in that the screening module (4) is used to obtain the papers related to the paper to be checked for duplication, filtering out the majority of papers.

5. The system for data similarity detection according to claim 4, characterized in that the relevant information of the paper to be checked for duplication includes the subject, specialty, thesis title, paper keywords, and similar content.

6. The system for data similarity detection according to claim 1, characterized in that the computing module (8) is used to calculate the sentence similarity of papers.

7. The system for data similarity detection according to claim 1, characterized in that the computing module (8) is used to calculate the sentence similarity of papers.

8. The system for data similarity detection according to claim 1, characterized in that the text pretreatment module (81) splits the text into sentences, and the resulting sentences are arranged and saved in order of their position in the text.