CN107885706A - A kind of system of data similarity detection - Google Patents
- Publication number
- CN107885706A (application CN201711077910.XA)
- Authority
- CN
- China
- Prior art keywords
- module
- similarity
- paper
- data
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a data similarity detection system comprising an input data module, a data registration module, a database module, a screening module, a preprocessing module, a pre-detection module, a result output module and a computing module. Through the coordinated operation of these modules, the system aims to curb the increasingly serious problem of thesis plagiarism at colleges and universities, increase the attention undergraduates give to their graduation theses, improve the originality and initiative that graduating students bring to their theses, and prevent excessive quotation of other people's work.
Description
Technical field
The present invention relates to the field of data detection, and more particularly to a data similarity detection system.
Background
The comprehensive emergence of network information technology, communication technology, digital libraries and digital distribution media has provided the conditions for information resources to be shared and disseminated efficiently. These emerging network resources have become a medium for storing information resources in contemporary society. Because most online information resources are stored on open, shared platforms, people can obtain and disseminate this information more conveniently, quickly and effectively. At the same time, this provides an ideal hotbed for plagiarists, making plagiarism simpler and more convenient. News reports about suspected plagiarism and appropriation of the fruits of other people's labour are now commonplace. Such conduct violates moral norms and may even constitute a crime, and in an era of strict intellectual property protection it has provoked public condemnation and reflection. Plagiarism has a significant negative impact on society: like a moth eating away at it, it limits the sound development of scientific knowledge and technical research in China, and it has attracted public concern and attention.
The serious consequences of plagiarism are self-evident. First, plagiarism infringes the author's moral rights in the work. Moral rights are the exclusive rights that the law grants an author in an original work, chiefly protecting the author's right of publication and right of revision. They affirm and protect the fruits of researchers' labour, thereby encouraging more people to pursue innovative research in knowledge and technology and promoting scientific and technological progress. Second, plagiarism infringes the author's property rights in the work. A work is the crystallization of the author's hard creative effort and naturally deserves corresponding remuneration from society; that remuneration is also a recognition of the value of the author's contribution and a force that drives society forward.
As for plagiarism, while exposing and condemning it, we should also strengthen prevention through sound mechanisms, so that digital products already found to involve copying are not published in digital media.
At the same time, thesis plagiarism at colleges and universities is becoming increasingly serious. Undergraduates do not pay enough attention to their graduation theses, which seriously affects the initiative, innovation and authenticity of graduates' work. A data similarity detection system is therefore urgently needed to improve the innovation that graduating students bring to their theses and to prevent excessive quotation of and borrowing from other people's work.
Summary of the invention
An object of the present invention is to provide a data similarity detection system that curbs the increasingly serious problem of thesis plagiarism at colleges and universities, increases the attention undergraduates give to their graduation theses, improves the originality and initiative that graduating students bring to their theses, and prevents excessive quotation of other people's work.
A data similarity detection system comprises an input data module, a data registration module, a database module, a screening module, a preprocessing module, a pre-detection module, a result output module and a computing module;
The input data module is connected to the data registration module;
The data registration module is connected to the database module;
The database module is connected to the screening module;
The screening module is connected to the preprocessing module; the screening module is also connected to the computing module;
The preprocessing module is connected to the pre-detection module;
The computing module is connected to the result output module;
The computing module comprises a text preprocessing module, a Chinese word segmentation module, an improved similarity calculation module and a text similarity calculation module;
The text preprocessing module is connected to the Chinese word segmentation module;
The Chinese word segmentation module is connected to the improved similarity calculation module;
The improved similarity calculation module is connected to the text similarity calculation module.
Specifically, the database module is a paper database module.
Specifically, the input data module and the data registration module are operated by an administrator.
Specifically, the screening module is used to obtain the papers related to the paper to be checked, filtering out most of the other papers.
Specifically, the papers related to the paper to be checked are identified by content such as subject, major, thesis title and thesis keywords.
Specifically, the computing module is used to calculate the sentence similarity of papers.
Specifically, the text preprocessing module splits the text into sentences, and the resulting sentences are arranged and saved in the order in which they appear in the text.
As can be seen from the above technical solution, the data similarity detection system provided by the invention comprises an input data module, a data registration module, a database module, a screening module, a preprocessing module, a pre-detection module, a result output module and a computing module. Through the coordinated operation of these modules, it curbs the increasingly serious problem of thesis plagiarism at colleges and universities, increases the attention undergraduates give to their graduation theses, improves the originality and initiative that graduating students bring to their theses, and prevents excessive quotation of other people's work.
Brief description of the drawings
Some specific embodiments of the invention are described in detail below, by way of example and not limitation, with reference to the accompanying drawings. The same reference numerals denote the same or similar components or parts in the drawings. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale.
Fig. 1 is a structural diagram of a data similarity detection system according to an embodiment of the present application.
Fig. 2 is a structural diagram of the computing module of a data similarity detection system according to an embodiment of the present application.
Reference numerals: 1, input data module; 2, data registration module; 3, database module; 4, screening module; 5, preprocessing module; 6, pre-detection module; 7, result output module; 8, computing module; 81, text preprocessing module; 82, Chinese word segmentation module; 83, improved similarity calculation module; 84, text similarity calculation module.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1 and Fig. 2, a data similarity detection system comprises an input data module 1, a data registration module 2, a database module 3, a screening module 4, a preprocessing module 5, a pre-detection module 6, a result output module 7 and a computing module 8;
The input data module 1 is connected to the data registration module 2;
The data registration module 2 is connected to the database module 3;
The database module 3 is connected to the screening module 4;
The screening module 4 is connected to the preprocessing module 5; the screening module 4 is also connected to the computing module 8;
The preprocessing module 5 is connected to the pre-detection module 6;
The computing module 8 is connected to the result output module 7;
The computing module 8 comprises a text preprocessing module 81, a Chinese word segmentation module 82, an improved similarity calculation module 83 and a text similarity calculation module 84;
The text preprocessing module 81 is connected to the Chinese word segmentation module 82;
The Chinese word segmentation module 82 is connected to the improved similarity calculation module 83;
The improved similarity calculation module 83 is connected to the text similarity calculation module 84.
Only an administrator has permission to use the input data module 1 and the data registration module 2 to import papers into the database and build the local paper database.
The data in the database module 3 consist of, for each paper, its subject, major, thesis title, thesis keywords, original text, and the text produced by sentence splitting.
The main purpose of the screening module 4 is to filter out most of the papers stored in the back-end data and obtain only the papers related to the document being checked, thereby improving the running efficiency of the system. The core of this module works as follows.
Step 1: Extract the subject, major, thesis title and thesis keywords of the paper to be checked and assign them the weights α, β, γ and δ respectively; then select from the database the papers that belong to the same subject.
Step 2: Segment these four strings with the ICTCLAS Chinese word segmentation system and extract their feature words; then compute the similarity of those feature words against the subject, major, thesis title and thesis keywords of each paper selected in Step 1, giving the similarities SimA, SimB, SimC and SimD respectively.
Step 3: Weight the four similarities to obtain their total similarity, using the formula Sim(A, B, C, D) = α × SimA + β × SimB + γ × SimC + δ × SimD. When the total similarity Sim(A, B, C, D) exceeds the given threshold of 0.7, the two papers are considered possibly similar; the paper text is then split into sentences, sentence similarity is calculated, and the similar sentences are recorded and saved. Here the subject weight α = 0.2, the major weight β = 0.1, the thesis title weight γ = 0.4 and the keyword weight δ = 0.3.
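The weighted screening step can be illustrated with a minimal Python sketch. The weights and the 0.7 threshold are those given above; the function names, the dictionary keys and the example per-field similarity values are assumptions made for illustration only, and how each per-field similarity is computed is left open here.

```python
# Weights (alpha, beta, gamma, delta) and threshold as stated in the description.
WEIGHTS = {"subject": 0.2, "major": 0.1, "title": 0.4, "keywords": 0.3}
THRESHOLD = 0.7


def total_similarity(field_sims):
    """Return alpha*SimA + beta*SimB + gamma*SimC + delta*SimD."""
    return sum(WEIGHTS[name] * field_sims[name] for name in WEIGHTS)


def is_candidate(field_sims, threshold=THRESHOLD):
    """True when the weighted total exceeds the screening threshold."""
    return total_similarity(field_sims) > threshold


# Hypothetical per-field similarities for one stored paper.
example = {"subject": 1.0, "major": 0.5, "title": 0.8, "keywords": 0.6}
print(round(total_similarity(example), 2), is_candidate(example))
```

Because the title and keyword weights together account for 0.7 of the total, a stored paper can only pass the screen if its title and keywords are reasonably close to those of the paper being checked.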
The text preprocessing module 81 mainly splits the text into sentences; the resulting sentences are arranged and saved in the order in which they appear in the text, ready for subsequent processing.
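A sentence splitter of this kind might look like the following Python sketch. The source does not say which punctuation marks end a sentence, so the set of end marks below is an assumption, as are the function and variable names.

```python
import re

# Assumed sentence-ending marks (Chinese and ASCII); not specified in the source.
_END_MARKS = "。！？!?；;"
_SENTENCE = re.compile("[^" + _END_MARKS + "]+[" + _END_MARKS + "]?")


def split_sentences(text):
    """Split text into sentences and pair each with its position in the text."""
    pieces = _SENTENCE.findall(text)
    sentences = [p.strip() for p in pieces if p.strip()]
    return list(enumerate(sentences))


for position, sentence in split_sentences("今天天气很好。我们去公园吧！你觉得呢？"):
    print(position, sentence)
```

Keeping the position alongside each sentence preserves the original order, which is what later steps need when similar sentences are reported.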
The Chinese word segmentation module 82 prepares sentences for the similarity calculation of Chinese documents. Non-Chinese characters that appear in the sentences, such as special symbols, English letters and digits, can optionally be filtered out. The sentences are then segmented with the ICTCLAS Chinese word segmentation system, and stop words are removed after segmentation.
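The filter, segment and stop-word steps can be sketched in Python as follows. ICTCLAS is not used here: the `toy_segment` function is a simple forward-maximum-matching stand-in, and the vocabulary, the stop-word list and all function names are hypothetical, chosen only to make the example self-contained.

```python
import re

# Anything outside the basic CJK Unified Ideographs range is treated as non-Chinese.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")
STOP_WORDS = {"的", "了", "是"}  # tiny illustrative list, not from the source


def keep_chinese(sentence):
    """Drop symbols, English letters, digits and other non-Chinese characters."""
    return _NON_CHINESE.sub("", sentence)


def toy_segment(text, vocabulary, max_len=4):
    """Forward maximum matching; a stand-in for a real segmenter such as ICTCLAS."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in vocabulary:
                tokens.append(word)
                i += size
                break
    return tokens


def preprocess(sentence, vocabulary, stop_words=STOP_WORDS):
    """Filter non-Chinese characters, segment, then remove stop words."""
    tokens = toy_segment(keep_chinese(sentence), vocabulary)
    return [t for t in tokens if t not in stop_words]


vocab = {"论文", "相似度", "检测"}
print(preprocess("论文的相似度检测 (2017, v2)!", vocab))
```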
The improved similarity calculation module 83 calculates the similarity between sentences using a method that combines a semantic dependency tree with an improved edit distance.
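The source does not spell out the improved edit distance or how it is combined with the semantic dependency tree. As a point of reference only, the sketch below shows the plain word-level Levenshtein distance and one common way to normalise it into a similarity in [0, 1]; it is not the improved algorithm described here, and the function names and the normalisation are assumptions.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # delete token_a
                               current[j - 1] + 1,       # insert token_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def sentence_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical token sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty sentences count as identical
    return 1.0 - edit_distance(a, b) / longest


print(sentence_similarity(["论文", "相似度", "检测"], ["论文", "相似度", "计算"]))
```

Working on segmented words rather than single characters means that replacing one word with another counts as a single edit, regardless of the word's length.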
The text similarity calculation module 84 divides the document into sentences and then obtains the similarity of sentences using the algorithm based on the semantic dependency tree and the improved edit distance. When the similarity of two sentences exceeds the threshold, the two sentences are considered similar. The number of similar sentences is then counted as a percentage of the total number of sentences into which the text under check was divided, which gives the similarity of the text, and the check result is output as an Excel spreadsheet. The overall similarity of the document to be checked is calculated as SimAll = n/m × 100%, where SimAll is the text similarity of the document to be checked, n is the number of similar sentences counted after the sentence similarity calculation, and m is the total number of sentences obtained after sentence splitting. Because n is always less than or equal to m, the result lies in the interval [0%, 100%]: when the number of similar sentences equals the total number of sentences (n = m), the document similarity is 100%; when the number of similar sentences is 0, the document similarity is 0%.
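The SimAll formula can be sketched in a few lines of Python. The sentence-level threshold is not given in the source, so it is a required parameter here and the value 0.8 in the example is purely hypothetical; the handling of an empty document (m = 0) is likewise an assumption.

```python
def flag_similar(sentence_scores, threshold):
    """Mark each sentence as similar when its best score exceeds the threshold."""
    return [score > threshold for score in sentence_scores]


def document_similarity(similar_flags):
    """SimAll = n / m * 100, with n similar sentences out of m in total."""
    m = len(similar_flags)
    if m == 0:
        return 0.0  # assumption: a document with no sentences scores 0%
    n = sum(1 for flag in similar_flags if flag)
    return n / m * 100.0


# Hypothetical best-match scores for the four sentences of a checked document.
scores = [0.9, 0.2, 0.85, 0.5]
print(document_similarity(flag_similar(scores, threshold=0.8)))
```

Since n counts a subset of the m sentences, the result cannot leave the range from 0 to 100, which matches the interval stated above.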
An object of the present invention is to provide a data similarity detection system that curbs the increasingly serious problem of thesis plagiarism at colleges and universities, increases the attention undergraduates give to their graduation theses, improves the originality and initiative that graduating students bring to their theses, and prevents excessive quotation of other people's work.
Further, the database module 3 is a paper database module.
Further, the input data module 1 and the data registration module 2 are operated by an administrator.
Further, the screening module 4 is used to obtain the papers related to the paper to be checked, filtering out most of the other papers.
Further, the papers related to the paper to be checked are identified by content such as subject, major, thesis title and thesis keywords.
Further, the computing module 8 is used to calculate the sentence similarity of papers.
Further, the text preprocessing module 81 splits the text into sentences, and the resulting sentences are arranged and saved in the order in which they appear in the text.
As can be seen from the above technical solution, the data similarity detection system provided by the invention comprises an input data module, a data registration module, a database module, a screening module, a preprocessing module, a pre-detection module, a result output module and a computing module. Through the coordinated operation of these modules, it curbs the increasingly serious problem of thesis plagiarism at colleges and universities, increases the attention undergraduates give to their graduation theses, improves the originality and initiative that graduating students bring to their theses, and prevents excessive quotation of other people's work.
Those skilled in the art will appreciate that, although several exemplary embodiments of the invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can be directly determined or derived from this disclosure without departing from the spirit and scope of the invention. The scope of the invention should therefore be understood and deemed to cover all such other variations or modifications.
Claims (8)
1. A data similarity detection system, comprising: an input data module (1), a data registration module (2), a database module (3), a screening module (4), a preprocessing module (5), a pre-detection module (6), a result output module (7) and a computing module (8);
the input data module (1) is connected to the data registration module (2);
the data registration module (2) is connected to the database module (3);
the database module (3) is connected to the screening module (4);
the screening module (4) is connected to the preprocessing module (5); the screening module (4) is also connected to the computing module (8);
the preprocessing module (5) is connected to the pre-detection module (6);
the computing module (8) is connected to the result output module (7);
the computing module (8) comprises a text preprocessing module (81), a Chinese word segmentation module (82), an improved similarity calculation module (83) and a text similarity calculation module (84);
the text preprocessing module (81) is connected to the Chinese word segmentation module (82);
the Chinese word segmentation module (82) is connected to the improved similarity calculation module (83);
the improved similarity calculation module (83) is connected to the text similarity calculation module (84).
2. The data similarity detection system according to claim 1, characterized in that the database module (3) is a paper database module.
3. The data similarity detection system according to claim 1, characterized in that the input data module (1) and the data registration module (2) are operated by an administrator.
4. The data similarity detection system according to claim 1, characterized in that the screening module (4) is used to obtain the papers related to the paper to be checked, filtering out most of the other papers.
5. The data similarity detection system according to claim 4, characterized in that the papers related to the paper to be checked are identified by content such as subject, major, thesis title and thesis keywords.
6. The data similarity detection system according to claim 1, characterized in that the computing module (8) is used to calculate the sentence similarity of papers.
7. The data similarity detection system according to claim 1, characterized in that the computing module (8) is used to calculate the sentence similarity of papers.
8. The data similarity detection system according to claim 1, characterized in that the text preprocessing module (81) splits the text into sentences, and the resulting sentences are arranged and saved in the order in which they appear in the text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711077910.XA CN107885706A (en) | 2017-11-06 | 2017-11-06 | A kind of system of data similarity detection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711077910.XA CN107885706A (en) | 2017-11-06 | 2017-11-06 | A kind of system of data similarity detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107885706A true CN107885706A (en) | 2018-04-06 |
Family
ID=61778804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711077910.XA Pending CN107885706A (en) | 2017-11-06 | 2017-11-06 | A kind of system of data similarity detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107885706A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685471A (en) * | 2018-12-25 | 2019-04-26 | 胡森博 | A kind of multilingual paper intelligence auditing system |
CN111353031A (en) * | 2020-02-27 | 2020-06-30 | 海南谊之脉科技有限公司 | Thesis management method, server and system based on big data |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100788440B1 (en) * | 2006-06-29 | 2007-12-24 | 중앙대학교 산학협력단 | A document copy detection system based on plagiarism patterns |
CN101369279A (en) * | 2008-09-19 | 2009-02-18 | 江苏大学 | Detection method for academic dissertation similarity based on computer searching system |
CN101980196A (en) * | 2010-10-25 | 2011-02-23 | 中国农业大学 | Article comparison method and device |
CN105302793A (en) * | 2015-10-21 | 2016-02-03 | 南方电网科学研究院有限责任公司 | Method for automatically evaluating scientific and technical literature novelty by utilizing computer |
CN105701085A (en) * | 2016-01-13 | 2016-06-22 | 湖南通远网络科技有限公司 | Network duplicate checking method and system |
CN105701076A (en) * | 2016-01-13 | 2016-06-22 | 湖南通远网络科技有限公司 | Thesis plagiarism detection method and system |
CN106372202A (en) * | 2016-08-31 | 2017-02-01 | 北京奇艺世纪科技有限公司 | Text similarity calculation method and device |
CN106528507A (en) * | 2016-10-25 | 2017-03-22 | 中南林业科技大学 | Chinese text similarity detection method and device |
-
2017
- 2017-11-06 CN CN201711077910.XA patent/CN107885706A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685471A (en) * | 2018-12-25 | 2019-04-26 | 胡森博 | A kind of multilingual paper intelligence auditing system |
CN111353031A (en) * | 2020-02-27 | 2020-06-30 | 海南谊之脉科技有限公司 | Thesis management method, server and system based on big data |
CN111353031B (en) * | 2020-02-27 | 2023-04-14 | 海南谊之脉科技有限公司 | Thesis management method, server and system based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Taj et al. | Sentiment analysis of news articles: a lexicon based approach | |
Ray et al. | Twitter sentiment analysis for product review using lexicon method | |
CN106708966A (en) | Similarity calculation-based junk comment detection method | |
CN103631859A (en) | Intelligent review expert recommending method for science and technology projects | |
CN110019792A (en) | File classification method and device and sorter model training method | |
CN110781679B (en) | News event keyword mining method based on associated semantic chain network | |
Gupta et al. | Leveraging transfer learning techniques-bert, roberta, albert and distilbert for fake review detection | |
CN104915443A (en) | Extraction method of Chinese Microblog evaluation object | |
CN106202065A (en) | A kind of across language topic detecting method and system | |
Jusoh et al. | Applying fuzzy sets for opinion mining | |
CN107885706A (en) | A kind of system of data similarity detection | |
US20140143253A1 (en) | Stochastic document clustering using rare features | |
CN104077274B (en) | Method and device for extracting hot word phrases from document set | |
CN105117466A (en) | Internet information screening system and method | |
CN105869058A (en) | Method for user portrait extraction based on multilayer latent variable model | |
Ennaji et al. | Social intelligence framework: Extracting and analyzing opinions for social CRM | |
CN112989791A (en) | Duplication eliminating method, system and medium based on text information extraction result | |
Sadman et al. | Understanding the pandemic through mining covid news using natural language processing | |
Sinno et al. | Political ideology and polarization of policy positions: A multi-dimensional approach | |
Marchi et al. | Assessing online sustainability communication of Italian cultural destinations–a web content mining approach | |
Priya et al. | Entity resolution for high velocity streams using semantic measures | |
Kirn et al. | Ridge count thresholding to uncover coordinated networks during onset of the Covid-19 pandemic | |
Ha et al. | Lifelong learning for cross-domain vietnamese sentiment classification | |
Verma | Opinion Mining On Rural Tourism In India-Qualitative Perspective | |
Pirnau et al. | Analysis of the Energy Crisis in the Content of Users' Posts on Twitter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180406 |