CN107885706A - A kind of system of data similarity detection - Google Patents

A kind of system of data similarity detection

Info

Publication number
CN107885706A
Authority
CN
China
Prior art keywords
module
similarity
paper
data
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711077910.XA
Other languages
Chinese (zh)
Inventor
崔垒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Zhangyang Technology Co Ltd
Original Assignee
Foshan Zhangyang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Zhangyang Technology Co Ltd
Priority to CN201711077910.XA
Publication of CN107885706A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a system for data similarity detection, comprising an input data module, a data register module, a database module, a screening module, a pretreatment module, a pre-detection module, a result output module, and a computing module. Through the cooperation of these modules, the system curbs the increasingly serious problem of thesis plagiarism at colleges and universities, raises the importance undergraduates attach to the graduation thesis, improves graduates' innovation and initiative in writing the thesis, and helps prevent excessive citation of other people's documents.

Description

A kind of system of data similarity detection
Technical field
The present invention relates to the field of data detection, and more particularly to a system for data similarity detection.
Background
The rise of network information technology, communication technology, digital libraries, and digital distributed media has provided the necessary conditions for information resources to be shared and disseminated efficiently. These emerging network resources have become a medium for storing information resources in contemporary society. Because most online information resources are stored on open, shared platforms, people can obtain and disseminate this information more conveniently, quickly, and effectively. At the same time, however, this provides an ideal hotbed for plagiarists, making plagiarism simpler and more convenient. News reports about suspected plagiarism and misappropriation of the fruits of other people's labor are now commonplace. Such conduct violates moral norms and may even constitute a crime, and in an era of strict intellectual property protection it has provoked condemnation and reflection. Plagiarism has a significant negative impact on society: like a moth eating away at society, it restricts the healthy development of scientific knowledge and technical research in China, and it has drawn the concern and attention of society. The serious consequences of plagiarism are self-evident. First, plagiarism infringes the author's moral rights. Moral rights of copyright are the exclusive rights that the law grants an author in an original work, chiefly protecting the author's right of publication and right of revision. They affirm and protect the fruits of researchers' labor, thereby encouraging more people to pursue innovative research in knowledge and technology and promoting scientific and technological progress. Second, plagiarism also infringes the author's property rights. A work is the crystallization of the author's hard creative effort and should naturally earn corresponding remuneration from society; that remuneration is also a recognition of the value of the author's contribution and a force that drives society forward.
While exposing and condemning plagiarism, we should also strengthen prevention through sound mechanisms, so that plagiarized digital works are not published in digital media.
At the same time, thesis plagiarism at colleges and universities is becoming increasingly serious, and undergraduates do not take the graduation thesis seriously enough, which seriously affects graduates' initiative, innovation, and authenticity. A system for data similarity detection is therefore urgently needed to improve graduates' capacity for innovation in the graduation thesis and to prevent excessive citation of and borrowing from other people's documents.
Summary of the invention
An object of the present invention is to provide a system for data similarity detection that curbs the increasingly serious problem of thesis plagiarism at colleges and universities, raises the importance undergraduates attach to the graduation thesis, improves graduates' innovation and initiative in writing the thesis, and helps prevent excessive citation of other people's documents.
A system for data similarity detection comprises an input data module, a data register module, a database module, a screening module, a pretreatment module, a pre-detection module, a result output module, and a computing module;
The input data module is connected with the data register module;
The data register module is connected with the database module;
The database module is connected with the screening module;
The screening module is connected with the pretreatment module; the screening module is also connected with the computing module;
The pretreatment module is connected with the pre-detection module;
The computing module is connected with the result output module;
The computing module includes a text pretreatment module, a Chinese word segmentation module, an improved similarity calculation module, and a text similarity computing module;
The text pretreatment module is connected with the Chinese word segmentation module;
The Chinese word segmentation module is connected with the improved similarity calculation module;
The improved similarity calculation module is connected with the text similarity computing module.
Specifically, the database module is a paper database module.
Specifically, the input data module and the data register module are operated by an administrator.
Specifically, the screening module is used to obtain the papers related to the paper to be checked for duplication, filtering out most papers.
Specifically, the papers related to the paper to be checked include content such as subject, major, thesis title, and paper keywords.
Specifically, the computing module is used to calculate the sentence similarity of papers.
Specifically, the text pretreatment module splits the text into sentences, and the resulting sentences are arranged and saved in the order of their position in the text.
As can be seen from the above technical solution, the system for data similarity detection provided by the invention comprises an input data module, a data register module, a database module, a screening module, a pretreatment module, a pre-detection module, a result output module, and a computing module. Through the cooperation of these modules, the system curbs the increasingly serious problem of thesis plagiarism at colleges and universities, raises the importance undergraduates attach to the graduation thesis, improves graduates' innovation and initiative in writing the thesis, and helps prevent excessive citation of other people's documents.
Brief description of the drawings
Some specific embodiments of the present invention are described in detail below, by way of example and not limitation, with reference to the accompanying drawings. The same reference numerals denote the same or similar components or parts in the drawings. Those skilled in the art should understand that the drawings are not necessarily drawn to scale.
Fig. 1 is a structural diagram of a system for data similarity detection according to an embodiment of the present application.
Fig. 2 is a structural diagram of the computing module of a system for data similarity detection according to an embodiment of the present application.
Description of reference numerals: 1, input data module; 2, data register module; 3, database module; 4, screening module; 5, pretreatment module; 6, pre-detection module; 7, result output module; 8, computing module; 81, text pretreatment module; 82, Chinese word segmentation module; 83, improved similarity calculation module; 84, text similarity computing module.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1 and Fig. 2, a system for data similarity detection comprises an input data module 1, a data register module 2, a database module 3, a screening module 4, a pretreatment module 5, a pre-detection module 6, a result output module 7, and a computing module 8;
The input data module 1 is connected with the data register module 2;
The data register module 2 is connected with the database module 3;
The database module 3 is connected with the screening module 4;
The screening module 4 is connected with the pretreatment module 5; the screening module 4 is also connected with the computing module 8;
The pretreatment module 5 is connected with the pre-detection module 6;
The computing module 8 is connected with the result output module 7;
The computing module 8 includes a text pretreatment module 81, a Chinese word segmentation module 82, an improved similarity calculation module 83, and a text similarity computing module 84;
The text pretreatment module 81 is connected with the Chinese word segmentation module 82;
The Chinese word segmentation module 82 is connected with the improved similarity calculation module 83;
The improved similarity calculation module 83 is connected with the text similarity computing module 84.
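The connections listed above can be written down as a small directed graph, which makes it easy to check which modules lie downstream of another. This is only an illustrative sketch: the description says the modules are "connected" without stating a direction, so the left-to-right direction used here, the dictionary layout, and the helper name `reachable` are all assumptions.

```python
# Module connections as listed in the description.
# ASSUMPTION: each "A is connected with B" is read as data flowing from A to B.
CONNECTIONS = {
    "input data module (1)": ["data register module (2)"],
    "data register module (2)": ["database module (3)"],
    "database module (3)": ["screening module (4)"],
    "screening module (4)": ["pretreatment module (5)", "computing module (8)"],
    "pretreatment module (5)": ["pre-detection module (6)"],
    "computing module (8)": ["result output module (7)"],
}


def reachable(start):
    """Return every module reachable from `start`, in breadth-first order."""
    seen = []
    queue = [start]
    while queue:
        node = queue.pop(0)
        for nxt in CONNECTIONS.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    for name in reachable("input data module (1)"):
        print(name)
```

Under this reading, every other top-level module is downstream of the input data module, and the result output module is reached only through the computing module.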
For the input data module 1 and the data register module 2, only the administrator has permission to import papers into the database and so build the local paper database.
In the database module 3, the data for each paper consist of the subject, major, thesis title, paper keywords, original text of the paper, and the text of the paper after sentence splitting.
The main purpose of the screening module 4 is to filter out most of the papers in the back-end data and retain only the papers related to the document being checked, thereby improving the running efficiency of the system. The core of this module is as follows.
Step 1: Extract the subject, major, thesis title, and paper keywords of the paper to be checked and assign them the weights α, β, γ, and δ respectively; then select from the database the papers with the same subject.
Step 2: Segment the above four character strings with the ICTCLAS Chinese word segmentation system and extract their feature words; then compute the similarity between these feature words and the subject, major, thesis title, and paper keywords of each paper selected in Step 1, obtaining the similarities SimA, SimB, SimC, and SimD respectively.
Step 3: Weight the four similarities above to obtain their total similarity, using the formula Sim(A, B, C, D) = α × SimA + β × SimB + γ × SimC + δ × SimD. When the total similarity Sim(A, B, C, D) exceeds the given threshold of 0.7, the two papers are considered possibly similar; the text of the paper is then split into sentences, sentence similarity is calculated, and similar sentences are recorded and saved. Here the subject weight α = 0.2, the major weight β = 0.1, the thesis title weight γ = 0.4, and the keyword weight δ = 0.3.
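The weighted screening step can be illustrated with a short sketch. The weights and the 0.7 threshold are the values given in the description; the function names, the dictionary keys, and the example similarity values are hypothetical, and the per-field similarities are taken as already computed, since the description does not say how SimA to SimD are obtained.

```python
# Weights from the description: subject 0.2, major 0.1, title 0.4, keywords 0.3.
WEIGHTS = {"subject": 0.2, "major": 0.1, "title": 0.4, "keywords": 0.3}
THRESHOLD = 0.7  # total-similarity threshold given in the description


def total_similarity(sims):
    """Sim(A, B, C, D) = alpha*SimA + beta*SimB + gamma*SimC + delta*SimD."""
    return sum(WEIGHTS[field] * sims[field] for field in WEIGHTS)


def is_candidate(sims):
    """A stored paper is kept for sentence-level checking above the threshold."""
    return total_similarity(sims) > THRESHOLD


if __name__ == "__main__":
    # Hypothetical per-field similarities for one stored paper.
    example = {"subject": 1.0, "major": 0.5, "title": 0.8, "keywords": 0.6}
    print(round(total_similarity(example), 4), is_candidate(example))
```

Because the weights sum to 1, the total similarity stays in [0, 1] whenever each field similarity does, which is what makes a fixed threshold such as 0.7 meaningful.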
The text pretreatment module 81 mainly splits the text into sentences; the resulting sentences are arranged and saved in the order of their position in the text, ready for the subsequent processing steps.
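A minimal sketch of order-preserving sentence splitting is given below. The description does not say which delimiters are used, so the set of sentence-ending punctuation marks here (。！？；) and the function name `split_sentences` are assumptions.

```python
import re

# ASSUMPTION: a sentence ends at one of these Chinese punctuation marks.
_SENTENCE = re.compile(r"[^。！？；]+[。！？；]?")


def split_sentences(text):
    """Split text into sentences and keep each sentence's position index."""
    pieces = (piece.strip() for piece in _SENTENCE.findall(text))
    sentences = [piece for piece in pieces if piece]
    # The index records the order in which the sentence appears in the text.
    return list(enumerate(sentences))


if __name__ == "__main__":
    for position, sentence in split_sentences("今天下雨。明天晴天！后天呢"):
        print(position, sentence)
```

Storing the position alongside each sentence keeps the original order recoverable after later steps that may process sentences independently.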
In the Chinese word segmentation module 82, because what is calculated is the similarity of Chinese documents, special symbols, English letters, digits, and other non-Chinese characters that appear in the sentences can optionally be filtered out. The sentences are then segmented with the ICTCLAS Chinese word segmentation system, and stop words are removed after segmentation.
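The filtering and stop-word steps can be sketched as follows. This sketch does not call ICTCLAS: the segmented tokens are supplied by hand, the stop-word list is a tiny illustrative sample, and "Chinese character" is approximated by the CJK Unified Ideographs range U+4E00 to U+9FFF. All three are assumptions made for illustration.

```python
import re

# ASSUMPTION: anything outside U+4E00..U+9FFF counts as a non-Chinese character.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")

# ASSUMPTION: a tiny illustrative stop-word list, not the one used by the system.
STOP_WORDS = {"的", "了", "是", "在", "和"}


def keep_chinese(sentence):
    """Drop symbols, Latin letters, digits, and punctuation from a sentence."""
    return _NON_CHINESE.sub("", sentence)


def remove_stop_words(tokens):
    """Remove stop words from an already segmented sentence."""
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(keep_chinese("本文提出了ABC方法，共3步！"))
    # Tokens as a word segmenter might return them (written by hand here).
    print(remove_stop_words(["本文", "提出", "了", "方法"]))
```

Filtering before segmentation keeps the segmenter's input purely Chinese, at the cost of discarding formulas, identifiers, and numbers that may themselves be evidence of copying.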
The improved similarity calculation module 83 calculates the similarity between sentences using a method that combines semantic dependency trees with an improved edit distance.
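The description does not spell out the improved edit distance or how it is combined with semantic dependency trees. As a point of reference only, the sketch below implements the standard Levenshtein edit distance over word tokens and normalizes it to a similarity in [0, 1]; it is not the improved method of module 83, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def sentence_similarity(tokens_a, tokens_b):
    """Normalize edit distance by the longer length: 1.0 means identical."""
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(tokens_a, tokens_b) / longest


if __name__ == "__main__":
    first = ["本文", "提出", "方法"]
    second = ["本文", "提出", "算法"]
    print(edit_distance(first, second), sentence_similarity(first, second))
```

Working on word tokens rather than characters means a single substituted word counts as one edit, which is usually closer to how a reader judges two sentences as near-copies.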
The text similarity computing module 84 divides the document into sentences and then obtains the similarity of sentences using the algorithm based on semantic dependency trees and improved edit distance. When the similarity of two sentences exceeds a threshold, the two sentences are considered similar. The module then computes the percentage of similar sentences among all sentences into which the text being checked was divided, which gives the similarity of the text, and outputs the check result as an Excel spreadsheet. The overall similarity of the document to be checked is calculated as SimAll = n/m × 100%, where SimAll is the text similarity of the document to be checked, n is the total number of similar sentences counted after sentence similarity calculation, and m is the total number of sentences obtained after splitting the document into sentences. Because n is always less than or equal to m, the result lies in the interval [0%, 100%]: when the number of similar sentences equals the total number of sentences (n = m), the calculated document similarity is 100%; when the number of similar sentences is 0, the calculated document similarity is 0%.
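The document-level formula SimAll = n/m × 100% can be sketched as below. The sentence-level threshold is not given in the description, so it is passed in as a parameter; the function name, the parameter names, and the pluggable `sentence_sim` callback are assumptions, and the usage example substitutes a trivial exact-match comparison for the real sentence similarity.

```python
def document_similarity(sentences, corpus_sentences, sentence_sim, threshold):
    """Return SimAll = n / m * 100 for the document being checked.

    m is the number of sentences in the checked document; n is how many of
    them have at least one corpus sentence whose similarity exceeds threshold.
    """
    m = len(sentences)
    if m == 0:
        return 0.0  # ASSUMPTION: an empty document is reported as 0% similar
    n = sum(
        1
        for sentence in sentences
        if any(sentence_sim(sentence, other) > threshold for other in corpus_sentences)
    )
    return n / m * 100.0


if __name__ == "__main__":
    # Toy comparison: 1.0 for identical sentences, 0.0 otherwise.
    def exact(a, b):
        return 1.0 if a == b else 0.0

    checked = ["句子一。", "句子二。", "句子三。", "句子四。"]
    stored = ["句子一。", "句子三。"]
    print(document_similarity(checked, stored, exact, threshold=0.5))
```

Each sentence of the checked document is counted at most once, which is what guarantees n ≤ m and keeps the result within [0%, 100%].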
An object of the present invention is to provide a system for data similarity detection that curbs the increasingly serious problem of thesis plagiarism at colleges and universities, raises the importance undergraduates attach to the graduation thesis, improves graduates' innovation and initiative in writing the thesis, and helps prevent excessive citation of other people's documents.
Further, the database module 3 is a paper database module.
Further, the input data module 1 and the data register module 2 are operated by an administrator.
Further, the screening module 4 is used to obtain the papers related to the paper to be checked for duplication, filtering out most papers.
Further, the papers related to the paper to be checked include content such as subject, major, thesis title, and paper keywords.
Further, the computing module 8 is used to calculate the sentence similarity of papers.
Further, the text pretreatment module 81 splits the text into sentences, and the resulting sentences are arranged and saved in the order of their position in the text.
As can be seen from the above technical solution, the system for data similarity detection provided by the invention comprises an input data module, a data register module, a database module, a screening module, a pretreatment module, a pre-detection module, a result output module, and a computing module. Through the cooperation of these modules, the system curbs the increasingly serious problem of thesis plagiarism at colleges and universities, raises the importance undergraduates attach to the graduation thesis, improves graduates' innovation and initiative in writing the thesis, and helps prevent excessive citation of other people's documents.
Thus, those skilled in the art will appreciate that, although multiple exemplary embodiments of the present invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can be directly determined or derived from the present disclosure without departing from the spirit and scope of the invention. The scope of the invention should therefore be understood and deemed to cover all such other variations or modifications.

Claims (8)

1. A system for data similarity detection, comprising: an input data module (1), a data register module (2), a database module (3), a screening module (4), a pretreatment module (5), a pre-detection module (6), a result output module (7), and a computing module (8);
The input data module (1) is connected with the data register module (2);
The data register module (2) is connected with the database module (3);
The database module (3) is connected with the screening module (4);
The screening module (4) is connected with the pretreatment module (5); the screening module (4) is also connected with the computing module (8);
The pretreatment module (5) is connected with the pre-detection module (6);
The computing module (8) is connected with the result output module (7);
The computing module (8) includes a text pretreatment module (81), a Chinese word segmentation module (82), an improved similarity calculation module (83), and a text similarity computing module (84);
The text pretreatment module (81) is connected with the Chinese word segmentation module (82);
The Chinese word segmentation module (82) is connected with the improved similarity calculation module (83);
The improved similarity calculation module (83) is connected with the text similarity computing module (84).
2. The system for data similarity detection according to claim 1, characterized in that the database module (3) is a paper database module.
3. The system for data similarity detection according to claim 1, characterized in that the input data module (1) and the data register module (2) are operated by an administrator.
4. The system for data similarity detection according to claim 1, characterized in that the screening module (4) is used to obtain the papers related to the paper to be checked for duplication, filtering out most papers.
5. The system for data similarity detection according to claim 4, characterized in that the papers related to the paper to be checked include content such as subject, major, thesis title, and paper keywords.
6. The system for data similarity detection according to claim 1, characterized in that the computing module (8) is used to calculate the sentence similarity of papers.
7. The system for data similarity detection according to claim 1, characterized in that the computing module (8) is used to calculate the sentence similarity of papers.
8. The system for data similarity detection according to claim 1, characterized in that the text pretreatment module (81) splits the text into sentences, and the resulting sentences are arranged and saved in the order of their position in the text.
CN201711077910.XA 2017-11-06 2017-11-06 A kind of system of data similarity detection Pending CN107885706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711077910.XA CN107885706A (en) 2017-11-06 2017-11-06 A kind of system of data similarity detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711077910.XA CN107885706A (en) 2017-11-06 2017-11-06 A kind of system of data similarity detection

Publications (1)

Publication Number Publication Date
CN107885706A true CN107885706A (en) 2018-04-06

Family

ID=61778804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711077910.XA Pending CN107885706A (en) 2017-11-06 2017-11-06 A kind of system of data similarity detection

Country Status (1)

Country Link
CN (1) CN107885706A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685471A (en) * 2018-12-25 2019-04-26 胡森博 A kind of multilingual paper intelligence auditing system
CN111353031A (en) * 2020-02-27 2020-06-30 海南谊之脉科技有限公司 Thesis management method, server and system based on big data

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100788440B1 (en) * 2006-06-29 2007-12-24 중앙대학교 산학협력단 A document copy detection system based on plagiarism patterns
CN101369279A (en) * 2008-09-19 2009-02-18 江苏大学 Detection method for academic dissertation similarity based on computer searching system
CN101980196A (en) * 2010-10-25 2011-02-23 中国农业大学 Article comparison method and device
CN105302793A (en) * 2015-10-21 2016-02-03 南方电网科学研究院有限责任公司 Method for automatically evaluating scientific and technical literature novelty by utilizing computer
CN105701085A (en) * 2016-01-13 2016-06-22 湖南通远网络科技有限公司 Network duplicate checking method and system
CN105701076A (en) * 2016-01-13 2016-06-22 湖南通远网络科技有限公司 Thesis plagiarism detection method and system
CN106372202A (en) * 2016-08-31 2017-02-01 北京奇艺世纪科技有限公司 Text similarity calculation method and device
CN106528507A (en) * 2016-10-25 2017-03-22 中南林业科技大学 Chinese text similarity detection method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685471A (en) * 2018-12-25 2019-04-26 胡森博 A kind of multilingual paper intelligence auditing system
CN111353031A (en) * 2020-02-27 2020-06-30 海南谊之脉科技有限公司 Thesis management method, server and system based on big data
CN111353031B (en) * 2020-02-27 2023-04-14 海南谊之脉科技有限公司 Thesis management method, server and system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180406