CN105701086A - Method and system for detecting literature through sliding window - Google Patents

Method and system for detecting literature through sliding window

Info

Publication number
CN105701086A
Authority
CN
China
Prior art keywords
participle
document
rwv
amount
eigenvalue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610020696.3A
Other languages
Chinese (zh)
Other versions
CN105701086B (en)
Inventor
夏峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201610020696.3A
Publication of CN105701086A
Application granted
Publication of CN105701086B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a sliding-window document detection method and system. The system comprises: a comparison library for collecting materials; a word library for collecting segmented words and their parts of speech; a word segmentation module for segmenting the materials; a word feature value generation module for generating word part-of-speech feature values; a word free vector dimension determination module for determining the word free vector dimension; a word simplified vector dimension generation module for generating the word simplified vector dimension; a word feature vector generation module for generating the word feature vector; a to-be-identified document word segmentation module for segmenting the document to be identified to obtain a word segmentation result; a to-be-identified document word free vector dimension determination module for determining the word free vector dimension of that document; a to-be-identified document word simplified vector dimension generation module for generating its word simplified vector dimension; and a to-be-identified document word feature vector generation module for generating its word feature vector. A similarity comparison is then carried out.

Description

Sliding-window document detection method and system
Technical field
The invention belongs to the field of text detection and in particular relates to a sliding-window document detection method and system.
Background
Paper plagiarism detection means determining whether a given paper is suspected of plagiarizing the text of one or more other documents. Plagiarism is not always verbatim copying, however: the text of other documents may also be appropriated through semantic transformation, synonym substitution, translation of foreign-language documents, and similar means.
At present, paper plagiarism detection relies mainly on two methods: fingerprint-based detection and detection based on paragraph word-frequency statistics. Fingerprint-based detection extracts data feature strings, called fingerprints, from the submitted source text and judges whether a document has plagiarized other documents according to the proportion of matching fingerprints. Paragraph word-frequency detection segments the submitted text into words, counts word frequencies paragraph by paragraph, sets a threshold, compares each array of the text under examination with each array of the reference text, and finally judges from this measure whether plagiarism has occurred. These prior-art methods suffer to some degree from low recognition rates and low efficiency.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a sliding-window document detection method and system.
The sliding-window document detection system comprises: a comparison database, for collecting the materials that serve as comparison objects; a word library, for collecting segmented words and their parts of speech, in which each segmented word is given a unique number and W_ID denotes the unique number of a segmented word in the word library; a word segmentation module, for segmenting each material and saving the word segmentation result to the comparison database; a word feature value generation module, which counts the number of times each segmented word occurs in the corresponding material and generates the word part-of-speech feature value for each segmented word; a word free vector dimension determination module, which determines the word free vector dimension WFV from the word segmentation result of a material, where WFV equals the number of distinct segmented words obtained by segmenting that material; a word simplified vector dimension generation module, which generates the word simplified vector dimension RWV; a word feature vector generation module, which extracts from each material the feature values corresponding to RWV and generates the word feature vector WVE_RWV; a user access mode detection module, for prompting the user to upload the document to be identified; a user detection mode judgment module, for judging the current user's detection mode; a to-be-identified document word segmentation module, which, when the detection mode is the ordinary plagiarism identification mode, segments the document to be identified to obtain a word segmentation result; a to-be-identified document word free vector dimension determination module, which determines the word free vector dimension WFV_TBI; a to-be-identified document word simplified vector dimension generation module, which generates the word simplified vector dimension RWV_TBI; and a to-be-identified document word feature vector generation module, which generates the word feature vector WVE_RWV_TBI. When the user detection mode judgment module judges that the current user's detection mode is the ordinary plagiarism identification mode, a similarity comparison is carried out. After the document to be identified has been compared with all materials, all suspect materials are extracted and the document to be identified is compared further with the suspect materials.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and practiced according to the description, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a block diagram of the sliding-window document detection system according to an embodiment of the invention;
Fig. 2 shows the sliding-window detection method according to an embodiment of the invention.
Detailed description
To further explain the technical means adopted by the invention to achieve its intended purpose, and the effects of those means, the specific implementations, features, and effects of the proposed system and method are described in detail below with reference to the drawings and preferred embodiments. In the following description, different references to "an embodiment" do not necessarily refer to the same embodiment. In addition, particular features, structures, or characteristics of one or more embodiments may be combined in any suitable form.
As shown in Fig. 1, the sliding-window document detection system of the invention (hereinafter "the system") comprises a material subsystem, a user subsystem, a suspect material extraction subsystem, and a comparison subsystem. The material subsystem prepares the materials used for comparison in plagiarism detection. The user subsystem manages user login information and determines the user's writing style. The suspect material extraction subsystem extracts from the comparison database the materials suspected of matching the document to be identified. The comparison subsystem compares the suspect materials with the document to be identified and generates a comparison report.
According to a specific embodiment of the invention, the material subsystem may further include one or more of: a comparison database; a word library, which contains a synonym and near-synonym library and a Chinese-foreign synonym library; a word segmentation module; a word group module; a Chinese-foreign word group module; a word part-of-speech classification module; a word group part-of-speech classification module; a Chinese-foreign word group part-of-speech classification module; a word feature value generation module; a word group feature value generation module; a Chinese-foreign word group feature value generation module; a word tightness coefficient generation module; a word group tightness coefficient generation module; a Chinese-foreign word group tightness coefficient generation module; a word tightness coefficient feature vector generation module; a word group tightness coefficient feature vector generation module; a Chinese-foreign word group tightness coefficient feature vector generation module; a word free vector dimension determination module; a word group free vector dimension determination module; a Chinese-foreign word group free vector dimension determination module; a word simplified vector dimension generation module; a word group simplified vector dimension generation module; a Chinese-foreign word group simplified vector dimension generation module; a word feature vector generation module; a word group feature vector generation module; and a Chinese-foreign word group feature vector generation module.
According to a specific embodiment of the invention, the user subsystem may further include one or more of: a user access mode detection module; a user detection mode judgment module; a user writing style test module; a test picture text-description feature value generation module; a test article text-description feature value generation module; a test picture text-description feature vector generation module; a test article text-description feature vector generation module; a test picture reference feature vector generation module; a test article reference feature vector generation module; a user test picture text-description feature value generation module; a user test picture text-description feature vector generation module; a user picture writing style feature vector generation module; a user test article text-description feature value generation module; a user test article text-description feature vector generation module; a user article writing style feature vector generation module; a user writing style feature vector generation module; a pending document feature value generation module; a pending document feature-value feature vector generation module; a user writing style similarity calculation module; a user writing style judgment module; and a user writing style structural-particle judgment module.
According to a specific embodiment of the invention, the suspect material extraction subsystem may further include one or more of the following modules, the first twenty-four of which operate on the document to be identified: a word segmentation module; a word group module; a Chinese-foreign word group module; a word part-of-speech classification module; a word group part-of-speech classification module; a Chinese-foreign word group part-of-speech classification module; a word feature value generation module; a word group feature value generation module; a Chinese-foreign word group feature value generation module; a word tightness coefficient generation module; a word group tightness coefficient generation module; a Chinese-foreign word group tightness coefficient generation module; a word tightness coefficient feature vector generation module; a word group tightness coefficient feature vector generation module; a Chinese-foreign word group tightness coefficient feature vector generation module; a word free vector dimension determination module; a word group free vector dimension determination module; a Chinese-foreign word group free vector dimension determination module; a word simplified vector dimension generation module; a word group simplified vector dimension generation module; a Chinese-foreign word group simplified vector dimension generation module; a word feature vector generation module; a word group feature vector generation module; a Chinese-foreign word group feature vector generation module; a to-be-identified document feature vector adjustment module; a material feature vector adjustment module; an ordinary plagiarism identification similarity calculation module; an extended plagiarism identification similarity calculation module; a multilingual plagiarism identification similarity calculation module; a to-be-identified document tightness coefficient statistics module; a material tightness coefficient statistics module; a formula extraction module; a formula decomposition module; and a tightness coefficient suspect material extraction module.
According to a specific embodiment of the invention, the comparison subsystem may further include a sliding window setting module, a sliding window comparison module, and a comparison report generation module.
According to a specific embodiment of the invention, the system includes a comparison database for collecting the materials that serve as comparison objects. The comparison database further includes sub-libraries such as a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library, and a poetry library. The book library collects publicly published books; the paper library collects journal articles, conference papers, academic dissertations, and the like; the patent library collects patent publications and the like. When a material is collected, its source must also be recorded: for a book, the publication date, publisher, author, and book number; for a journal article, the issue date, journal name, issue number, and author; for a conference paper, the conference title, venue, date, and author; for a dissertation, the institution, graduation date, degree level, and author. From the recorded source information, a person skilled in the art can uniquely locate the material. Preferably, the materials collected in the comparison database are not limited to Chinese materials and also include foreign-language materials. After it is set up, the comparison database must also be maintained, periodically or as needed, to add newly published books, journal articles, conference papers, dissertations, patent publications, and the like. The proverb and common-saying library collects sentences and phrases that circulate widely online or among the public. The famous-quotation library collects quotations from well-known figures, and the poetry library collects poems, ci, qu, fu, and similar classical verse. Setting up the proverb and common-saying library, the famous-quotation library, the poetry library, and similar sub-libraries extends the range of comparison materials beyond traditional books, papers, and patent documents and makes plagiarism detection more comprehensive. A person skilled in the art will appreciate that the comparison database may also include other types of material, which are not described further here.
Preferably, when collecting materials, the comparison database classifies them by technical field. According to a specific embodiment of the invention, field labels may follow the Chinese Library Classification, which has 5 basic categories and 22 major classes and uses a mixed notation that combines Hanyu Pinyin letters with Arabic numerals: one letter denotes a major class, alphabetical order reflects the order of the major classes, and digits follow the letter. For example, A1 denotes the works of Marx and Engels, K6 denotes the history of Oceania, and TN denotes electronic and communication technology. To accommodate the development of industrial technology, the second-level classes of industrial technology use two letters. A person skilled in the art will appreciate that other classification systems may also be used to label materials by field.
Preferably, when collecting materials, the comparison database indexes each material separately by title, author, abstract, and body text, and sets up associations among the title, author, abstract, and body text of each material, so that the remaining parts of a material can be retrieved from any one of them.
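The association among title, author, abstract, and body text can be pictured with a small Python sketch. It is only an illustration under assumptions: the class names, the exact-match lookup, and the sample record are hypothetical, and the source does not say how the index is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Material:
    material_id: int
    title: str
    author: str
    abstract: str
    text: str


class MaterialIndex:
    """Hypothetical index: one lookup table per indexed part of a material."""

    FIELDS = ("title", "author", "abstract", "text")

    def __init__(self):
        self._by_id = {}
        self._index = {name: {} for name in self.FIELDS}

    def add(self, material):
        self._by_id[material.material_id] = material
        for name in self.FIELDS:
            key = getattr(material, name)
            self._index[name].setdefault(key, set()).add(material.material_id)

    def lookup(self, field_name, value):
        """Return whole materials, so any one part leads to the remaining parts."""
        ids = self._index[field_name].get(value, ())
        return [self._by_id[i] for i in sorted(ids)]


index = MaterialIndex()
index.add(Material(1, "On Windows", "A. Author", "Short abstract.", "Full body text."))
print(index.lookup("author", "A. Author")[0].title)  # On Windows
```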
Preferably, when collecting materials, the comparison database extracts and copies the formulas found in each material and stores them separately in a formula library. Each formula in the formula library is associated with the material from which it was extracted, so that the full text of the corresponding material can be retrieved from the formula. According to a specific embodiment of the invention, when a formula is collected, its independent-variable parameters, dependent-variable parameters, and operators are extracted and stored separately. According to a specific embodiment of the invention, after the independent-variable and dependent-variable parameters are extracted, the specific meaning, dimension, and value range of each parameter are further extracted and stored separately. According to a specific embodiment of the invention, after the operators of a formula are extracted, each operator is further annotated with Chinese and foreign-language text. For every formula in the formula library, the following are therefore stored: the symbols of its independent-variable and dependent-variable parameters; the Chinese and foreign-language statements of the specific meaning of each variable; the dimension and value range of each variable; and the operators together with their Chinese and foreign-language text annotations. Setting up the formula library extends the range of comparison materials to formula comparison and makes plagiarism detection more comprehensive. A person skilled in the art will appreciate that the comparison database may also extract other content from materials, such as chemical formulas and gene sequences, which is not described further here.
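One possible shape for a formula library record is sketched below in Python. The field names, the language codes, and the sample formula (Newton's second law) are assumptions chosen for illustration; the source only lists which pieces of information are stored.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    symbol: str
    meaning: dict      # language code -> statement of the parameter's meaning
    dimension: str
    value_range: str


@dataclass
class FormulaRecord:
    material_id: int   # link back to the material the formula was extracted from
    independent: list  # independent-variable parameters
    dependent: list    # dependent-variable parameters
    operators: dict = field(default_factory=dict)  # operator -> {language: annotation}


def material_for(record, materials):
    """Retrieve the full text of the material a formula came from."""
    return materials[record.material_id]


# Illustrative record for F = m * a (sample data, not taken from the source).
record = FormulaRecord(
    material_id=42,
    independent=[
        Parameter("m", {"zh": "质量", "en": "mass"}, "kg", "m > 0"),
        Parameter("a", {"zh": "加速度", "en": "acceleration"}, "m/s^2", "any real"),
    ],
    dependent=[Parameter("F", {"zh": "力", "en": "force"}, "N", "any real")],
    operators={"*": {"zh": "乘", "en": "multiplied by"}},
)
print(material_for(record, {42: "full text of material 42"}))
```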
According to a specific embodiment of the invention, the comparison database is stored in a distributed manner at different sites, and when the comparison database is accessed, a particular site can be chosen according to the load on the different sites. Each site counts the quantity of material extracted from the comparison database during the current unit time period; this quantity may be the number of materials or their byte count. From this count the site obtains its average load, which it reports periodically to the suspect material extraction subsystem. When the suspect material extraction subsystem needs to extract materials from the comparison database in order to select suspect materials, it accesses the site with the lowest average load among the most recently reported values. The unit time period is configured by the system and may be set to 5, 10, 30, or 60 minutes as required. According to a specific embodiment of the invention, the different sub-libraries of the comparison database may be stored in a distributed manner at different sites, and the comparison database is then accessed at the site where each sub-library is stored. When the suspect material extraction subsystem needs to extract materials from the comparison database in order to select suspect materials, it selects the sub-library to access according to the technical field or type of the materials to be fetched.
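The site selection rule described above can be sketched in a few lines of Python. The record layout, the site names, and the assumption that average load means extracted quantity per minute are all illustrative; the source does not define how the average is computed.

```python
from dataclasses import dataclass


@dataclass
class LoadReport:
    site: str
    extracted: int        # materials (or bytes) extracted during the window
    window_minutes: int   # system-configured window, e.g. 5, 10, 30 or 60

    @property
    def average_load(self) -> float:
        # Assumption: average load = extracted quantity per minute.
        return self.extracted / self.window_minutes


def choose_site(latest_reports):
    """Return the site whose most recent report shows the lowest average load."""
    return min(latest_reports, key=lambda r: r.average_load).site


reports = [
    LoadReport("site-a", extracted=600, window_minutes=10),  # 60 per minute
    LoadReport("site-b", extracted=200, window_minutes=5),   # 40 per minute
    LoadReport("site-c", extracted=900, window_minutes=30),  # 30 per minute
]
print(choose_site(reports))  # site-c
```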
According to a specific embodiment of the invention, the system includes a word library for collecting segmented words and their parts of speech. The word library is set up in advance by the system and is maintained regularly, for example by adding new words. Preferably, the word library gives each segmented word a unique number, and W_ID may be used to denote the unique number of a segmented word in the word library. The word library stores the part of speech of each segmented word, such as noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, or onomatopoeia. According to a specific embodiment of the invention, the word segmentation result is divided by part of speech into content words and function words: content words include nouns, verbs, adjectives, numerals, measure words, and pronouns; function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Preferably, the word library further includes a synonym and near-synonym library, in which segmented words with the same or similar meaning form a group and numbering is done per group. Several segmented words with the same or similar meaning thus correspond to one word group number, and WG_ID may be used to denote the unique number of a word group in the word library. Preferably, the word library further includes a Chinese-foreign synonym and near-synonym library, in which Chinese and foreign-language segmented words with the same or similar meaning form a group and numbering is done per group. Several Chinese and foreign-language segmented words with the same or similar meaning thus correspond to one Chinese-foreign word group number, and WFG_ID may be used to denote the unique number of a Chinese-foreign word group in the word library.
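A minimal Python sketch of such a word library follows. The class, its sequential numbering scheme, and the sample words are assumptions for illustration; the source only requires that each word has a unique W_ID and that synonyms share a WG_ID.

```python
from dataclasses import dataclass, field


@dataclass
class WordLibrary:
    """Hypothetical word library: W_ID per word, WG_ID shared by synonyms."""

    w_id: dict = field(default_factory=dict)    # word -> W_ID
    w_char: dict = field(default_factory=dict)  # word -> part of speech
    wg_id: dict = field(default_factory=dict)   # word -> WG_ID
    _next_group: int = 1

    def add_group(self, words_with_pos):
        """Register words of the same or similar meaning as one word group."""
        group = self._next_group
        self._next_group += 1
        for word, part_of_speech in words_with_pos:
            # setdefault keeps an existing W_ID if the word was added before.
            self.w_id.setdefault(word, len(self.w_id) + 1)
            self.w_char[word] = part_of_speech
            self.wg_id[word] = group
        return group


library = WordLibrary()
library.add_group([("fast", "adjective"), ("quick", "adjective")])
library.add_group([("run", "verb")])
print(library.w_id["quick"], library.wg_id["quick"], library.wg_id["run"])  # 2 1 2
```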
According to a specific embodiment of the invention, the system includes a word segmentation module for segmenting each material and saving the word segmentation result to the comparison database. Preferably, the word segmentation module compares the word segmentation result with the parts of speech stored in the word library to determine the part of speech of each segmented word. Preferably, the word part-of-speech classification module classifies the word segmentation result according to the corresponding parts of speech.
According to a specific embodiment of the invention, the system includes a word group module for segmenting each material and saving the word group result to the comparison database. Preferably, the word group module compares the word segmentation result with the parts of speech stored in the word library to determine the part of speech of each word group. Preferably, the word group part-of-speech classification module classifies the word group result according to the corresponding parts of speech.
According to a specific embodiment of the invention, the system includes a Chinese-foreign word group module for segmenting each material and saving the Chinese-foreign word group result to the comparison database. Preferably, the Chinese-foreign word group module compares the Chinese and foreign-language word segmentation result with the parts of speech stored in the word library to determine the part of speech of each Chinese-foreign word group. Preferably, the Chinese-foreign word group part-of-speech classification module classifies the Chinese-foreign word group result according to the corresponding parts of speech.
According to a specific embodiment of the invention, the word part-of-speech classification module, the word group part-of-speech classification module, and the Chinese-foreign word group part-of-speech classification module divide the word segmentation result, the word group result, and the Chinese-foreign word group result, respectively, by part of speech into class A content words, class B content words, class C content words, class D content words, and class V function words. Class A content words are nouns; class B content words are verbs and adjectives; class C content words are numerals and measure words; class D content words are pronouns; class V function words are adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Preferably, the word library further divides nouns into technical terms and common nouns. According to a specific embodiment of the invention, the word segmentation result is then divided by part of speech into class A1 content words, class A2 content words, class B content words, class C content words, class D content words, and class V function words. Class A1 content words are technical-term nouns and class A2 content words are common nouns; classes B, C, D, and V are defined as above. A person skilled in the art may choose a different classification scheme as required.
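The two classification schemes above amount to a lookup table, sketched here in Python. The English part-of-speech labels and the function signature are assumptions for illustration.

```python
# Part of speech -> class, following the five-class scheme described above.
CLASS_BY_POS = {
    "noun": "A",
    "verb": "B", "adjective": "B",
    "numeral": "C", "measure word": "C",
    "pronoun": "D",
    "adverb": "V", "preposition": "V", "conjunction": "V",
    "auxiliary word": "V", "interjection": "V", "onomatopoeia": "V",
}


def classify(pos, technical_term=None):
    """Return the class of a word.

    When technical_term is given for a noun, the refined scheme is used:
    A1 for technical terms, A2 for common nouns.
    """
    word_class = CLASS_BY_POS[pos]
    if word_class == "A" and technical_term is not None:
        return "A1" if technical_term else "A2"
    return word_class


print(classify("verb"), classify("noun"), classify("noun", technical_term=True))
# B A A1
```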
According to a specific embodiment of the invention, the word feature value generation module counts the number of times each segmented word occurs in the corresponding material and generates the word feature value WCV = [W_ID, W_N] for each segmented word, where W_ID is the unique number of the segmented word in the word library and W_N is the total number of times the segmented word occurs in the material. Preferably, to take the part of speech of each segmented word into account, the word feature value generation module generates the word part-of-speech feature value WCCV = [W_ID, W_N, W_CHAR], where W_CHAR is the part of speech of the segmented word.
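As a rough illustration, WCV and WCCV can be produced with a frequency count. In the Python sketch below, the function name, the dictionary-based word library, and the sample tokens are assumptions.

```python
from collections import Counter


def word_feature_values(tokens, w_id, w_char=None):
    """Build WCV = [W_ID, W_N] per word, or WCCV = [W_ID, W_N, W_CHAR]
    when a part-of-speech table is supplied."""
    counts = Counter(tokens)  # W_N for every distinct word, in first-seen order
    values = []
    for word, w_n in counts.items():
        value = [w_id[word], w_n]
        if w_char is not None:
            value.append(w_char[word])
        values.append(value)
    return values


tokens = ["window", "slides", "window"]
w_id = {"window": 7, "slides": 3}
w_char = {"window": "noun", "slides": "verb"}
print(word_feature_values(tokens, w_id))          # [[7, 2], [3, 1]]
print(word_feature_values(tokens, w_id, w_char))  # [[7, 2, 'noun'], [3, 1, 'verb']]
```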
According to a specific embodiment of the invention, the word group feature value generation module counts the number of times each word group occurs in the corresponding material and generates the word group feature value WGCV = [WG_ID, WG_N] for each word group, where WG_ID is the unique number of the word group in the word library and WG_N is the total number of times the word group occurs in the material. Preferably, to take the part of speech of each word group into account, the word group feature value generation module generates the word group part-of-speech feature value WGCCV = [WG_ID, WG_N, WG_CHAR], where WG_CHAR is the part of speech of the word group.
According to a specific embodiment of the invention, the Chinese-foreign word group feature value generation module counts the number of times each Chinese-foreign word group occurs in the corresponding material and generates the Chinese-foreign word group feature value WFGCV = [WFG_ID, WFG_N] for each Chinese-foreign word group, where WFG_ID is the unique number of the Chinese-foreign word group in the word library and WFG_N is the total number of times the Chinese-foreign word group occurs in the material. Preferably, to take the part of speech of each Chinese-foreign word group into account, the Chinese-foreign word group feature value generation module generates the Chinese-foreign word group part-of-speech feature value WFGCCV = [WFG_ID, WFG_N, WFG_CHAR], where WFG_CHAR is the part of speech of the Chinese-foreign word group.
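Group-level feature values differ from word-level ones only in that each segmented word is first mapped to its group number. The Python sketch below handles both WGCV (with a WG_ID table) and WFGCV (with a WFG_ID table); the function name, the choice to skip words without a group, and the sample tables are assumptions.

```python
from collections import Counter


def group_feature_values(tokens, group_id):
    """Build [group number, occurrence count] pairs for one material.

    group_id maps a word to WG_ID (giving WGCV) or to WFG_ID (giving WFGCV).
    Words that belong to no group are skipped (an assumption).
    """
    counts = Counter(group_id[t] for t in tokens if t in group_id)
    return [[gid, n] for gid, n in counts.items()]


wg_id = {"fast": 1, "quick": 1, "run": 2}
print(group_feature_values(["fast", "quick", "run", "fast"], wg_id))  # [[1, 3], [2, 1]]

wfg_id = {"窗口": 9, "window": 9}
print(group_feature_values(["窗口", "window"], wfg_id))  # [[9, 2]]
```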
According to a specific embodiment of the present invention, the word compactness coefficient generation module generates word compactness coefficients. A word compactness coefficient is the number of segmented words separating two adjacent occurrences of the same segmented word within the whole material. According to a specific embodiment of the present invention, the word compactness coefficients of each segmented word are expressed as WGC=[G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of segmented words between the first and second occurrences of the segmented word in the material, G_W_ID_2 is the number between the second and third occurrences, and G_W_ID_(W_N-1) is the number between the (W_N-1)-th and W_N-th occurrences; G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the word compactness coefficients of the segmented word. According to a specific embodiment of the present invention, the word compactness coefficient feature vector generation module generates the word compactness coefficient feature vector WGCVE=[W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the segmented word in the word segmentation lexicon, W_N is the total number of times the segmented word occurs in the material, and W_CHAR is the part of speech of the segmented word. The word compactness coefficients reveal the overall distribution of a specific segmented word within the corresponding material.
According to a specific embodiment of the present invention, the word group compactness coefficient generation module generates word group compactness coefficients. A word group compactness coefficient is the number of segmented words separating two adjacent occurrences of the same word group within the whole material. According to a specific embodiment of the present invention, the word group compactness coefficients of each word group are expressed as WGGC=[G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 is the number of segmented words between the first and second occurrences of the word group in the material, G_WG_ID_2 is the number between the second and third occurrences, and G_WG_ID_(WG_N-1) is the number between the (WG_N-1)-th and WG_N-th occurrences; G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the word group compactness coefficients of the word group. According to a specific embodiment of the present invention, the word group compactness coefficient feature vector generation module generates the word group compactness coefficient feature vector WGGCVE=[WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID is the unique number of the word group in the word segmentation lexicon, WG_N is the total number of times the word group occurs in the material, and WG_CHAR is the part of speech of the word group. The word group compactness coefficients reveal the overall distribution of a specific word group within the corresponding material.
According to a specific embodiment of the present invention, the Chinese-foreign word group compactness coefficient generation module generates Chinese-foreign word group compactness coefficients. A Chinese-foreign word group compactness coefficient is the number of segmented words separating two adjacent occurrences of the same Chinese-foreign word group within the whole material. According to a specific embodiment of the present invention, the compactness coefficients of each Chinese-foreign word group are expressed as WFGGC=[G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 is the number of segmented words between the first and second occurrences of the Chinese-foreign word group in the material, G_WFG_ID_2 is the number between the second and third occurrences, and G_WFG_ID_(WFG_N-1) is the number between the (WFG_N-1)-th and WFG_N-th occurrences; G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the compactness coefficients of the Chinese-foreign word group. According to a specific embodiment of the present invention, the Chinese-foreign word group compactness coefficient feature vector generation module generates the Chinese-foreign word group compactness coefficient feature vector WFGGCVE=[WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID is the unique number of the Chinese-foreign word group in the word segmentation lexicon, WFG_N is the total number of times the Chinese-foreign word group occurs in the material, and WFG_CHAR is the part of speech of the Chinese-foreign word group. The Chinese-foreign word group compactness coefficients reveal the overall distribution of a specific Chinese-foreign word group within the corresponding material.
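A short sketch of how the gaps G_W_ID_1 ... G_W_ID_(W_N-1) and the vector WGCVE might be computed from a token sequence. It assumes that the interval counts only the tokens strictly between two adjacent occurrences; the text does not state whether the endpoints are included. The function names are hypothetical.

```python
def compactness_coefficients(tokens, word):
    """Gaps between adjacent occurrences of `word`, counted as the
    number of tokens strictly between them (an assumed convention)."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [later - earlier - 1
            for earlier, later in zip(positions, positions[1:])]

def compactness_feature_vector(tokens, word, w_id, w_char):
    """Build a WGCVE-style list [W_ID, W_N, W_CHAR, gap_1, ..., gap_(W_N-1)]."""
    w_n = tokens.count(word)
    return [w_id, w_n, w_char] + compactness_coefficients(tokens, word)

tokens = ["x", "a", "b", "x", "x", "c", "d", "e", "x"]
gaps = compactness_coefficients(tokens, "x")
vector = compactness_feature_vector(tokens, "x", 5, "n")
```

A word that occurs W_N times always yields W_N-1 gaps, so a word occurring once contributes no coefficients at all.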
According to a specific embodiment of the present invention, the word free vector dimension determination module determines the word free vector dimension WFV from the word segmentation result of the material; the word free vector dimension WFV equals the number of distinct segmented words obtained by segmenting the specific material. When the material is short or yields few segmented words, the resulting WFV is small; when the material is long or yields many segmented words, the resulting WFV is large.
According to a specific embodiment of the present invention, the word group free vector dimension determination module determines the word group free vector dimension WGFV from the word segmentation result of the material; the word group free vector dimension WGFV equals the number of distinct word groups obtained by segmenting the specific material. When the material is short or yields few word groups, the resulting WGFV is small; when the material is long or yields many word groups, the resulting WGFV is large.
According to a specific embodiment of the present invention, the Chinese-foreign word group free vector dimension determination module determines the Chinese-foreign word group free vector dimension WFGFV from the word segmentation result of the material; the Chinese-foreign word group free vector dimension WFGFV equals the number of distinct Chinese-foreign word groups obtained by segmenting the specific material. When the material is short or yields few Chinese-foreign word groups, the resulting WFGFV is small; when the material is long or yields many Chinese-foreign word groups, the resulting WFGFV is large.
According to a specific embodiment of the present invention, the word reduced vector dimension generation module reduces the word free vector dimension WFV of each material and generates the word reduced vector dimension RWV. The word reduced vector dimension RWV is specified by the system. Preferably, the system specifies a word reduced vector dimension RWV of 500. Preferably, the system specifies a word reduced vector dimension RWV of 800. Preferably, the system specifies a word reduced vector dimension RWV of 1000.
According to a specific embodiment of the present invention, the word reduced vector dimension generation module reduces the word free vector dimension WFV by an equal-interval extraction method. The reduction proceeds as follows: it is judged whether the word free vector dimension WFV is greater than the word reduced vector dimension RWV; if so, WFV is divided by the system-specified RWV and the quotient is rounded up to obtain the reduction coefficient REDU; one feature value is then extracted at every interval of REDU-1 from the feature values corresponding to WFV; after all feature values have been traversed, it is judged whether the number of extracted feature values equals RWV; when it equals RWV, the reduction of WFV is complete; when it is less than RWV, the difference between RWV and the number of extracted feature values is calculated, and that many feature values are randomly extracted from the feature values not yet extracted, which completes the reduction of WFV.
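The equal-interval extraction described above can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: extraction starts at the first feature value, and a material whose WFV does not exceed RWV is returned unchanged. The function name and the optional `rng` parameter are hypothetical.

```python
import math
import random

def equal_interval_reduce(values, rwv, rng=None):
    """Reduce a list of feature values to rwv entries by equal-interval
    extraction, topping up at random from the unextracted values."""
    rng = rng or random.Random()
    wfv = len(values)
    if wfv <= rwv:
        return list(values)              # assumed: nothing to reduce
    redu = math.ceil(wfv / rwv)          # reduction coefficient REDU
    picked = list(range(0, wfv, redu))   # skip REDU-1 values between picks
    shortfall = rwv - len(picked)
    if shortfall > 0:
        picked_set = set(picked)
        unpicked = [i for i in range(wfv) if i not in picked_set]
        picked += rng.sample(unpicked, shortfall)
    return [values[i] for i in sorted(picked)]

exact = equal_interval_reduce(list(range(10)), 4)
topped_up = equal_interval_reduce(list(range(7)), 5, random.Random(0))
```

Because REDU is rounded up, the regular pass can never extract more than RWV values, so only the shortfall case needs the random top-up.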
According to a specific embodiment of the present invention, the word reduced vector dimension generation module reduces the word free vector dimension WFV by a part-of-speech screening method. The reduction proceeds as follows: the feature values of the word segmentation result are classified according to the part of speech of the corresponding segmented word; according to a specific embodiment of the present invention, the feature values are divided into class A1 content-word feature values, class A2 content-word feature values, class B content-word feature values, class C content-word feature values, class D content-word feature values and class V function-word feature values. It is generally considered that feature values corresponding to content words play a larger role in similarity comparison, and that technical-term nouns reflect the effective content of a material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (number of class A1 content-word feature values), AMOUNT_A2 (number of class A2 content-word feature values), AMOUNT_B (number of class B content-word feature values), AMOUNT_C (number of class C content-word feature values), AMOUNT_D (number of class D content-word feature values) and AMOUNT_V (number of class V function-word feature values). The value RWV_S_V = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is calculated; if it is greater than 0, this reduction is exited; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_D = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D) is further calculated; if it is greater than 0, RWV_S_D feature values are randomly extracted from the feature values counted in AMOUNT_V, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_C = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C) is further calculated; if it is greater than 0, RWV_S_C feature values are randomly extracted from the feature values counted in AMOUNT_D, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_B = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B) is further calculated; if it is greater than 0, RWV_S_B feature values are randomly extracted from the feature values counted in AMOUNT_C, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_A2 = RWV-(AMOUNT_A1+AMOUNT_A2) is further calculated; if it is greater than 0, RWV_S_A2 feature values are randomly extracted from the feature values counted in AMOUNT_B, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_A1 = RWV-AMOUNT_A1 is further calculated; if it is greater than 0, RWV_S_A1 feature values are randomly extracted from the feature values counted in AMOUNT_A2, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, RWV feature values are randomly extracted from the feature values counted in AMOUNT_A1, and this reduction is complete.
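The cascade of differences (RWV_S_V, RWV_S_D, ..., RWV_S_A1) can be restated as a greedy fill in priority order A1, A2, B, C, D, V. The sketch below takes that reading and assumes that every class ranked above the one being sampled is retained in full, which the text implies but does not state outright. The function name, the `by_class` dictionary and the `None` return value for the exit case are hypothetical.

```python
import random

PRIORITY = ("A1", "A2", "B", "C", "D", "V")  # content-word classes first, function words last

def pos_screen_reduce(by_class, rwv, rng=None):
    """Reduce classified feature values to rwv entries.

    by_class maps a class label to its list of feature values.
    Returns None when there are fewer than rwv values in total
    (the RWV_S_V > 0 case, where the reduction is exited)."""
    rng = rng or random.Random()
    total = sum(len(by_class.get(c, [])) for c in PRIORITY)
    if rwv > total:
        return None
    kept = []
    for label in PRIORITY:
        room = rwv - len(kept)
        if room == 0:
            break
        items = by_class.get(label, [])
        if len(items) <= room:
            kept.extend(items)                    # the whole class fits
        else:
            kept.extend(rng.sample(items, room))  # random draw fills the remainder
    return kept

classes = {"A1": [1, 2], "A2": [3], "B": [4, 5, 6], "V": [7, 8]}
reduced = pos_screen_reduce(classes, 4, random.Random(1))
```

In this example classes A1 and A2 are kept whole and one value is drawn at random from class B, which corresponds to the RWV_S_A2 > 0 branch of the cascade.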
The case in which the value RWV_S_V = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the material is short or carries little information and is therefore unsuitable for feature-value comparison.
When the word free vector dimension WFV is less than the word reduced vector dimension RWV, the material itself has few dimensions, and the values under the remaining dimensions are equivalent to 0. Such materials need to be marked directly in the system and collected and processed separately. Examples include folk sayings and famous quotations, which are used for index lookup. They are used subsequently when the full-text sliding window performs full-text comparison.
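As a minimal sketch, the free vector dimension and the short-material check could look like this. The function names are hypothetical, and the check covers only the comparison of WFV against RWV, not the separate collection or indexing that follows.

```python
def free_vector_dimension(tokens):
    """WFV: the number of distinct tokens in the segmentation result."""
    return len(set(tokens))

def needs_separate_handling(tokens, rwv):
    """True when WFV < RWV, i.e. the material should be marked
    and processed outside the feature-value comparison."""
    return free_vector_dimension(tokens) < rwv

proverb = ["a", "stitch", "in", "time", "saves", "nine"]
flagged = needs_separate_handling(proverb, 500)
```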
According to a specific embodiment of the present invention, the word group reduced vector dimension generation module reduces the word group free vector dimension WGFV of each material and generates the word group reduced vector dimension RWGV. The word group reduced vector dimension RWGV is specified by the system. Preferably, the system specifies a word group reduced vector dimension RWGV of 500. Preferably, the system specifies a word group reduced vector dimension RWGV of 800. Preferably, the system specifies a word group reduced vector dimension RWGV of 1000.
According to a specific embodiment of the present invention, the word group reduced vector dimension generation module reduces the word group free vector dimension WGFV by an equal-interval extraction method. The reduction proceeds as follows: it is judged whether the word group free vector dimension WGFV is greater than the word group reduced vector dimension RWGV; if so, WGFV is divided by the system-specified RWGV and the quotient is rounded up to obtain the reduction coefficient REDU; one feature value is then extracted at every interval of REDU-1 from the feature values corresponding to WGFV; after all feature values have been traversed, it is judged whether the number of extracted feature values equals RWGV; when it equals RWGV, the reduction of WGFV is complete; when it is less than RWGV, the difference between RWGV and the number of extracted feature values is calculated, and that many feature values are randomly extracted from the feature values not yet extracted, which completes the reduction of WGFV.
According to a specific embodiment of the present invention, the word group reduced vector dimension generation module reduces the word group free vector dimension WGFV by a part-of-speech screening method. The reduction proceeds as follows: the feature values are classified according to the corresponding part of speech; according to a specific embodiment of the present invention, the feature values are divided into class A1 content-word feature values, class A2 content-word feature values, class B content-word feature values, class C content-word feature values, class D content-word feature values and class V function-word feature values. It is generally considered that feature values corresponding to content words play a larger role in similarity comparison, and that technical-term nouns reflect the effective content of a material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (number of class A1 content-word feature values), AMOUNT_A2 (number of class A2 content-word feature values), AMOUNT_B (number of class B content-word feature values), AMOUNT_C (number of class C content-word feature values), AMOUNT_D (number of class D content-word feature values) and AMOUNT_V (number of class V function-word feature values). The value RWGV_S_V = RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is calculated; if it is greater than 0, this reduction is exited; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWGV_S_D = RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D) is further calculated; if it is greater than 0, RWGV_S_D feature values are randomly extracted from the feature values counted in AMOUNT_V, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWGV_S_C = RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C) is further calculated; if it is greater than 0, RWGV_S_C feature values are randomly extracted from the feature values counted in AMOUNT_D, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWGV_S_B = RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B) is further calculated; if it is greater than 0, RWGV_S_B feature values are randomly extracted from the feature values counted in AMOUNT_C, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWGV_S_A2 = RWGV-(AMOUNT_A1+AMOUNT_A2) is further calculated; if it is greater than 0, RWGV_S_A2 feature values are randomly extracted from the feature values counted in AMOUNT_B, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWGV_S_A1 = RWGV-AMOUNT_A1 is further calculated; if it is greater than 0, RWGV_S_A1 feature values are randomly extracted from the feature values counted in AMOUNT_A2, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, RWGV feature values are randomly extracted from the feature values counted in AMOUNT_A1, and this reduction is complete.
The case in which the value RWGV_S_V = RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the material is short or carries little information and is therefore unsuitable for feature-value comparison.
When the word group free vector dimension WGFV is less than the word group reduced vector dimension RWGV, the material itself has few dimensions, and the values under the remaining dimensions are equivalent to 0. Such materials need to be marked directly in the system and collected and processed separately. Examples include folk sayings and famous quotations, which are used for index lookup. They are used subsequently when the full-text sliding window performs full-text comparison.
According to a specific embodiment of the present invention, the Chinese-foreign word group reduced vector dimension generation module reduces the Chinese-foreign word group free vector dimension WFGFV of each material and generates the Chinese-foreign word group reduced vector dimension RWFGV. The Chinese-foreign word group reduced vector dimension RWFGV is specified by the system. Preferably, the system specifies a Chinese-foreign word group reduced vector dimension RWFGV of 500. Preferably, the system specifies a Chinese-foreign word group reduced vector dimension RWFGV of 800. Preferably, the system specifies a Chinese-foreign word group reduced vector dimension RWFGV of 1000.
According to a specific embodiment of the present invention, the Chinese-foreign word group reduced vector dimension generation module reduces the Chinese-foreign word group free vector dimension WFGFV by an equal-interval extraction method. The reduction proceeds as follows: it is judged whether the Chinese-foreign word group free vector dimension WFGFV is greater than the Chinese-foreign word group reduced vector dimension RWFGV; if so, WFGFV is divided by the system-specified RWFGV and the quotient is rounded up to obtain the reduction coefficient REDU; one feature value is then extracted at every interval of REDU-1 from the feature values corresponding to WFGFV; after all feature values have been traversed, it is judged whether the number of extracted feature values equals RWFGV; when it equals RWFGV, the reduction of WFGFV is complete; when it is less than RWFGV, the difference between RWFGV and the number of extracted feature values is calculated, and that many feature values are randomly extracted from the feature values not yet extracted, which completes the reduction of WFGFV.
According to a specific embodiment of the present invention, the Chinese-foreign word group reduced vector dimension generation module reduces the Chinese-foreign word group free vector dimension WFGFV by a part-of-speech screening method. The reduction proceeds as follows: the feature values are classified according to the corresponding part of speech; according to a specific embodiment of the present invention, the feature values are divided into class A1 content-word feature values, class A2 content-word feature values, class B content-word feature values, class C content-word feature values, class D content-word feature values and class V function-word feature values. It is generally considered that feature values corresponding to content words play a larger role in similarity comparison, and that technical-term nouns reflect the effective content of a material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (number of class A1 content-word feature values), AMOUNT_A2 (number of class A2 content-word feature values), AMOUNT_B (number of class B content-word feature values), AMOUNT_C (number of class C content-word feature values), AMOUNT_D (number of class D content-word feature values) and AMOUNT_V (number of class V function-word feature values). The value RWFGV_S_V = RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is calculated; if it is greater than 0, this reduction is exited; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWFGV_S_D = RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D) is further calculated; if it is greater than 0, RWFGV_S_D feature values are randomly extracted from the feature values counted in AMOUNT_V, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWFGV_S_C = RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C) is further calculated; if it is greater than 0, RWFGV_S_C feature values are randomly extracted from the feature values counted in AMOUNT_D, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWFGV_S_B = RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B) is further calculated; if it is greater than 0, RWFGV_S_B feature values are randomly extracted from the feature values counted in AMOUNT_C, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWFGV_S_A2 = RWFGV-(AMOUNT_A1+AMOUNT_A2) is further calculated; if it is greater than 0, RWFGV_S_A2 feature values are randomly extracted from the feature values counted in AMOUNT_B, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWFGV_S_A1 = RWFGV-AMOUNT_A1 is further calculated; if it is greater than 0, RWFGV_S_A1 feature values are randomly extracted from the feature values counted in AMOUNT_A2, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, RWFGV feature values are randomly extracted from the feature values counted in AMOUNT_A1, and this reduction is complete.
The case in which the value RWFGV_S_V = RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the material is short or carries little information and is therefore unsuitable for feature-value comparison.
When the Chinese-foreign word group free vector dimension WFGFV is less than the Chinese-foreign word group reduced vector dimension RWFGV, the material itself has few dimensions, and the values under the remaining dimensions are equivalent to 0. Such materials need to be marked directly in the system and collected and processed separately. Examples include folk sayings and famous quotations, which are used for index lookup. They are used subsequently when the full-text sliding window performs full-text comparison.
According to a specific embodiment of the present invention, the word feature vector generation module extracts, for each material, the feature values corresponding to the word reduced vector dimension RWV and generates the word feature vector WVE_RWV:
WVE_RWV=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]
where W_ID_i is the unique number of the i-th segmented word in the word segmentation lexicon and W_N_i is the total number of times that segmented word occurs in the material; this count serves as the feature value of the segmented word.
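Assembling WVE_RWV amounts to flattening the retained (W_ID, W_N) pairs into one list. A minimal sketch, assuming the reduced feature values arrive as a list of pairs; the function name and sample values are hypothetical.

```python
def word_feature_vector(reduced_pairs):
    """Flatten [(W_ID_1, W_N_1), ..., (W_ID_RWV, W_N_RWV)] into
    the WVE_RWV layout [W_ID_1, W_N_1, ..., W_ID_RWV, W_N_RWV]."""
    vector = []
    for w_id, w_n in reduced_pairs:
        vector.extend([w_id, w_n])
    return vector

wve = word_feature_vector([(3, 2), (7, 1), (12, 5)])
```

The word group and Chinese-foreign word group vectors share this layout, with WG_ID/WG_N or WFG_ID/WFG_N pairs in place of W_ID/W_N.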
According to a specific embodiment of the present invention, the word group feature vector generation module extracts, for each material, the feature values corresponding to the word group reduced vector dimension RWGV and generates the word group feature vector WVE_RWGV:
WVE_RWGV=[WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV, WG_N_RWGV]
where WG_ID_i is the unique number of the i-th word group in the word segmentation lexicon and WG_N_i is the total number of times that word group occurs in the material; this count serves as the feature value of the word group.
According to a specific embodiment of the present invention, the Chinese-foreign word group feature vector generation module extracts, for each material, the feature values corresponding to the Chinese-foreign word group reduced vector dimension RWFGV and generates the Chinese-foreign word group feature vector WVE_RWFGV:
WVE_RWFGV=[WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV, WFG_N_RWFGV]
where WFG_ID_i is the unique number of the i-th Chinese-foreign word group in the word segmentation lexicon and WFG_N_i is the total number of times that Chinese-foreign word group occurs in the material; this count serves as the feature value of the Chinese-foreign word group.
According to a specific embodiment of the present invention, the system provides users with multiple access modes. When a user accesses the system, the user access mode detection module detects the access mode of the current user.
In a specific embodiment of the present invention, a user can access the system in trial mode; a user who accesses the system in trial mode is hereinafter referred to as a trial user. When the user access mode detection module detects that a user is accessing in trial mode, it sends a prompt to the trial user stating that the current access mode is trial mode and informing the trial user of the applicable usage rights. According to a specific embodiment of the present invention, for a user accessing in trial mode, the system offers trial detection of only a predetermined number of characters, and the predetermined number is set in advance by the system. According to another embodiment of the present invention, for a user accessing in trial mode, the system provides the trial user with a database of partial or full scope for trial detection. According to another embodiment of the present invention, for a user accessing in trial mode, the plagiarism detection result that the system provides to the trial user gives only the plagiarism rate and does not give the specific plagiarized positions or a comparison with the plagiarized documents. According to another embodiment of the present invention, for a user accessing in trial mode, the plagiarism detection result that the system provides to the trial user gives the specific plagiarized positions, but the comparison with the plagiarized documents is obfuscated, so that the trial user can learn only the specific plagiarized positions within the document the user submitted and cannot identify the specific information of the plagiarized documents.
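The two trial-mode result variants (plagiarism rate only, or positions with the source comparison obfuscated) might be modeled as below. The `Match` record, its field names, the `show_positions` switch and the use of asterisks for obfuscation are all hypothetical choices made for this sketch; the text does not specify how the obfuscation is performed.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Match:
    start: int           # offset of the matched span in the submitted document
    end: int
    source_excerpt: str  # matching passage in the plagiarized document

def trial_result(rate: float, matches: List[Match], show_positions: bool) -> dict:
    """Build the restricted result shown to a trial user."""
    result = {"plagiarism_rate": rate}
    if show_positions:
        # Positions are revealed; the source passage is masked.
        result["matches"] = [
            {"start": m.start, "end": m.end,
             "source_excerpt": "*" * len(m.source_excerpt)}
            for m in matches
        ]
    return result

found = [Match(10, 25, "copied passage")]
rate_only = trial_result(0.3, found, show_positions=False)
with_positions = trial_result(0.3, found, show_positions=True)
```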
According to a specific embodiment of the present invention, a user may access the system in counting mode; a user accessing in counting mode is hereinafter called a counting user. When the user access mode detection module detects that a user is accessing in counting mode, it sends the counting user a prompt stating that the current access mode is counting mode and asking the counting user to upload the document to be checked for plagiarism. According to a specific embodiment of the present invention, the system counts the characters in the document uploaded by the counting user and calculates the fee for this plagiarism detection from that character count. According to another embodiment of the present invention, the system offers the counting user part or all of the database range to choose from, and calculates the fee for this plagiarism detection according to the database range the counting user selects.
According to a specific embodiment of the present invention, a user may access the system in timing mode; a user accessing in timing mode is hereinafter called a timing user. When the user access mode detection module detects that a user is accessing in timing mode, it sends the timing user a prompt stating that the current access mode is timing mode and showing the remaining usage time. According to another embodiment of the present invention, the system shows the timing user a real-time countdown of the remaining usage time in the display interface during use. According to another embodiment of the present invention, the system offers the timing user part or all of the database range to choose from. According to a specific embodiment of the present invention, the system estimates the detection time required for a document from the database range the timing user selects and the number of characters in the uploaded document, and tells the timing user whether the remaining usage time is enough to complete the current plagiarism detection.
According to a specific embodiment of the present invention, after a timing user logs in to the system, the user detection mode determination module determines the plagiarism detection mode. According to a specific embodiment of the present invention, the system offers the following modes for selection: a self-check mode, an ordinary plagiarism identification mode, an extended plagiarism identification mode, a multilingual plagiarism identification mode, and a formula plagiarism identification mode.
According to a specific embodiment of the present invention, when the user detection mode determination module determines that the current user's detection mode is the self-check mode, the user writing style test module presents the user with one or more test pictures, and the user writes an online text description of each test picture, of no fewer than a prescribed number of characters, within a prescribed time. Preferably, the user writing style test module further presents the user with one or more test articles, and the user writes an online commentary on each, of no fewer than the prescribed number of characters, within the prescribed time. The test pictures and test articles are selected at random by the user writing style test module from a test picture library and a test article library. Whether test pictures or test articles are used, the user must write the description or commentary online, so the prescribed time cannot be set too long; it is generally 30 minutes or 60 minutes, with a corresponding prescribed length of 400 characters per 30 minutes or 800 characters per 60 minutes. Those skilled in the art may set other prescribed times or lengths as required. Experimental data show that the prescribed time should not be too long, so that users do not fail the test for lack of time or because of an unstable network connection; likewise, the ratio of prescribed length to prescribed time should not be too low, or the test will not faithfully reflect the user's writing habits. Because the prescribed time is limited, the description or commentary is short, and feature values and feature vectors extracted from the online test text alone may not truly reflect the user's writing habits. A test picture description reference feature vector and a test article description reference feature vector are therefore also extracted, and are used to correct the deviation in the feature vector caused by the short length of the description or commentary.
According to a specific embodiment of the present invention, every test picture in the test picture library has a test picture reference feature vector. This vector is obtained as follows: a predetermined number of benchmark testers are selected at random from populations with different backgrounds; each writes a description of the specific test picture of no fewer than the prescribed number of characters; all descriptions are collected; the test picture text description feature values are computed for that test picture; feature vectors are calculated from those feature values; and the feature vectors are combined by a weighted operation to give the test picture reference feature vector of that test picture. The weights in the weighted operation are set by the system. Every test article in the test article library likewise has a test article reference feature vector, obtained in the same way: a predetermined number of benchmark testers are selected at random from populations with different backgrounds; each writes a description of the specific test article of no fewer than the prescribed number of characters; all descriptions are collected; the test article text description feature values are computed for that test article; feature vectors are calculated from those feature values; and the feature vectors are combined by a weighted operation to give the test article reference feature vector of that test article. The weights in the weighted operation are set by the system.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers is selected at random from populations with different backgrounds, the selection may be made by age level, preferably divided into an under-20 group, a 20-29 group, a 30-39 group, a 40-49 group, and a 50-and-over group. Descriptions of the same test picture or test article, each of no fewer than the prescribed number of characters, are thereby collected from every age group.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers is selected at random from populations with different backgrounds, the selection may be made by educational level, preferably divided into a below-undergraduate group, an undergraduate group, a master's candidate group, and a doctoral candidate group. Descriptions of the same test picture or test article, each of no fewer than the prescribed number of characters, are thereby collected from every educational group.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers is selected at random from populations with different backgrounds, the selection may be made by professional field (the fields may be divided according to the required test accuracy and are not detailed here). Descriptions of the same test picture or test article, each of no fewer than the prescribed number of characters, are thereby collected from every professional field group.
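The grouped random selection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `age_group` and `sample_per_group`, the fixed per-group quota, the seed, and the exact group boundaries (under 20, decades up to 49, 50 and over) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def age_group(age):
    """Map an age to one of the five age groups (boundaries are an assumption)."""
    if age < 20:
        return "under 20"
    if age >= 50:
        return "50 and over"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def sample_per_group(people, group_of, per_group, seed=0):
    """Randomly pick up to `per_group` benchmark testers from each group."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    groups = defaultdict(list)
    for person in people:
        groups[group_of(person)].append(person)
    return {
        name: rng.sample(members, min(per_group, len(members)))
        for name, members in groups.items()
    }


# Hypothetical tester records for illustration.
people = [{"id": i, "age": a}
          for i, a in enumerate([18, 19, 22, 25, 27, 34, 41, 45, 52, 60])]
chosen = sample_per_group(people, lambda p: age_group(p["age"]), per_group=2)
```

The same `sample_per_group` helper could be reused with a different `group_of` function for educational level or professional field.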
According to a specific embodiment of the present invention, the test picture text description feature value generation module obtains the benchmark testers' test picture description texts and generates the test picture text description feature values. These feature values include but are not limited to: Chinese character count, foreign-language character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage, and part-of-speech usage. According to a specific embodiment of the present invention, these are defined as follows. The Chinese character count is the number of Chinese characters in each test picture description, excluding punctuation, with each Chinese character counted as one character. The foreign-language character count is the number of foreign-language characters in each test picture description, excluding punctuation, with each foreign-language character counted as one character. The total word count is the number of words obtained after segmenting each test picture description; Chinese text may be segmented with the system's own word-segmentation lexicon, and foreign-language text may be segmented according to its writing conventions, directly using the spaces between words. The content word count is the number of content words in each test picture description, obtained after segmentation by comparing the segmentation result with the parts of speech in the word-segmentation lexicon; it may be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count. The function word count is the number of function words in each test picture description, obtained in the same way; it may be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count. The paragraph count is the number of paragraphs in each test picture description. The paragraph length distribution is the number of words and the number of sentences in each paragraph of each test picture description. The sentence count is the number of sentences in each test picture description. The sentence length distribution is the number of words in each sentence of each test picture description. Synonym and near-synonym expansion is obtained by comparing the segmentation result of each test picture description with a synonym and near-synonym library, grouping word segments with the same or similar meaning into sets, and counting the words in each set; it reflects the author's habits in using synonyms and near-synonyms. The more words a synonym or near-synonym set contains, the more the author's writing style tends toward synonym or near-synonym expansion; the fewer it contains, the less the style tends toward such expansion. Function word usage is the statistics of function word use in each test picture description, including but not limited to the frequency ranking of function words, the number of words between different function words, and the number of words between occurrences of the same function word. For example, the use of the three structural particles 的, 地, and 得 may also be counted, to show whether the author distinguishes among them. Punctuation usage is the statistics of punctuation use in each test picture description, including but not limited to the frequency ranking of punctuation marks, the number of words between different punctuation marks, and the number of words between occurrences of the same punctuation mark. Part-of-speech usage is the statistics of word segments of each part of speech in each test picture description, obtained after segmentation by comparing the segmentation result with the parts of speech in the word-segmentation lexicon: for example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections, and onomatopoeic words, and the ratio of each of these numbers to the total word count of the description.
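A few of the count-type feature values listed above can be sketched in code. This is a rough illustration under stated assumptions: the character classes (a common CJK range for Chinese, Latin letters standing in for "foreign-language" characters), the set of sentence terminators, and the name `basic_feature_values` are all hypothetical. Word-level features such as total word count or part-of-speech usage need a segmentation lexicon and are not attempted here.

```python
import re

# Assumed character classes; the patent does not say how characters are classified.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")    # common Chinese characters
LATIN_CHAR = re.compile(r"[A-Za-z]")         # stand-in for foreign-language characters
SENTENCE_END = re.compile(r"[。！？!?.]+")    # assumed sentence terminators


def basic_feature_values(text):
    """Return a few count-type feature values for one description text."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return {
        "chinese_chars": len(CJK_CHAR.findall(text)),
        "foreign_chars": len(LATIN_CHAR.findall(text)),
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        # Characters per sentence: a crude stand-in for the sentence length
        # distribution, which the patent measures in words after segmentation.
        "sentence_char_lengths": [len(s.strip()) for s in sentences],
    }


sample = "我喜欢猫。It is cute!\n第二段。"
values = basic_feature_values(sample)
```

Punctuation is excluded from both character counts simply because neither pattern matches it.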
According to a specific embodiment of the present invention, the test picture text description feature value generation module generates a test picture text description feature vector from the test picture text description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of this feature vector and the content and order of each entry in it. When the dimension of the test picture text description feature vector is n, the vector can be written as TPCVE = [TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the first entry, TPC_m is the m-th entry, and TPC_n is the n-th entry of the test picture text description feature vector.
Preferably, the test picture text description feature vector includes one or more of the following: the ratio of Chinese character count to total word count, the ratio of foreign-language character count to total word count, the ratio of content word count to total word count, the ratio of function word count to total word count, the ratio of total word count to paragraph count, the word count of the longest paragraph, the ratio of synonym and near-synonym expansion count to total word count, the ratio of punctuation mark count to total word count, and the ratios to total word count of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections, and onomatopoeic words.
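As an illustration of how such ratio entries might be assembled from raw counts, the sketch below builds the first six entries named above. The dictionary keys and the function name `description_feature_vector` are hypothetical, and the sample counts are invented purely for the example.

```python
def description_feature_vector(counts):
    """Build the first six entries of a description feature vector from counts.

    `counts` is a dict of raw feature values; the key names are assumptions.
    """
    total = counts["total_words"]
    if total == 0 or counts["paragraphs"] == 0:
        raise ValueError("description must contain at least one word and paragraph")
    return [
        counts["chinese_chars"] / total,    # Chinese character count : total words
        counts["foreign_chars"] / total,    # foreign character count : total words
        counts["content_words"] / total,    # content words : total words
        counts["function_words"] / total,   # function words : total words
        total / counts["paragraphs"],       # total words : paragraph count
        float(counts["longest_paragraph_words"]),  # longest paragraph, in words
    ]


# Invented counts for illustration only.
counts = {
    "chinese_chars": 80, "foreign_chars": 20, "total_words": 50,
    "content_words": 30, "function_words": 20,
    "paragraphs": 5, "longest_paragraph_words": 15,
}
vector = description_feature_vector(counts)
```

The remaining part-of-speech ratios would follow the same pattern, each dividing a per-category count by the total word count.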
According to a specific embodiment of the present invention, the test picture reference feature vector generation module collects the test picture text description feature vectors for the same test picture and combines them by a weighted operation to obtain the reference feature vector of that test picture; the weights used in the weighted operation are set by the system. Preferably, the test picture reference feature vector generation module may collect a predetermined number of test picture text description feature vectors separately for each age group, educational group, and professional field group, and apply the weighted operation to each separately, obtaining a test picture reference feature vector for each age group, each educational group, and each professional field group.
The reference feature vector of a specific test picture can be expressed as:
$$\mathrm{TPCVE\_ID} = \left[\ \sum_{i=1}^{k} \mathrm{TPC\_1}_{i}\, W_{1,i},\ \ldots,\ \sum_{i=1}^{k} \mathrm{TPC\_m}_{i}\, W_{m,i},\ \ldots,\ \sum_{i=1}^{k} \mathrm{TPC\_n}_{i}\, W_{n,i}\ \right]$$
where TPCVE_ID denotes the test picture reference feature vector numbered ID; k is the number of benchmark testers; TPC_1_i, TPC_m_i, and TPC_n_i denote the first, m-th, and n-th entries of the feature vector of the i-th benchmark tester; and W_{1,i}, W_{m,i}, and W_{n,i} are the weight coefficients of TPC_1_i, TPC_m_i, and TPC_n_i respectively.
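The weighted operation in this formula can be sketched directly. In the sketch below, `weights[m][i]` plays the role of W_{m,i}; the function name `reference_feature_vector` and the sample numbers are assumptions for illustration, since the patent only states that the weights are set by the system.

```python
def reference_feature_vector(tester_vectors, weights):
    """Compute entry m as the sum over testers i of vector_i[m] * W[m][i].

    tester_vectors: k feature vectors, each of dimension n.
    weights: n rows of k weight coefficients, weights[m][i] = W_{m,i}.
    """
    k = len(tester_vectors)
    n = len(tester_vectors[0])
    if any(len(v) != n for v in tester_vectors):
        raise ValueError("all tester vectors must share the same dimension")
    if len(weights) != n or any(len(row) != k for row in weights):
        raise ValueError("weights must have shape n x k")
    return [
        sum(tester_vectors[i][m] * weights[m][i] for i in range(k))
        for m in range(n)
    ]


# Two testers (k = 2), two entries (n = 2); numbers invented for illustration.
vectors = [[1.0, 2.0], [3.0, 4.0]]
weights = [[0.5, 0.5], [0.25, 0.75]]
reference = reference_feature_vector(vectors, weights)
```

With equal weights of 1/k in every row, each entry reduces to the plain average over the benchmark testers.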
According to a specific embodiment of the present invention, the test article text description feature value generation module obtains the benchmark testers' test article description texts and generates the test article text description feature values. These feature values include but are not limited to: Chinese character count, foreign-language character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage, and part-of-speech usage. According to a specific embodiment of the present invention, these are defined as follows. The Chinese character count is the number of Chinese characters in each test article description, excluding punctuation, with each Chinese character counted as one character. The foreign-language character count is the number of foreign-language characters in each test article description, excluding punctuation, with each foreign-language character counted as one character. The total word count is the number of words obtained after segmenting each test article description; Chinese text may be segmented with the system's own word-segmentation lexicon, and foreign-language text may be segmented according to its writing conventions, directly using the spaces between words. The content word count is the number of content words in each test article description, obtained after segmentation by comparing the segmentation result with the parts of speech in the word-segmentation lexicon; it may be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count. The function word count is the number of function words in each test article description, obtained in the same way; it may be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count. The paragraph count is the number of paragraphs in each test article description. The paragraph length distribution is the number of words and the number of sentences in each paragraph of each test article description. The sentence count is the number of sentences in each test article description. The sentence length distribution is the number of words in each sentence of each test article description. Synonym and near-synonym expansion is obtained by comparing the segmentation result of each test article description with a synonym and near-synonym library, grouping word segments with the same or similar meaning into sets, and counting the words in each set; it reflects the author's habits in using synonyms and near-synonyms. The more words a synonym or near-synonym set contains, the more the author's writing style tends toward synonym or near-synonym expansion; the fewer it contains, the less the style tends toward such expansion. Function word usage is the statistics of function word use in each test article description, including but not limited to the frequency ranking of function words, the number of words between different function words, and the number of words between occurrences of the same function word. For example, the use of the three structural particles 的, 地, and 得 may also be counted, to show whether the author distinguishes among them. Punctuation usage is the statistics of punctuation use in each test article description, including but not limited to the frequency ranking of punctuation marks, the number of words between different punctuation marks, and the number of words between occurrences of the same punctuation mark. Part-of-speech usage is the statistics of word segments of each part of speech in each test article description, obtained after segmentation by comparing the segmentation result with the parts of speech in the word-segmentation lexicon: for example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections, and onomatopoeic words, and the ratio of each of these numbers to the total word count of the description.
According to a specific embodiment of the present invention, the test article text description feature value generation module generates a test article text description feature vector from the test article text description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of this feature vector and the content and order of each entry in it. When the dimension of the test article text description feature vector is n, the vector can be written as TTCVE = [TTC_1, ..., TTC_m, ..., TTC_n], where TTC_1 is the first entry, TTC_m is the m-th entry, and TTC_n is the n-th entry of the test article text description feature vector.
Preferably, the test article text description feature vector includes one or more of the following: the ratio of Chinese character count to total word count, the ratio of foreign-language character count to total word count, the ratio of content word count to total word count, the ratio of function word count to total word count, the ratio of total word count to paragraph count, the word count of the longest paragraph, the ratio of synonym and near-synonym expansion count to total word count, the ratio of punctuation mark count to total word count, and the ratios to total word count of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections, and onomatopoeic words.
According to a specific embodiment of the present invention, the test article reference feature vector generation module collects the test article text description feature vectors for the same test article and combines them by a weighted operation to obtain the reference feature vector of that test article; the weights used in the weighted operation are set by the system. Preferably, the test article reference feature vector generation module may collect a predetermined number of test article text description feature vectors separately for each age group, educational group, and professional field group, and apply the weighted operation to each separately, obtaining a test article reference feature vector for each age group, each educational group, and each professional field group.
The reference feature vector of a specific test article can be expressed as:
$$\mathrm{TTCVE\_ID} = \left[\ \sum_{i=1}^{k} \mathrm{TTC\_1}_{i}\, W_{1,i},\ \ldots,\ \sum_{i=1}^{k} \mathrm{TTC\_m}_{i}\, W_{m,i},\ \ldots,\ \sum_{i=1}^{k} \mathrm{TTC\_n}_{i}\, W_{n,i}\ \right]$$
where TTCVE_ID denotes the test article reference feature vector numbered ID; k is the number of benchmark testers; TTC_1_i, TTC_m_i, and TTC_n_i denote the first, m-th, and n-th entries of the feature vector of the i-th benchmark tester; and W_{1,i}, W_{m,i}, and W_{n,i} are the weight coefficients of TTC_1_i, TTC_m_i, and TTC_n_i respectively.
According to a specific embodiment of the present invention, the test picture text description feature vector and the test article text description feature vector have the same dimension, and the meaning and order of their feature values are kept consistent. For example, in both vectors the entries may be set as follows: the 1st feature value is the ratio of Chinese character count to total word count; the 2nd is the ratio of foreign-language character count to total word count; the 3rd is the ratio of content word count to total word count; the 4th is the ratio of function word count to total word count; the 5th is the ratio of total word count to paragraph count; the 6th is the word count of the longest paragraph; the 7th is the ratio of synonym and near-synonym expansion count to total word count; the 8th is the ratio of punctuation mark count to total word count; and the 9th through 20th are the ratios to total word count of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections, and onomatopoeic words, in that order.
According to a specific embodiment of the present invention, feature values may be added to or removed from the test picture text description feature vector and the test article text description feature vector, provided that after the change the two vectors still have the same dimension and the meaning and order of their feature values remain consistent.
According to a specific embodiment of the present invention, the user test picture text description feature value generation module obtains the user's test picture description text and generates the user test picture text description feature values; these comprise the same items as the test picture text description feature values and are not repeated here. The user test picture text description feature vector generation module calculates the user test picture text description feature vector from these feature values. When the dimension of the test picture text description feature vector is n, the feature vector of the current user USER's description of the picture numbered ID can be written as TPCVE_ID_USER = [TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER], where TPC_1_USER is the first entry, TPC_m_USER is the m-th entry, and TPC_n_USER is the n-th entry of the current user's test picture text description feature vector.
The user picture writing style feature vector generation module calculates the difference between the user test picture text description feature vector TPCVE_ID_USER and the test picture reference feature vector TPCVE_ID of the corresponding test picture, and uses this difference (TPCVE_ID_USER - TPCVE_ID) as the user picture writing style feature vector TPCVE_USER:
$$\mathrm{TPCVE\_USER} = \left[\ \mathrm{TPC\_1\_USER} - \sum_{i=1}^{k} \mathrm{TPC\_1}_{i}\, W_{1,i},\ \ldots,\ \mathrm{TPC\_m\_USER} - \sum_{i=1}^{k} \mathrm{TPC\_m}_{i}\, W_{m,i},\ \ldots,\ \mathrm{TPC\_n\_USER} - \sum_{i=1}^{k} \mathrm{TPC\_n}_{i}\, W_{n,i}\ \right]$$
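The subtraction above is an entry-by-entry difference between the user's vector and the reference vector for the same test picture. A minimal sketch follows; the function name `style_feature_vector` and the sample values are assumptions for illustration.

```python
def style_feature_vector(user_vector, reference_vector):
    """Return user_vector minus reference_vector, entry by entry."""
    if len(user_vector) != len(reference_vector):
        # Both vectors must use the same dimension and entry order.
        raise ValueError("vectors must have the same dimension")
    return [u - r for u, r in zip(user_vector, reference_vector)]


# Invented values: the user's description vector and the picture's reference vector.
tpcve_id_user = [0.5, 3.0]
tpcve_id = [0.25, 1.0]
tpcve_user = style_feature_vector(tpcve_id_user, tpcve_id)
```

A positive entry means the user exceeds the benchmark testers' weighted value on that feature, and a negative entry means the user falls below it.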
According to a specific embodiment of the present invention, the user test article text description feature value generation module obtains the user's test article description text and generates the user test article text description feature values; these comprise the same items as the test article text description feature values and are not repeated here. The user test article text description feature vector generation module calculates the user test article text description feature vector from these feature values. When the dimension of the test article text description feature vector is n, the feature vector of the current user USER's description of the article numbered ID can be written as TTCVE_ID_USER = [TTC_1_USER, ..., TTC_m_USER, ..., TTC_n_USER], where TTC_1_USER is the first entry, TTC_m_USER is the m-th entry, and TTC_n_USER is the n-th entry of the current user's test article text description feature vector.
The user article writing style feature vector generation module calculates the difference between the user test article text-description feature vector TTCVE_ID_USER and the test article reference feature vector TTCVE_ID corresponding to that test article, and uses this difference (TTCVE_ID_USER-TTCVE_ID) as the user article writing style feature vector TTCVE_USER.
$$\mathrm{TTCVE\_USER} = \Big[\mathrm{TTC\_1\_USER} - \sum_{i=1}^{k} \mathrm{TTC\_1}^{i} \cdot W_{1,i},\ \ldots,\ \mathrm{TTC\_m\_USER} - \sum_{i=1}^{k} \mathrm{TTC\_m}^{i} \cdot W_{m,i},\ \ldots,\ \mathrm{TTC\_n\_USER} - \sum_{i=1}^{k} \mathrm{TTC\_n}^{i} \cdot W_{n,i}\Big]$$
According to a specific embodiment of the present invention, when several test pictures or several test articles are used, or when one or more test pictures and one or more test articles are used together, the user test picture text-description feature value generation module and the user test article text-description feature value generation module generate the user test picture and/or article text-description feature values from each user test picture description text and each test article description text, respectively. The user test picture text-description feature vector generation module and the user test article text-description feature vector generation module then generate the user test picture and/or article text-description feature vectors from these feature values. The user picture writing style feature vector generation module and the user article writing style feature vector generation module calculate the difference between each user test picture and/or article text-description feature vector and the corresponding test picture and/or article reference feature vector. The differences are combined by a weighted operation to obtain the user's picture writing style feature vector TPCVE_USER and article writing style feature vector TTCVE_USER, respectively. The user writing style feature vector generation module then applies a weighted operation to TPCVE_USER and TTCVE_USER to obtain the user writing style feature vector TVE_USER. The weights of these weighted operations can be chosen according to actual needs.
$$\mathrm{TVE\_USER} = \mathrm{TPCVE\_USER} \cdot W_P + \mathrm{TTCVE\_USER} \cdot W_T$$
Here, $W_P$ is the weight coefficient of the user picture writing style feature vector TPCVE_USER, and $W_T$ is the weight coefficient of the user article writing style feature vector TTCVE_USER. When the user takes only the picture writing test or only the article writing test, the weight coefficient of the test taken can be set to 1 and the weight coefficient of the test not taken can be set to 0. Preferably, the two weights can be chosen to be equal.
The user writing style feature vector can be expressed as TVE_USER=[TVE_1, ..., TVE_m, ..., TVE_n], where TVE_1 is the first entry, TVE_m is the m-th entry, and TVE_n is the n-th entry of the user writing style feature vector.
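As a rough illustration of the weighted combination, the sketch below forms TVE_USER entry by entry from the picture and article style vectors. The function name and the default weights of 0.5 are assumptions made for the example; the source only says that the weights may be chosen as needed, are preferably equal, and become 1 and 0 when only one kind of test was taken.

```python
def combine_style_vectors(picture_vec, article_vec, w_p=0.5, w_t=0.5):
    """Entry-wise weighted sum TVE_USER = TPCVE_USER * W_P + TTCVE_USER * W_T.

    The default weights (0.5 each) are an illustrative choice, not a value
    given in the source.
    """
    if len(picture_vec) != len(article_vec):
        raise ValueError("both style vectors must have the same dimension")
    return [p * w_p + t * w_t for p, t in zip(picture_vec, article_vec)]


# Both tests taken, equal weights.
tve_user = combine_style_vectors([0.2, -0.1], [0.4, 0.3])

# Only the picture test taken: weight 1 for pictures, 0 for articles.
tve_picture_only = combine_style_vectors([0.2, -0.1], [0.0, 0.0], w_p=1.0, w_t=0.0)
```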
According to a specific embodiment of the present invention, the user detection mode judgment module further prompts the user to upload a pending document, and the pending document feature value generation module generates the pending document feature values of that document. The pending document feature values include, but are not limited to: Chinese character count, foreign-language word count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage, and part-of-speech usage.
According to a specific embodiment of the present invention, the Chinese character count is the number of Chinese characters in each pending document, excluding punctuation, with each Chinese character counted as one character. The foreign-language word count is the number of foreign-language units in each pending document, excluding punctuation, with each foreign-language word counted as one character. The total word count is the number of words obtained after word segmentation of each pending document; Chinese text can be segmented with the participle library built into the system, and foreign-language text can be segmented according to foreign-language writing conventions by directly using the spaces between words. The content word count is the number of content words in each pending document, obtained after segmentation by comparing the segmentation result with the parts of speech in the participle library; it can be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count. The function word count is the number of function words in each pending document, obtained in the same way; it can be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count. The paragraph count is the number of paragraphs in each pending document. The paragraph length distribution is the number of words and the number of sentences in each paragraph of each pending document. The sentence count is the number of sentences in each pending document. The sentence length distribution is the number of words in each sentence of each pending document.
Synonym and near-synonym expansion is obtained by comparing the segmentation result of each pending document with a synonym and near-synonym library, collecting participles with the same or similar meaning into sets, and counting the words in each set. This reflects the author's habits in using synonyms and near-synonyms: the more words a synonym or near-synonym set contains, the more the author's writing style tends toward synonym or near-synonym expansion; the fewer words it contains, the less the author's writing style tends toward such expansion. Function word usage is the statistics of function word use in each pending document, including but not limited to the frequency ranking of the function words used, the number of words between different function words, and the number of words between occurrences of the same function word. For example, the use of the three structural auxiliary words 的, 地 and 得 can also be counted, reflecting whether the author of the pending document distinguishes among them. Punctuation usage is the statistics of punctuation use in each pending document, including but not limited to the frequency ranking of the punctuation marks used, the number of words between different punctuation marks, and the number of words between occurrences of the same punctuation mark. Part-of-speech usage is the statistics of participles of each part of speech in each pending document, obtained after segmentation by comparing the segmentation result with the parts of speech in the participle library: for example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words, and the ratio of each count to the total word count of the pending document.
According to a specific embodiment of the present invention, the pending document feature vector generation module generates the pending document feature vector from the pending document feature values. The system specifies the dimension of the pending document feature vector and the content and order of its entries. The dimension, content and order must remain consistent with the dimension of the test picture reference feature vector and the test article reference feature vector and with the meaning and order of their feature values. When the dimension of the pending document feature vector is n, it can be expressed as TDCVE_USER=[TDC_1, ..., TDC_m, ..., TDC_n], where TDC_1 is the first entry, TDC_m is the m-th entry, and TDC_n is the n-th entry of the pending document feature vector.
Preferably, the feature vector of the pending document includes: the ratio of the Chinese character count to the total word count; the ratio of the foreign-language word count to the total word count; the ratio of the content word count to the total word count; the ratio of the function word count to the total word count; the ratio of the total word count to the paragraph count; the word count of the longest paragraph; the ratio of the synonym and near-synonym expansion count to the total word count; the ratio of the punctuation count to the total word count; and the ratios to the total word count of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words.
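To make the structure of such a vector concrete, the sketch below builds its first few entries from raw counts. The dictionary keys, the function name and the sample numbers are all hypothetical; only the choice of ratios follows the list above, and a full vector would continue with the remaining ratios in the order fixed by the system.

```python
def ratio_feature_vector(counts):
    """Build the leading entries of a document feature vector from raw counts.

    `counts` is a plain dict with hypothetical keys; the entry order here is
    an illustrative choice that must simply be the same for every document.
    """
    total = counts["total_words"]
    if total == 0 or counts["paragraphs"] == 0:
        raise ValueError("document must contain at least one word and one paragraph")
    per_total = ["chinese_chars", "foreign_words", "content_words", "function_words"]
    vector = [counts[key] / total for key in per_total]
    vector.append(total / counts["paragraphs"])        # average words per paragraph
    vector.append(counts["longest_paragraph_words"])   # longest paragraph, in words
    return vector


example_counts = {
    "total_words": 200,
    "chinese_chars": 300,
    "foreign_words": 20,
    "content_words": 150,
    "function_words": 50,
    "paragraphs": 4,
    "longest_paragraph_words": 80,
}
features = ratio_feature_vector(example_counts)
```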
The user writing style similarity calculation module calculates the current user's writing style similarity, which can be computed by the following formula:
$$\mathrm{Sim}_T(\mathrm{USER}) = \mathrm{DIST}(\mathrm{TDCVE\_USER} - \mathrm{TVE\_USER}) = (\mathrm{TDC\_1} - \mathrm{TVE\_1})^2 + \ldots + (\mathrm{TDC\_m} - \mathrm{TVE\_m})^2 + \ldots + (\mathrm{TDC\_n} - \mathrm{TVE\_n})^2$$
The user writing style similarity judgment module compares the current user's writing style similarity $\mathrm{Sim}_T(\mathrm{USER})$ with the self-review threshold preset by the system. Because $\mathrm{Sim}_T(\mathrm{USER})$ is computed as a distance, a larger value indicates a larger deviation from the user's style. When $\mathrm{Sim}_T(\mathrm{USER})$ is higher than the self-review threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when $\mathrm{Sim}_T(\mathrm{USER})$ is lower than the self-review threshold, the pending document submitted by the current user is considered consistent with the user's writing style.
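The comparison can be sketched as below. The sketch follows the formula as it appears in the text, a sum of squared entry-wise differences; if the intended DIST is the Euclidean distance, a square root would be taken and the threshold rescaled accordingly. The function names are hypothetical, and treating a value exactly equal to the threshold as consistent is an assumption, since the text only covers the "higher" and "lower" cases.

```python
def style_distance(doc_vec, style_vec):
    """Sum of squared differences between document and user style vectors."""
    if len(doc_vec) != len(style_vec):
        raise ValueError("vectors must have the same dimension")
    return sum((d - s) ** 2 for d, s in zip(doc_vec, style_vec))


def is_consistent(doc_vec, style_vec, threshold):
    """True when the distance does not exceed the self-review threshold.

    Equality is treated as consistent here; that tie-break is an assumption.
    """
    return not style_distance(doc_vec, style_vec) > threshold


distance = style_distance([1, 2, 3], [1, 4, 3])
```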
The self-review threshold is set in advance by the system. If the self-review threshold is set too high, pending documents that are inconsistent with the user's writing style are easily misjudged; if it is set too low, pending documents that are consistent with the user's writing style are easily misjudged. In general, the self-review threshold is selected and verified by the system in advance through experiments, and the system can adjust it at any time according to operating conditions.
According to a specific embodiment of the present invention, a first self-review threshold and a second self-review threshold can be set, the first being higher than the second. When the user writing style similarity $\mathrm{Sim}_T(\mathrm{USER})$ is higher than the first self-review threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style. When $\mathrm{Sim}_T(\mathrm{USER})$ is lower than the second self-review threshold, the pending document submitted by the current user is considered consistent with the user's writing style. When $\mathrm{Sim}_T(\mathrm{USER})$ is greater than or equal to the second self-review threshold and less than or equal to the first self-review threshold, the user's writing style is verified further.
The first self-review threshold and the second self-review threshold are set in advance by the system. If the first self-review threshold is set too high, pending documents that are inconsistent with the user's writing style are easily misjudged; if the second self-review threshold is set too low, pending documents that are consistent with the user's writing style are easily misjudged; and if the interval between the two thresholds is set too large, further verification of the user's writing style is triggered too often. In general, both thresholds are selected and verified by the system in advance through experiments, and the system can adjust them at any time according to operating conditions.
According to a specific embodiment of the present invention, the further verification of the user's writing style is performed by the user writing style structural auxiliary word judgment module. This module determines how the three structural auxiliary words 的, 地 and 得 are used in the pending document and in the user test picture description text and/or user test article description text, reflecting the degree to which the author of the pending document and the current user distinguish among these three words. To determine their usage in the pending document, the module counts the number of uses of 的, 地 and 得 in the full text of the pending document, denoted $T_1$, $T_2$ and $T_3$ respectively. It further counts, in the full text of the pending document, the number of times the participle following 的 is a noun, denoted $D_1$; the number of times the participle following 地 is a verb, denoted $D_2$; and the number of times the participle following 得 is an adjective, denoted $D_3$. It then calculates the ratios $D_1/T_1$, $D_2/T_2$ and $D_3/T_3$ of each of these counts to the total number of uses of the corresponding word in the full text, and from them the differentiation coefficient DC_TD of 的, 地 and 得. The value of DC_TD is greater than or equal to 0 and less than or equal to 3.
$$\mathrm{DC\_TD} = \sum_{i=1}^{3} \frac{D_i}{T_i}$$
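A differentiation coefficient of this kind can be sketched as follows, taking the three structural auxiliary words to be 的, 地 and 得. The input is assumed to be a list of (word, part-of-speech) pairs from some segmenter; the tag names "n", "v" and "a" and the function name are hypothetical. The text does not say what happens when a word never occurs ($T_i = 0$), so the sketch simply lets that word contribute 0, which is an assumption.

```python
# Expected part of speech of the participle that follows each auxiliary word.
# The tag names "n" (noun), "v" (verb), "a" (adjective) are illustrative.
EXPECTED_NEXT = {"的": "n", "地": "v", "得": "a"}


def differentiation_coefficient(tagged_tokens):
    """Sum of D_i / T_i over the three structural auxiliary words.

    T_i: total uses of the word; D_i: uses followed by the expected part of speech.
    A word that never occurs contributes 0 (an assumption, not from the source).
    """
    totals = {word: 0 for word in EXPECTED_NEXT}
    hits = {word: 0 for word in EXPECTED_NEXT}
    for idx, (word, _pos) in enumerate(tagged_tokens):
        if word not in EXPECTED_NEXT:
            continue
        totals[word] += 1
        has_next = idx + 1 < len(tagged_tokens)
        if has_next and tagged_tokens[idx + 1][1] == EXPECTED_NEXT[word]:
            hits[word] += 1
    return sum(hits[w] / totals[w] for w in EXPECTED_NEXT if totals[w])


sample = [
    ("我", "r"), ("的", "u"), ("书", "n"),
    ("慢慢", "d"), ("地", "u"), ("走", "v"),
    ("跑", "v"), ("得", "u"), ("快", "a"),
    ("的", "u"), ("跑", "v"),
]
dc = differentiation_coefficient(sample)
```

In the sample, 的 is followed by a noun in one of its two uses, while 地 and 得 are each followed by the expected part of speech, so the result stays within the stated range of 0 to 3.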
To determine the usage of the three structural auxiliary words 的, 地 and 得 in the user test picture description text and/or user test article description text, the module counts the number of uses of 的, 地 and 得 in the full text of the user test picture description text and/or user test article description text (if the user was tested on several pictures and/or several articles, all of the description texts are merged and treated as the full text), denoted $T_1'$, $T_2'$ and $T_3'$ respectively. It further counts, in that full text, the number of times the participle following 的 is a noun, denoted $D_1'$; the number of times the participle following 地 is a verb, denoted $D_2'$; and the number of times the participle following 得 is an adjective, denoted $D_3'$. It then calculates the ratios $D_1'/T_1'$, $D_2'/T_2'$ and $D_3'/T_3'$ of each of these counts to the total number of uses of the corresponding word in the full text, and from them the differentiation coefficient DC_TPT of 的, 地 and 得. The value of DC_TPT is greater than or equal to 0 and less than or equal to 3.
$$\mathrm{DC\_TPT} = \sum_{i=1}^{3} \frac{D_i'}{T_i'}$$
The user writing style structural auxiliary word judgment module calculates the drift rate DC_SC between the differentiation coefficient DC_TD and the differentiation coefficient DC_TPT, that is, it normalizes the absolute value of the difference between DC_TD and DC_TPT.
$$\mathrm{DC\_SC} = \frac{|\mathrm{DC\_TD} - \mathrm{DC\_TPT}|}{3} \times 100\%$$
When the value of DC_SC is less than or equal to the decision threshold of the drift rate DC_SC, the user writing style structural auxiliary word judgment module judges that the author of the pending document and the user who wrote the test picture description text and/or test article description text have a consistent style in their use of the three structural auxiliary words 的, 地 and 得. When the value of DC_SC is greater than the decision threshold, the module judges that their styles in using these three words are inconsistent. The decision threshold of the drift rate DC_SC is configured in advance by the system and can be adjusted at any time according to actual needs. According to experimental data from the system's early operation, a DC_SC value less than or equal to 10% reflects well that the author of the pending document and the user who wrote the test picture description text and/or test article description text have a consistent style in using 的, 地 and 得; when the value of DC_SC is greater than 10%, their styles in using these three words can be considered inconsistent.
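The normalization and threshold test can be sketched in a few lines. The function names are hypothetical; the divisor 3 and the default threshold of 10% are the values mentioned in the text, with the threshold described there as adjustable.

```python
def drift_rate(dc_td, dc_tpt):
    """DC_SC as a percentage: |DC_TD - DC_TPT| / 3 * 100."""
    return abs(dc_td - dc_tpt) / 3 * 100


def auxiliary_style_consistent(dc_td, dc_tpt, threshold_pct=10.0):
    """True when DC_SC is less than or equal to the decision threshold."""
    return drift_rate(dc_td, dc_tpt) <= threshold_pct


worst_case = drift_rate(3, 0)  # the two coefficients are as far apart as possible
```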
The user writing style judgment module is used when the user writing style similarity $\mathrm{Sim}_T(\mathrm{USER})$ is greater than or equal to the second self-review threshold and less than or equal to the first self-review threshold. In that case it further judges, by means of the drift rate DC_SC, whether the pending document submitted by the current user is consistent with the user's writing style. When DC_SC is greater than its decision threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when DC_SC is less than or equal to its decision threshold, the pending document submitted by the current user is considered consistent with the user's writing style.
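Put together, the two-threshold decision with the drift-rate fallback might look like the sketch below. The function name, parameter names and the numeric thresholds in the usage lines are hypothetical; only the branching order follows the text.

```python
def judge_document(sim, first_threshold, second_threshold, dc_sc, dc_sc_threshold=10.0):
    """Return True if the pending document is judged consistent with the user's style.

    sim: writing style similarity Sim_T(USER), a distance (larger means less alike).
    first_threshold > second_threshold: the two self-review thresholds.
    dc_sc: drift rate in percent, consulted only in the band between the thresholds.
    """
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be higher than the second")
    if sim > first_threshold:
        return False   # clearly inconsistent
    if sim < second_threshold:
        return True    # clearly consistent
    # In-between band: fall back to the structural auxiliary word check.
    return dc_sc <= dc_sc_threshold


# Illustrative thresholds only.
clear_reject = judge_document(5.0, 4.0, 1.0, dc_sc=0.0)
clear_accept = judge_document(0.5, 4.0, 1.0, dc_sc=50.0)
band_accept = judge_document(2.0, 4.0, 1.0, dc_sc=8.0)
band_reject = judge_document(2.0, 4.0, 1.0, dc_sc=12.0)
```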
According to a specific embodiment of the present invention, the user access mode detection module prompts the user to upload the document to be identified.
When the user detection mode judgment module determines that the current user's detection mode is the common plagiarism identification mode, the to-be-identified document word segmentation module performs word segmentation on the document to be identified and obtains the segmentation result. The document to be identified must be segmented with the same processing flow that is used to segment the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document participle part-of-speech classification module further obtains the part of speech corresponding to each participle in the segmentation result. The part-of-speech classification scheme is consistent with the classification scheme used for the material included in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document participle eigenvalue generation module generates the participle eigenvalues of the document to be identified. It counts the number of times each participle occurs in the document to be identified and obtains the participle eigenvalue corresponding to each participle, WCV_TBI=[W_ID, W_N], where W_ID is the unique number of the participle in the participle library and W_N is the total number of times the participle occurs in the document to be identified. Preferably, the part of speech of each participle is also taken into account, giving the participle part-of-speech eigenvalue WCCV_TBI=[W_ID, W_N, W_CHAR], where W_ID is the unique number of the participle in the participle library, W_N is the total number of times the participle occurs in the document to be identified, and W_CHAR is the part of speech of the participle.
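The counting step can be illustrated with a short sketch that produces one [W_ID, W_N, W_CHAR] triple per distinct participle. The input format (a list of (word, part-of-speech) pairs), the `library_ids` lookup table and the function name are hypothetical, and the sketch assumes each participle carries a single part of speech, taken from its first occurrence.

```python
from collections import Counter


def participle_eigenvalues(tagged_tokens, library_ids):
    """Return [W_ID, W_N, W_CHAR] for every distinct participle.

    tagged_tokens: list of (word, part_of_speech) pairs in document order.
    library_ids: hypothetical mapping from word to its unique library number.
    """
    counts = Counter(word for word, _pos in tagged_tokens)
    first_pos = {}
    for word, pos in tagged_tokens:
        first_pos.setdefault(word, pos)  # keep the tag seen at first occurrence
    return [[library_ids[word], n, first_pos[word]] for word, n in counts.items()]


tokens = [("data", "n"), ("window", "n"), ("data", "n")]
ids = {"data": 7, "window": 12}
wccv = participle_eigenvalues(tokens, ids)
```

Dropping the last element of each triple gives the shorter [W_ID, W_N] form.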
According to a specific embodiment of the present invention, the to-be-identified document participle tightening coefficient generation module generates the participle tightening coefficients of the document to be identified. The participle tightening coefficients corresponding to each participle can be expressed as WGC_TBI=[G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of participles between the first and second occurrences of this participle in the document to be identified, G_W_ID_2 is the number of participles between its second and third occurrences, and G_W_ID_(W_N-1) is the number of participles between its (W_N-1)-th and W_N-th occurrences. G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the participle tightening coefficients corresponding to this participle.
According to a specific embodiment of the present invention, the participle tightening coefficients corresponding to each participle can further be expressed in vector form as the participle tightening coefficient feature vector WGCVE_TBI=[W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the participle in the participle library, W_N is the total number of times the participle occurs in the document to be identified, W_CHAR is the part of speech of the participle, and G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the tightening coefficients defined above. The tightening coefficients reveal the overall distribution of a specific participle in the document to be identified. Thus, when the document to be identified is long overall or its viewpoints are scattered, they help avoid omitting key participle eigenvalues when participle feature vectors are screened by the total count W_N or by (W_N / participle free vector dimension WFV). Preferably, a specific part of a document to be identified can also be extracted for comparison according to the tightening coefficients.
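As a small illustration, the gaps between successive occurrences of one participle can be computed as below. The sketch reads "the number of participles in between" as the count of tokens strictly between two occurrences, which is an interpretation of the text; the function name is hypothetical.

```python
def tightening_coefficients(tokens, target):
    """Gaps between consecutive occurrences of `target` in `tokens`.

    Each gap counts the tokens strictly between two successive occurrences,
    so a participle occurring W_N times yields W_N - 1 coefficients.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


gaps = tightening_coefficients(["a", "b", "a", "c", "d", "a"], "a")
```

Small gaps indicate that the participle is clustered in one part of the document, while large gaps indicate that it is spread throughout.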
According to a specific embodiment of the present invention, the to-be-identified document participle free vector dimension determination module determines the participle free vector dimension WFV_TBI from the segmentation result of the document to be identified. When the document to be identified is short or its segmentation result contains few participles, the resulting WFV_TBI is smaller; when the document to be identified is long or its segmentation result contains many participles, the resulting WFV_TBI is larger.
When the user detection mode judgment module determines that the current user's detection mode is the extended plagiarism identification mode, the to-be-identified document participle group module performs word segmentation on the document to be identified and obtains the participle group result. Participles with the same or similar meaning form one group, and numbering is done per group, so that several participles with the same or similar meaning correspond to one participle group number. The document to be identified must be segmented with the same processing flow that is used to segment the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document participle group part-of-speech classification module further obtains the part of speech corresponding to each group in the participle group result. The participle group part-of-speech classification scheme is consistent with the participle group classification scheme used for the material included in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document participle group eigenvalue generation module generates the participle group eigenvalues of the document to be identified. It counts the number of times each participle group occurs in the document to be identified and obtains the eigenvalue corresponding to each participle group, WGCV_TBI=[WG_ID, WG_N], where WG_ID is the unique number of the participle group in the participle library and WG_N is the total number of times the participle group occurs in the document to be identified. Preferably, the part of speech of each participle group is also taken into account, giving the participle group part-of-speech eigenvalue WGCCV_TBI=[WG_ID, WG_N, WG_CHAR], where WG_ID is the unique number of the participle group in the participle library, WG_N is the total number of times the participle group occurs in the document to be identified, and WG_CHAR is the part of speech of the participle group.
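Group-level counting differs from word-level counting only in that each participle is first mapped to its group number. In the sketch below, the `group_of` table, the English sample words and the function name are hypothetical, and participles missing from the table are simply skipped, which is an assumption the text does not address.

```python
from collections import Counter


def participle_group_eigenvalues(tokens, group_of):
    """Return [WG_ID, WG_N] for every participle group found in `tokens`.

    group_of: hypothetical mapping from a participle to the number of the
    synonym or near-synonym group it belongs to.
    """
    counts = Counter(group_of[tok] for tok in tokens if tok in group_of)
    return [[group_id, n] for group_id, n in sorted(counts.items())]


synonym_groups = {"big": 1, "large": 1, "huge": 1, "small": 2}
wgcv = participle_group_eigenvalues(["big", "large", "small", "huge", "the"], synonym_groups)
```

Because "big", "large" and "huge" share one group number, a text that swaps one for another still produces the same group count.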
According to a specific embodiment of the present invention, the to-be-identified document participle group tightening coefficient generation module generates the participle group tightening coefficients of the document to be identified. The tightening coefficients corresponding to each participle group can be expressed as WGGC_TBI=[G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 is the number of participles between the first and second occurrences of this participle group in the document to be identified, G_WG_ID_2 is the number of participles between its second and third occurrences, and G_WG_ID_(WG_N-1) is the number of participles between its (WG_N-1)-th and WG_N-th occurrences. G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the participle group tightening coefficients corresponding to this participle group.
According to a specific embodiment of the present invention, the participle group tightening coefficients corresponding to each participle group can further be expressed in vector form as the participle group tightening coefficient feature vector WGGCVE_TBI=[WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID is the unique number of the participle group in the participle library, WG_N is the total number of times the participle group occurs in the document to be identified, WG_CHAR is the part of speech of the participle group, and G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the tightening coefficients defined above. The tightening coefficients reveal the overall distribution of a specific participle group in the document to be identified. Thus, when the document to be identified is long overall or its viewpoints are scattered, they help avoid omitting key participle group eigenvalues when participle group feature vectors are screened by the total count WG_N or by (WG_N / participle group free vector dimension WGFV). Preferably, a specific part of a document to be identified can also be extracted for comparison according to the tightening coefficients.
According to a specific embodiment of the present invention, the to-be-identified document participle group free vector dimension determination module determines the participle group free vector dimension WGFV_TBI from the segmentation result of the document to be identified. When the document to be identified is short or its segmentation result contains few participles, the resulting WGFV_TBI is smaller; when the document to be identified is long or its segmentation result contains many participles, the resulting WGFV_TBI is larger.
When the user detection mode decision module judges that the current user detection mode is the multilingual plagiarism identification mode, the middle foreign language participle group module of the document to be identified is used for segmenting the document to be identified to obtain middle foreign language participle group results, wherein middle foreign language participles with the same or similar meaning constitute one group and numbering is performed in units of groups. Multiple middle foreign language participles with the same or similar meaning therefore correspond to one middle foreign language participle group number. When the document to be identified is segmented, the same processing flow as used for segmenting the material of the comparison database must be adopted.
According to a specific embodiment of the present invention, the participle group part-of-speech classification module of the document to be identified is used for further obtaining the part of speech corresponding to each participle group result. The part-of-speech classification of participle groups is consistent with the classification used for the material included in the comparison database.
According to a specific embodiment of the present invention, the middle foreign language participle group eigenvalue generation module of the document to be identified is used for generating the middle foreign language participle group eigenvalues of the document to be identified. The number of times each middle foreign language participle group occurs in the corresponding document to be identified is counted, giving the eigenvalue corresponding to each middle foreign language participle group, WFGCV_TBI=[WFG_ID, WFG_N], wherein WFG_ID represents the unique number of this middle foreign language participle group in the participle library and WFG_N represents the total number of times this middle foreign language participle group occurs in the document to be identified. Preferably, the part of speech of each middle foreign language participle group is also considered, giving the middle foreign language participle group part-of-speech eigenvalue WFGCCV_TBI=[WFG_ID, WFG_N, WFG_CHAR], wherein WFG_ID represents the unique number of this middle foreign language participle group in the participle library, WFG_N represents the total number of times this middle foreign language participle group occurs in the document to be identified, and WFG_CHAR represents the part of speech of this middle foreign language participle group.
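As a rough illustration of the grouping and counting just described, the following Python sketch maps words of the same meaning in different languages to one group number and builds [WFG_ID, WFG_N, WFG_CHAR] entries. The lexicon `GROUP_LEXICON`, its group numbers, its part-of-speech labels, and the function name `group_eigenvalues` are all invented for illustration; the source does not specify how the participle library is stored.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical participle library: word -> (group number, part of speech).
# Words with the same meaning in different languages share one group number.
GROUP_LEXICON: Dict[str, Tuple[int, str]] = {
    "computer": (101, "noun"),
    "计算机": (101, "noun"),
    "电脑": (101, "noun"),
    "fast": (202, "adjective"),
    "快速": (202, "adjective"),
}


def group_eigenvalues(words: Sequence[str]) -> List[list]:
    """Build [WFG_ID, WFG_N, WFG_CHAR] entries for a segmented document.

    Words missing from the lexicon are skipped (an assumption of this sketch).
    """
    ids = [GROUP_LEXICON[w][0] for w in words if w in GROUP_LEXICON]
    pos_by_id = {gid: pos for gid, pos in GROUP_LEXICON.values()}
    counts = Counter(ids)
    return [[gid, n, pos_by_id[gid]] for gid, n in sorted(counts.items())]


features = group_eigenvalues(["computer", "电脑", "fast", "计算机", "unknown"])
```

Here the three words for "computer" all contribute to the count of group 101, which is the effect the multilingual mode relies on.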
According to a specific embodiment of the present invention, the middle foreign language participle group tightening coefficient generation module of the document to be identified is used for generating the middle foreign language participle group tightening coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightening coefficient corresponding to each middle foreign language participle group can be expressed as WFGGC_TBI=[G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], wherein G_WFG_ID_1 represents the number of participles between the first and second occurrences of this middle foreign language participle group in the document to be identified, G_WFG_ID_2 represents the number of participles between the second and third occurrences, and G_WFG_ID_(WFG_N-1) represents the number of participles between the (WFG_N-1)-th and WFG_N-th occurrences; G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the middle foreign language participle group tightening coefficients corresponding to this middle foreign language participle group. According to a specific embodiment of the present invention, the tightening coefficients corresponding to each middle foreign language participle group can further be expressed in vector form as the middle foreign language participle group tightening coefficient characteristic vector WFGGCVE_TBI=[WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], wherein WFG_ID represents the unique number of this middle foreign language participle group in the participle library, WFG_N represents the total number of times this middle foreign language participle group occurs in the document to be identified, WFG_CHAR represents the part of speech of this middle foreign language participle group, and G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the tightening coefficients defined above. From the middle foreign language participle group tightening coefficients, the overall distribution of a specific middle foreign language participle group in the corresponding document to be identified can be known.
According to a specific embodiment of the present invention, the middle foreign language participle group free vector dimension determination module of the document to be identified is used for determining the middle foreign language participle group free vector dimension WFGFV_TBI according to the word segmentation result of the document to be identified. When the document to be identified is short or yields few segmentation results, the resulting middle foreign language participle group free vector dimension WFGFV_TBI is small; when the document to be identified is long or yields many segmentation results, the resulting WFGFV_TBI is large.
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module of the document to be identified is used for simplifying the participle free vector dimension WFV_TBI of the document to be identified and generating the participle simplified vector dimension RWV_TBI of the document to be identified. The participle simplified vector dimension RWV_TBI is specified by the system. Preferably, the system specifies the participle simplified vector dimension RWV_TBI as 500. Preferably, the system specifies the participle simplified vector dimension RWV_TBI as 800. Preferably, the system specifies the participle simplified vector dimension RWV_TBI as 1000.
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module of the document to be identified adopts an equal-interval extraction method to simplify the participle free vector dimension WFV_TBI of the document to be identified. The simplification process is as follows: judge whether the participle free vector dimension WFV_TBI is greater than the participle simplified vector dimension RWV_TBI; if so, divide WFV_TBI by the system-specified RWV_TBI and round the quotient up to obtain the simplification coefficient REDU_TBI of the document to be identified; then, among the eigenvalues corresponding to WFV_TBI, extract one eigenvalue at intervals of REDU_TBI-1 eigenvalues; after all eigenvalues have been traversed, judge whether the number of extracted eigenvalues is equal to RWV_TBI. When the number of extracted eigenvalues is equal to RWV_TBI, the simplification of WFV_TBI is complete. When the number of extracted eigenvalues is less than RWV_TBI, calculate the difference between RWV_TBI and the number of extracted eigenvalues, and randomly extract from the eigenvalues not yet extracted a number of eigenvalues equal to this difference, which completes the simplification of the participle free vector dimension WFV_TBI of the document to be identified.
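The equal-interval extraction can be sketched in Python as follows. This is only an illustration under stated assumptions: extraction starts at the first eigenvalue, the selected eigenvalues are returned in their original order, and a seeded random generator is used so the example is repeatable. None of these choices is fixed by the text, and the function name `reduce_equal_interval` is hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def reduce_equal_interval(features: Sequence[T], target_dim: int,
                          rng: Optional[random.Random] = None) -> List[T]:
    """Reduce len(features) (WFV_TBI) to target_dim (RWV_TBI)."""
    rng = rng or random.Random(0)  # fixed seed: assumption for repeatability
    n = len(features)
    if n <= target_dim:
        return list(features)  # nothing to reduce
    step = math.ceil(n / target_dim)  # REDU_TBI, the rounded-up quotient
    picked = list(range(0, n, step))  # one eigenvalue every REDU_TBI positions
    shortfall = target_dim - len(picked)
    if shortfall > 0:
        # Top up at random from the eigenvalues that were not extracted.
        remaining = sorted(set(range(n)) - set(picked))
        picked += rng.sample(remaining, shortfall)
    return [features[i] for i in sorted(picked)]


reduced = reduce_equal_interval(list(range(1200)), 500)
exact = reduce_equal_interval(list(range(1000)), 500)
```

With 1200 eigenvalues and a target of 500, the coefficient is 3, the regular pass yields 400 eigenvalues, and the remaining 100 are drawn at random. Because the quotient is rounded up, the regular pass can never yield more than the target.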
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module of the document to be identified adopts a part-of-speech screening method to simplify the participle free vector dimension WFV_TBI of the document to be identified. The simplification process is as follows: the eigenvalues are classified according to the part of speech of the corresponding participle; according to a specific embodiment of the present invention, the eigenvalues are divided into A1 class notional word eigenvalues, A2 class notional word eigenvalues, B class notional word eigenvalues, C class notional word eigenvalues, D class notional word eigenvalues and V class function word eigenvalues. It is generally considered that the eigenvalues corresponding to notional words play a larger role in similarity comparison, and that technical-term nouns embody the effective content of the document to be identified better than common nouns. The number of eigenvalues in each class is counted: AMOUNT_A1 (the number of A1 class notional word eigenvalues), AMOUNT_A2 (the number of A2 class notional word eigenvalues), AMOUNT_B (the number of B class notional word eigenvalues), AMOUNT_C (the number of C class notional word eigenvalues), AMOUNT_D (the number of D class notional word eigenvalues) and AMOUNT_V (the number of V class function word eigenvalues). Calculate the value RWV_TBI_S_V = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V); if it is greater than 0, exit this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_D = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWV_TBI_S_D is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_V a number of eigenvalues equal to RWV_TBI_S_D, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_C = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWV_TBI_S_C is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_D a number of eigenvalues equal to RWV_TBI_S_C, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_B = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWV_TBI_S_B is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_C a number of eigenvalues equal to RWV_TBI_S_B, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_A2 = RWV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWV_TBI_S_A2 is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_B a number of eigenvalues equal to RWV_TBI_S_A2, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_A1 = RWV_TBI-AMOUNT_A1. If RWV_TBI_S_A1 is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_A2 a number of eigenvalues equal to RWV_TBI_S_A1, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, randomly extract from the eigenvalues counted in AMOUNT_A1 a number of eigenvalues equal to the participle simplified vector dimension RWV_TBI, which completes this simplification.
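The part-of-speech screening cascade can be condensed into a single loop over the classes in priority order. The Python sketch below is one possible reading, not the patented implementation: it assumes that every class ranked above the one being sampled is kept in full, which the text implies but does not state outright, and it returns `None` for the "exit" case. The names `reduce_by_part_of_speech` and `PRIORITY`, and the sample data, are hypothetical.

```python
import random
from typing import Dict, List, Optional

PRIORITY = ["A1", "A2", "B", "C", "D", "V"]  # highest priority first


def reduce_by_part_of_speech(by_class: Dict[str, List[int]], target_dim: int,
                             rng: Optional[random.Random] = None
                             ) -> Optional[List[int]]:
    """Keep target_dim (RWV_TBI) eigenvalues, preferring higher classes."""
    rng = rng or random.Random(0)  # fixed seed: assumption for repeatability
    total = sum(len(by_class.get(c, [])) for c in PRIORITY)
    if total < target_dim:
        return None  # RWV_TBI_S_V > 0: document too short, exit
    kept: List[int] = []
    for cls in PRIORITY:
        room = target_dim - len(kept)
        if room == 0:
            break
        items = by_class.get(cls, [])
        if len(items) <= room:
            kept.extend(items)  # the whole class fits
        else:
            kept.extend(rng.sample(items, room))  # fill the rest at random
    return kept


classes = {"A1": [0, 1, 2], "A2": [10, 11], "B": [20],
           "C": [30, 31], "D": [40], "V": [50, 51, 52]}
ten = reduce_by_part_of_speech(classes, 10)
two = reduce_by_part_of_speech(classes, 2)
too_short = reduce_by_part_of_speech(classes, 13)
```

In the first call, classes A1 through D supply 9 eigenvalues, so RWV_TBI_S_D is 1 and exactly one V class eigenvalue is drawn at random. In the second call only A1 is sampled, which corresponds to the last branch of the cascade.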
The case in which the value RWV_TBI_S_V = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the document to be identified is short or contains little information, and is therefore not suitable for comparison by eigenvalues.
When the participle free vector dimension WFV_TBI of the document to be identified is less than the participle simplified vector dimension RWV_TBI, the document itself has few dimensions, and the values under the remaining dimensions are treated as 0; such a document can be marked directly in the system and included and processed separately.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module of the document to be identified is used for simplifying the participle group free vector dimension WGFV_TBI of the document to be identified and generating the participle group simplified vector dimension RWGV_TBI of the document to be identified. The participle group simplified vector dimension RWGV_TBI is specified by the system. Preferably, the system specifies the participle group simplified vector dimension RWGV_TBI as 500. Preferably, the system specifies the participle group simplified vector dimension RWGV_TBI as 800. Preferably, the system specifies the participle group simplified vector dimension RWGV_TBI as 1000.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module of the document to be identified adopts an equal-interval extraction method to simplify the participle group free vector dimension WGFV_TBI of the document to be identified. The simplification process is as follows: judge whether the participle group free vector dimension WGFV_TBI is greater than the participle group simplified vector dimension RWGV_TBI; if so, divide WGFV_TBI by the system-specified RWGV_TBI and round the quotient up to obtain the simplification coefficient REDU_TBI; then, among the eigenvalues corresponding to WGFV_TBI, extract one eigenvalue at intervals of REDU_TBI-1 eigenvalues; after all eigenvalues have been traversed, judge whether the number of extracted eigenvalues is equal to RWGV_TBI. When the number of extracted eigenvalues is equal to RWGV_TBI, the simplification of WGFV_TBI is complete. When the number of extracted eigenvalues is less than RWGV_TBI, calculate the difference between RWGV_TBI and the number of extracted eigenvalues, and randomly extract from the eigenvalues not yet extracted a number of eigenvalues equal to this difference, which completes the simplification of the participle group free vector dimension WGFV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module of the document to be identified adopts a part-of-speech screening method to simplify the participle group free vector dimension WGFV_TBI of the document to be identified. The simplification process is as follows: the eigenvalues are classified according to the part of speech of the corresponding participle group; according to a specific embodiment of the present invention, the eigenvalues are divided into A1 class notional word eigenvalues, A2 class notional word eigenvalues, B class notional word eigenvalues, C class notional word eigenvalues, D class notional word eigenvalues and V class function word eigenvalues. It is generally considered that the eigenvalues corresponding to notional words play a larger role in similarity comparison, and that technical-term nouns embody the effective content of the document to be identified better than common nouns. The number of eigenvalues in each class is counted: AMOUNT_A1 (the number of A1 class notional word eigenvalues), AMOUNT_A2 (the number of A2 class notional word eigenvalues), AMOUNT_B (the number of B class notional word eigenvalues), AMOUNT_C (the number of C class notional word eigenvalues), AMOUNT_D (the number of D class notional word eigenvalues) and AMOUNT_V (the number of V class function word eigenvalues). Calculate the value RWGV_TBI_S_V = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V); if it is greater than 0, exit this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_D = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWGV_TBI_S_D is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_V a number of eigenvalues equal to RWGV_TBI_S_D, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_C = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWGV_TBI_S_C is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_D a number of eigenvalues equal to RWGV_TBI_S_C, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_B = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWGV_TBI_S_B is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_C a number of eigenvalues equal to RWGV_TBI_S_B, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_A2 = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWGV_TBI_S_A2 is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_B a number of eigenvalues equal to RWGV_TBI_S_A2, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_A1 = RWGV_TBI-AMOUNT_A1. If RWGV_TBI_S_A1 is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_A2 a number of eigenvalues equal to RWGV_TBI_S_A1, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, randomly extract from the eigenvalues counted in AMOUNT_A1 a number of eigenvalues equal to the participle group simplified vector dimension RWGV_TBI, which completes this simplification.
The case in which the value RWGV_TBI_S_V = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the document to be identified is short or contains little information, and is therefore not suitable for comparison by eigenvalues.
When the participle group free vector dimension WGFV_TBI of the document to be identified is less than the participle group simplified vector dimension RWGV_TBI, the document itself has few dimensions, and the values under the remaining dimensions are treated as 0; such a document can be marked directly in the system and included and processed separately.
According to a specific embodiment of the present invention, the middle foreign language participle group simplified vector dimension generation module of the document to be identified is used for simplifying the middle foreign language participle group free vector dimension WFGFV_TBI of the document to be identified and generating the middle foreign language participle group simplified vector dimension RWFGV_TBI of the document to be identified. The middle foreign language participle group simplified vector dimension RWFGV_TBI is specified by the system. Preferably, the system specifies the middle foreign language participle group simplified vector dimension RWFGV_TBI as 500. Preferably, the system specifies the middle foreign language participle group simplified vector dimension RWFGV_TBI as 800. Preferably, the system specifies the middle foreign language participle group simplified vector dimension RWFGV_TBI as 1000.
According to a specific embodiment of the present invention, the middle foreign language participle group simplified vector dimension generation module of the document to be identified adopts an equal-interval extraction method to simplify the middle foreign language participle group free vector dimension WFGFV_TBI of the document to be identified. The simplification process is as follows: judge whether the middle foreign language participle group free vector dimension WFGFV_TBI is greater than the middle foreign language participle group simplified vector dimension RWFGV_TBI; if so, divide WFGFV_TBI by the system-specified RWFGV_TBI and round the quotient up to obtain the simplification coefficient REDU_TBI; then, among the eigenvalues corresponding to WFGFV_TBI, extract one eigenvalue at intervals of REDU_TBI-1 eigenvalues; after all eigenvalues have been traversed, judge whether the number of extracted eigenvalues is equal to RWFGV_TBI. When the number of extracted eigenvalues is equal to RWFGV_TBI, the simplification of WFGFV_TBI is complete. When the number of extracted eigenvalues is less than RWFGV_TBI, calculate the difference between RWFGV_TBI and the number of extracted eigenvalues, and randomly extract from the eigenvalues not yet extracted a number of eigenvalues equal to this difference, which completes the simplification of the middle foreign language participle group free vector dimension WFGFV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the middle foreign language participle group simplified vector dimension generation module of the document to be identified adopts a part-of-speech screening method to simplify the middle foreign language participle group free vector dimension WFGFV_TBI of the document to be identified. The simplification process is as follows: the eigenvalues are classified according to the part of speech of the corresponding middle foreign language participle group; according to a specific embodiment of the present invention, the eigenvalues are divided into A1 class notional word eigenvalues, A2 class notional word eigenvalues, B class notional word eigenvalues, C class notional word eigenvalues, D class notional word eigenvalues and V class function word eigenvalues. It is generally considered that the eigenvalues corresponding to notional words play a larger role in similarity comparison, and that technical-term nouns embody the effective content of the document to be identified better than common nouns. The number of eigenvalues in each class is counted: AMOUNT_A1 (the number of A1 class notional word eigenvalues), AMOUNT_A2 (the number of A2 class notional word eigenvalues), AMOUNT_B (the number of B class notional word eigenvalues), AMOUNT_C (the number of C class notional word eigenvalues), AMOUNT_D (the number of D class notional word eigenvalues) and AMOUNT_V (the number of V class function word eigenvalues). Calculate the value RWFGV_TBI_S_V = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V); if it is greater than 0, exit this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_D = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWFGV_TBI_S_D is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_V a number of eigenvalues equal to RWFGV_TBI_S_D, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_C = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWFGV_TBI_S_C is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_D a number of eigenvalues equal to RWFGV_TBI_S_C, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_B = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWFGV_TBI_S_B is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_C a number of eigenvalues equal to RWFGV_TBI_S_B, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_A2 = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWFGV_TBI_S_A2 is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_B a number of eigenvalues equal to RWFGV_TBI_S_A2, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_A1 = RWFGV_TBI-AMOUNT_A1. If RWFGV_TBI_S_A1 is greater than 0, randomly extract from the eigenvalues counted in AMOUNT_A2 a number of eigenvalues equal to RWFGV_TBI_S_A1, which completes this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, randomly extract from the eigenvalues counted in AMOUNT_A1 a number of eigenvalues equal to the middle foreign language participle group simplified vector dimension RWFGV_TBI, which completes this simplification.
The case in which the value RWFGV_TBI_S_V = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0 means that the document to be identified is short or contains little information, and is therefore not suitable for comparison by eigenvalues.
When the middle foreign language participle group free vector dimension WFGFV_TBI of the document to be identified is less than the middle foreign language participle group simplified vector dimension RWFGV_TBI, the document itself has few dimensions, and the values under the remaining dimensions are treated as 0; such a document can be marked directly in the system and included and processed separately.
Preferably, for ease of similarity comparison, the material participle simplified vector dimension RWV selected in the system should be equal to the participle simplified vector dimension RWV_TBI of the document to be identified; the material participle group simplified vector dimension RWGV should be equal to the participle group simplified vector dimension RWGV_TBI of the document to be identified; and the material middle foreign language participle group simplified vector dimension RWFGV should be equal to the middle foreign language participle group simplified vector dimension RWFGV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the participle characteristic vector generation module of the document to be identified extracts, according to the participle simplified vector dimension RWV_TBI, the eigenvalues of each document to be identified that correspond to RWV_TBI, and generates the participle characteristic vector WVE_RWV_TBI of the document to be identified, wherein
WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI]
wherein W_ID_i represents the unique number of a participle in the participle library and W_N_i represents the total number of times this participle occurs in the document to be identified; this count is used as the eigenvalue of the participle.
According to a specific embodiment of the present invention, when the user detection mode decision module judges that the current user detection mode is the common plagiarism identification mode, then for similarity comparison the participle characteristic vector generation module of the document to be identified generates the participle characteristic vector WVE_RWV_TBI of the document to be identified, WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI], whose dimension is RWV_TBI; the participle characteristic vector generation module generates the participle characteristic vector WVE_RWV of the material in the comparison database, WVE_RWV=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]. The dimension RWV_TBI of the participle characteristic vector of the document to be identified is equal to the dimension RWV of the participle characteristic vector of the material.
It should be noted that, although the participle characteristic vectors WVE_RWV_TBI and WVE_RWV both use W_ID_i to represent the unique number of a participle in the participle library and W_N_i to represent the total number of times this participle occurs in the corresponding document, with this count used as the eigenvalue of the participle, the W_ID_i values in WVE_RWV_TBI are very likely to differ from the W_ID_i values in WVE_RWV. Therefore, when similarity comparison is carried out, the dimensions of the two participle characteristic vectors must be adjusted to be consistent.
According to a specific embodiment of the present invention, the characteristic vector adjusting module of the document to be identified is used for sorting the W_ID_i values corresponding to all eigenvalues in the participle characteristic vector WVE_RWV_TBI in ascending or descending order of their numbers in the participle library and inserting the missing W_ID_i values, the eigenvalue corresponding to each inserted participle number W_ID_i being 0. Assuming that the total number of participle numbers in the participle library is W, the number of participle numbers to be inserted is W-RWV_TBI, which gives the extended participle characteristic vector of the document to be identified WVE_RWV_TBI_EXT=[W_ID_TBI_EXT_1, W_N_TBI_EXT_1, ..., W_ID_TBI_EXT_i, W_N_TBI_EXT_i, ..., W_ID_TBI_EXT_RWV_TBI, W_N_TBI_EXT_RWV_TBI, ..., W_ID_W, W_N_W].
According to a specific embodiment of the present invention, the material feature vector adjusting module sorts the W_ID_i values corresponding to all feature values in the word feature vector WVE_RWV in ascending or descending order of their numbers in the word library and inserts the missing W_ID_i values; the feature value corresponding to each inserted word number W_ID_i is 0. Assuming the total count of word numbers in the word library is W, the number of word numbers to be inserted is W - RWV. This yields the extended word feature vector of the material: WVE_RWV_EXT = [W_ID_EXT_1, W_N_EXT_1, ..., W_ID_EXT_i, W_N_EXT_i, ..., W_ID_EXT_RWV, W_N_EXT_RWV, ..., W_ID_W, W_N_W].
In this way, the word feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the word library, so that the feature values of the two vectors correspond dimension by dimension.
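The extension step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `expand_to_library` is hypothetical, and it assumes that word numbers in the word library run from 1 to W and that a feature vector is stored as a flat list of alternating W_ID and W_N entries.

```python
def expand_to_library(pairs, library_size):
    """Expand a flat [W_ID_1, W_N_1, ...] list into a dense count vector.

    Assumption: word numbers run from 1 to library_size (W), so index
    i - 1 of the result holds W_N for the word whose W_ID is i.
    """
    if len(pairs) % 2 != 0:
        raise ValueError("expected alternating W_ID, W_N entries")
    dense = [0] * library_size  # every inserted word number gets feature value 0
    for word_id, count in zip(pairs[0::2], pairs[1::2]):
        if not 1 <= word_id <= library_size:
            raise ValueError(f"word number {word_id} is outside 1..{library_size}")
        dense[word_id - 1] = count
    return dense


# Hypothetical data: a word library with W = 8 entries.
wve_rwv_tbi = [7, 2, 3, 5]  # word 7 occurs twice, word 3 five times
wve_rwv = [3, 1, 4, 4]      # word 3 occurs once, word 4 four times
tbi_ext = expand_to_library(wve_rwv_tbi, 8)
mat_ext = expand_to_library(wve_rwv, 8)
```

Because both results are indexed by word number, position i of one vector and position i of the other always refer to the same word, which is what the similarity formula requires.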
The similarity calculation module for ordinary plagiarism identification calculates the similarity between the document to be identified and any material in the comparison database by the following formula:
$$\mathrm{Sim}(WVE\_RWV\_TBI,\ WVE\_RWV) = \mathrm{Sim}(WVE\_RWV\_TBI\_EXT,\ WVE\_RWV\_EXT) = \frac{2\sum_{i=1}^{W} W\_N_{TBI\_EXT\_i} \times W\_N_{EXT\_i}}{\sum_{i=1}^{W} W\_N_{TBI\_EXT\_i}^{2} + \sum_{i=1}^{W} W\_N_{EXT\_i}^{2} + \sum_{i=1}^{W} W\_N_{TBI\_EXT\_i}^{2} \times \sum_{i=1}^{W} W\_N_{EXT\_i}^{2}}$$
According to a specific embodiment of the present invention, when the user detection mode judging module determines that the current user detection mode is the extended plagiarism identification mode, then for similarity comparison the word-group feature vector generation module for the document to be identified generates the word-group feature vector WVE_RWGV_TBI of the document to be identified: WVE_RWGV_TBI = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV_TBI, WG_N_RWGV_TBI], whose dimension is RWGV_TBI. The word-group feature vector generation module generates the word-group feature vector WVE_RWGV of a material in the comparison database: WVE_RWGV = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV, WG_N_RWGV]. Here WG_ID_i denotes the unique number of a word group in the word library, and WG_N_i denotes the total number of times that word group occurs in the document to be identified; this count serves as the feature value of the word group. The dimension RWGV_TBI of the word-group feature vector of the document to be identified is equal to the dimension RWGV of the word-group feature vector of the material.
Similarly to the processing in the ordinary plagiarism identification mode, according to a specific embodiment of the present invention, in the extended plagiarism identification mode the feature vector adjusting module for the document to be identified produces the extended word-group feature vector of the document to be identified: WVE_RWGV_TBI_EXT = [WG_ID_TBI_EXT_1, WG_N_TBI_EXT_1, ..., WG_ID_TBI_EXT_i, WG_N_TBI_EXT_i, ..., WG_ID_TBI_EXT_RWGV_TBI, WG_N_TBI_EXT_RWGV_TBI, ..., WG_ID_W, WG_N_W]. The material feature vector adjusting module produces the extended word-group feature vector of the material: WVE_RWGV_EXT = [WG_ID_EXT_1, WG_N_EXT_1, ..., WG_ID_EXT_i, WG_N_EXT_i, ..., WG_ID_EXT_RWGV, WG_N_EXT_RWGV, ..., WG_ID_W, WG_N_W].
In this way, the word-group feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the word library, so that the feature values of the two vectors correspond dimension by dimension.
The similarity calculation module for extended plagiarism identification calculates the similarity between the document to be identified and any material in the comparison database by the following formula:
$$\mathrm{Sim}(WVE\_RWGV\_TBI,\ WVE\_RWGV) = \mathrm{Sim}(WVE\_RWGV\_TBI\_EXT,\ WVE\_RWGV\_EXT) = \frac{2\sum_{i=1}^{W} WG\_N_{TBI\_EXT\_i} \times WG\_N_{EXT\_i}}{\sum_{i=1}^{W} WG\_N_{TBI\_EXT\_i}^{2} + \sum_{i=1}^{W} WG\_N_{EXT\_i}^{2} + \sum_{i=1}^{W} WG\_N_{TBI\_EXT\_i}^{2} \times \sum_{i=1}^{W} WG\_N_{EXT\_i}^{2}}$$
According to a specific embodiment of the present invention, when the user detection mode judging module determines that the current user detection mode is the multilingual plagiarism identification mode, then for similarity comparison the Chinese-foreign word-group feature vector generation module for the document to be identified generates the Chinese-foreign word-group feature vector WVE_RWFGV_TBI of the document to be identified: WVE_RWFGV_TBI = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV_TBI, WFG_N_RWFGV_TBI], whose dimension is RWFGV_TBI. The word-group feature vector generation module generates the Chinese-foreign word-group feature vector WVE_RWFGV of a material in the comparison database: WVE_RWFGV = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV, WFG_N_RWFGV]. Here WFG_ID_i denotes the unique number of a Chinese-foreign word group in the word library, and WFG_N_i denotes the total number of times that Chinese-foreign word group occurs in the document to be identified; this count serves as the feature value of the Chinese-foreign word group. The dimension RWFGV_TBI of the Chinese-foreign word-group feature vector of the document to be identified is equal to the dimension RWFGV of the Chinese-foreign word-group feature vector of the material.
Similarly to the processing in the ordinary plagiarism identification mode, according to a specific embodiment of the present invention, in the multilingual plagiarism identification mode the feature vector adjusting module for the document to be identified produces the extended Chinese-foreign word-group feature vector of the document to be identified: WVE_RWFGV_TBI_EXT = [WFG_ID_TBI_EXT_1, WFG_N_TBI_EXT_1, ..., WFG_ID_TBI_EXT_i, WFG_N_TBI_EXT_i, ..., WFG_ID_TBI_EXT_RWFGV_TBI, WFG_N_TBI_EXT_RWFGV_TBI, ..., WFG_ID_W, WFG_N_W]. The material feature vector adjusting module produces the extended Chinese-foreign word-group feature vector of the material: WVE_RWFGV_EXT = [WFG_ID_EXT_1, WFG_N_EXT_1, ..., WFG_ID_EXT_i, WFG_N_EXT_i, ..., WFG_ID_EXT_RWFGV, WFG_N_EXT_RWFGV, ..., WFG_ID_W, WFG_N_W].
In this way, the Chinese-foreign word-group feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the word library, so that the feature values of the two vectors correspond dimension by dimension.
The similarity calculation module for multilingual plagiarism identification calculates the similarity between the document to be identified and any material in the comparison database by the following formula:
$$\mathrm{Sim}(WVE\_RWFGV\_TBI,\ WVE\_RWFGV) = \mathrm{Sim}(WVE\_RWFGV\_TBI\_EXT,\ WVE\_RWFGV\_EXT) = \frac{2\sum_{i=1}^{W} WFG\_N_{TBI\_EXT\_i} \times WFG\_N_{EXT\_i}}{\sum_{i=1}^{W} WFG\_N_{TBI\_EXT\_i}^{2} + \sum_{i=1}^{W} WFG\_N_{EXT\_i}^{2} + \sum_{i=1}^{W} WFG\_N_{TBI\_EXT\_i}^{2} \times \sum_{i=1}^{W} WFG\_N_{EXT\_i}^{2}}$$
According to a specific embodiment of the present invention, to keep the extended dimension from becoming too large, all word IDs in the word feature vector WVE_RWV_TBI may be taken as one set and the word IDs in WVE_RWV as another set; or all word-group IDs in the word-group feature vector WVE_RWGV_TBI as one set and those in WVE_RWGV as another set; or all Chinese-foreign word-group IDs in the Chinese-foreign word-group feature vector WVE_RWFGV_TBI as one set and those in WVE_RWFGV as another set. The union of the two sets gives the total ID set. The feature vectors of the document to be identified and of the material in the comparison database are then extended according to the total ID set: the IDs corresponding to all feature values are sorted in ascending or descending order of their numbers in the word library, and each vector receives the W_ID_i values that the total word ID set contains but its own set lacks, the feature value corresponding to each inserted word number W_ID_i being 0; or the WG_ID_i values that the total word-group ID set contains but its own set lacks, the feature value corresponding to each inserted word-group number WG_ID_i being 0; or the WFG_ID_i values that the total Chinese-foreign word-group ID set contains but its own set lacks, the feature value corresponding to each inserted Chinese-foreign word-group number WFG_ID_i being 0.
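The union-based alternative can be sketched as below. This is only an illustration under assumptions: `align_on_union` is a hypothetical name, feature vectors are again taken to be flat lists of alternating ID and count entries, and ascending order is chosen for the shared arrangement.

```python
def align_on_union(pairs_a, pairs_b):
    """Align two flat [ID_1, N_1, ID_2, N_2, ...] vectors on the union of their IDs.

    Returns the sorted total ID set and two count vectors of equal length;
    an ID missing from one vector gets feature value 0 there.
    """
    counts_a = dict(zip(pairs_a[0::2], pairs_a[1::2]))
    counts_b = dict(zip(pairs_b[0::2], pairs_b[1::2]))
    all_ids = sorted(set(counts_a) | set(counts_b))  # total ID set, ascending
    vec_a = [counts_a.get(i, 0) for i in all_ids]
    vec_b = [counts_b.get(i, 0) for i in all_ids]
    return all_ids, vec_a, vec_b


# Hypothetical data: the same two vectors, aligned on 3 IDs instead of all W.
ids, doc_vec, mat_vec = align_on_union([7, 2, 3, 5], [3, 1, 4, 4])
```

The resulting vectors have as many entries as there are distinct IDs in the two inputs, which is usually far fewer than the size W of the whole word library.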
Depending on the access mode of the user, materials from different sub-libraries of the comparison database are made available for similarity comparison. The comparison is performed by traversal: the feature vectors of all materials in the selected scope are extracted and compared for similarity with the document to be identified. Each calculated similarity value is compared with a predetermined threshold; when the similarity value is higher than the predetermined threshold, the corresponding material is recorded as a suspect material for later use.
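A minimal sketch of this traversal follows. The names `screen_materials` and `cosine` are hypothetical, and the cosine measure is only a stand-in so that the example runs; it is not the similarity formula of this description, which would be passed in as the `similarity` argument instead.

```python
import math


def screen_materials(doc_vec, materials, similarity, threshold):
    """Traverse all materials and keep those whose similarity exceeds the threshold.

    materials maps a material identifier to its aligned feature vector.
    """
    suspects = []
    for material_id, vec in materials.items():
        score = similarity(doc_vec, vec)
        if score > threshold:  # "higher than the predetermined threshold"
            suspects.append((material_id, score))
    return suspects


def cosine(a, b):
    """Stand-in similarity measure (an assumption, used only for this demo)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


# Hypothetical aligned vectors for one document and two materials.
materials = {"m1": [5, 0, 2], "m2": [0, 4, 0]}
suspects = screen_materials([5, 0, 2], materials, cosine, threshold=0.8)
```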
After the document to be identified has been compared with all materials, all suspect materials are extracted, and the document to be identified is compared further against the suspect materials.
According to a preferred embodiment of the invention, all materials in the proverb and common-saying library, the famous-quotation library, and the poem library may be selected as suspect materials.
According to a preferred embodiment of the invention, a material whose word free vector dimension WFV is smaller than the word reduced vector dimension RWV may be selected as a suspect material.
According to a preferred embodiment of the invention, a material whose word-group free vector dimension WGFV is smaller than the word-group reduced vector dimension RWGV may be selected as a suspect material.
According to a preferred embodiment of the invention, a material whose Chinese-foreign word-group free vector dimension WFGFV is smaller than the Chinese-foreign word-group reduced vector dimension RWFGV may be selected as a suspect material.
According to a preferred embodiment of the invention, suspect materials may further be selected by means of the word compactness coefficient.
According to a specific embodiment of the present invention, in the ordinary plagiarism identification mode, suspect materials can be screened according to the word compactness coefficient of the document to be identified and the word compactness coefficient of the material. The compactness coefficient statistics module for the document to be identified extracts high-density words and their positions from the word compactness coefficient feature vector corresponding to each word in the document to be identified, WGCVE_TBI = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)]. Using the part of speech W_CHAR in the word compactness coefficient feature vector, the module selects the words whose part of speech is a content word and computes the total number of interval words across a predetermined number of adjacent occurrences of the word, where n is the predetermined adjacent number. When this total is less than the predetermined compactness threshold TH_G, the ID of the word and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the extended plagiarism identification mode, suspect materials can be screened according to the word-group compactness coefficient of the document to be identified and the word-group compactness coefficient of the material. The compactness coefficient statistics module for the document to be identified extracts high-density word groups and their positions from the word-group compactness coefficient feature vector corresponding to each word group in the document to be identified, WGGCVE_TBI = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)]. Using the part of speech WG_CHAR in the word-group compactness coefficient feature vector, the module selects the word groups whose part of speech is a content word and computes the total number of interval words across a predetermined number of adjacent occurrences of the word group, where n is the predetermined adjacent number. When this total is less than the predetermined compactness threshold TH_G, the ID of the word group and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the multilingual plagiarism identification mode, suspect materials can be screened according to the Chinese-foreign word-group compactness coefficient of the document to be identified and the Chinese-foreign word-group compactness coefficient of the material. The compactness coefficient statistics module for the document to be identified extracts high-density word groups and their positions from the compactness coefficient feature vector corresponding to each Chinese-foreign word group in the document to be identified, WFGGCVE_TBI = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)]. Using the part of speech WFG_CHAR in the Chinese-foreign word-group compactness coefficient feature vector, the module selects the word groups whose part of speech is a content word and computes the total number of interval words across a predetermined number of adjacent occurrences of the word group, where n is the predetermined adjacent number. When this total is less than the predetermined compactness threshold TH_G, the ID of the Chinese-foreign word group and the corresponding position are recorded.
The values of the predetermined adjacent number n and the compactness threshold TH_G are preset by the system and can be adjusted as actually needed. When the total number of interval words across the predetermined number of adjacent occurrences is less than the predetermined compactness threshold TH_G, the content word can be considered to occur densely at that position, possibly because a particular viewpoint is being expounded there, and the position deserves focused attention.
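The compactness check can be illustrated with a short sketch. It rests on two assumptions that the text above does not spell out: that each entry G_W_ID_i is the number of words lying between the i-th and (i+1)-th occurrence of the word, and that the total is the sum of n consecutive entries. The function name `dense_positions` is hypothetical.

```python
def dense_positions(gaps, n, th_g):
    """Return the start indices where n consecutive gaps sum to less than th_g.

    gaps: [G_1, ..., G_(W_N-1)], assumed to hold the number of words between
    successive occurrences of one content word.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    hits = []
    window_sum = sum(gaps[:n])
    for start in range(len(gaps) - n + 1):
        if start > 0:
            # Slide the window: add the newest gap, drop the oldest one.
            window_sum += gaps[start + n - 1] - gaps[start - 1]
        if window_sum < th_g:
            hits.append(start)
    return hits


# Hypothetical gap sequence: the word clusters around its 2nd to 5th occurrence.
hits = dense_positions([40, 2, 1, 3, 50, 60], n=3, th_g=10)
```

A word with fewer than n gaps simply yields no hits, so rare words are skipped without special handling.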
In the ordinary plagiarism identification mode, the compactness coefficient suspect material extraction module takes the word IDs recorded when the total number of interval words across the predetermined number of adjacent occurrences was less than the predetermined compactness threshold TH_G, and extracts all materials in the comparison database that contain such a word ID. For each material it takes the word compactness coefficient feature vector corresponding to that word ID, WGCVE = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)], and computes the total number of interval words across the predetermined number of adjacent occurrences, where n is the predetermined adjacent number. When this total is less than the predetermined compactness threshold TH_G, the material is selected as a suspect material. There may be one or more such word IDs, and accordingly one or more materials containing those word IDs may be extracted.
In the extended plagiarism identification mode, the compactness coefficient suspect material extraction module takes the word-group IDs recorded when the total number of interval words across the predetermined number of adjacent occurrences of a word group was less than the predetermined compactness threshold TH_G, and extracts all materials in the comparison database that contain such a word-group ID. For each material it takes the word-group compactness coefficient feature vector corresponding to that word-group ID, WGGCVE = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)], and computes the total number of interval words across the predetermined number of adjacent occurrences of the word group, where n is the predetermined adjacent number. When this total is less than the predetermined compactness threshold TH_G, the material is selected as a suspect material. There may be one or more such word-group IDs, and accordingly one or more materials containing those word-group IDs may be extracted.
In the multilingual plagiarism identification mode, the compactness coefficient suspect material extraction module takes the Chinese-foreign word-group IDs recorded when the total number of interval words across the predetermined number of adjacent occurrences of a Chinese-foreign word group was less than the predetermined compactness threshold TH_G, and extracts all materials in the comparison database that contain such a Chinese-foreign word-group ID. For each material it takes the Chinese-foreign word-group compactness coefficient feature vector corresponding to that ID, WFGGCVE = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)], and computes the total number of interval words across the predetermined number of adjacent occurrences of the Chinese-foreign word group, where n is the predetermined adjacent number. When this total is less than the predetermined compactness threshold TH_G, the material is selected as a suspect material. There may be one or more such Chinese-foreign word-group IDs, and accordingly one or more materials containing those IDs may be extracted.
With this extraction approach, content words that do not occur many times overall in the document to be identified, but that may be concentrated at certain positions, can be extracted together with their positions for further comparison.
According to a specific embodiment of the present invention, in the formula plagiarism identification mode, the formula extraction module extracts the formulas in the document to be identified. The formula decomposition module extracts, for each formula, its independent-variable parameters and dependent-variable parameters, its operators, and the specific meaning, dimension, and value range of each parameter. The formula comparison module compares these items, one by one, for each formula extracted from the document to be identified with the corresponding items of the formulas stored in the formula library. When the degree of overlap between the independent-variable parameters, dependent-variable parameters, operators, dimensions, and value ranges of a formula in the document to be identified and those of a formula stored in the formula library exceeds the formula comparison threshold TH_MATH, the material associated in the formula library with the formula currently being compared is taken as a suspect material. The degree of overlap is the ratio of the total number of independent-variable parameters, dependent-variable parameters, operators, and dimensions that are identical between the formula in the document to be identified and the formula in the formula library to the total number of independent-variable parameters, dependent-variable parameters, operators, and dimensions of the current formula in the document to be identified.
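The degree of overlap defined above can be sketched as follows. The dictionary layout, the category keys, and the name `formula_overlap` are assumptions made for the illustration; the text does not specify how a decomposed formula is stored.

```python
from collections import Counter

# Hypothetical keys for the four item categories counted in the overlap ratio.
CATEGORIES = ("independent", "dependent", "operators", "dimensions")


def formula_overlap(doc_formula, lib_formula):
    """Ratio of identical items to the number of items in the document's formula."""
    shared = 0
    total = 0
    for key in CATEGORIES:
        doc_items = Counter(doc_formula.get(key, []))
        lib_items = Counter(lib_formula.get(key, []))
        shared += sum((doc_items & lib_items).values())  # multiset intersection
        total += sum(doc_items.values())
    return shared / total if total else 0.0


# Hypothetical decomposed formulas: F = m * a in the document, F = m * g in the library.
doc_formula = {
    "independent": ["m", "a"],
    "dependent": ["F"],
    "operators": ["=", "*"],
    "dimensions": ["kg", "m/s^2", "N"],
}
lib_formula = {
    "independent": ["m", "g"],
    "dependent": ["F"],
    "operators": ["=", "*"],
    "dimensions": ["kg", "m/s^2", "N"],
}
overlap = formula_overlap(doc_formula, lib_formula)
```

Here 7 of the 8 items in the document's formula also appear in the library formula, so the material would be marked as suspect for any TH_MATH below 0.875.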
According to a specific embodiment of the present invention, a sliding window may be used to compare the full text of the document to be identified with a suspect material. The size of the sliding window can be configured by the system. The size directly affects the comparison: a window that is too small tends to cause false positives, and a window that is too large tends to cause missed detections. The sliding step of the sliding window is also preset by the system. As shown in Fig. 2:
Step S0: start.
Step S1: the sliding window setting module initializes the similar-window counter CT1 = 0 and the sliding-step counter CT2 = 0.
Step S2: the sliding window setting module places the sliding windows of the document to be identified and of the suspect material at the start positions of their respective documents.
Step S3: the sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspect material and counts the identical content words in them.
Step S4: the sliding window comparison module judges whether the number of identical content words is greater than or equal to the threshold TH_W; if so, the counter is incremented by one, i.e. CT1 = CT1 + 1, and the current positions of the sliding windows in the document to be identified and in the suspect material, together with the content of the windows, are recorded.
Step S5: the sliding window setting module slides the sliding window of the suspect material by one sliding step.
Step S6: the sliding window setting module judges whether the sliding window of the suspect material is at the document end position; if not, return to step S3; if so, go to step S11.
Step S11: the sliding window setting module judges whether the sliding window of the document to be identified is at the document end position; if not, go to step S12; if so, go to step S13.
Step S12: the sliding window setting module returns the sliding window of the suspect material to the document start position, slides the sliding window of the document to be identified by one sliding step, sets CT2 = CT2 + 1, and goes to step S3.
Step S13: the sliding window comparison module calculates the ratio M of the value of the similar-window counter CT1 to the value of the sliding-step counter CT2.
Step S14: the sliding window comparison module judges whether the ratio M is greater than or equal to the predetermined threshold TH_M; when M >= TH_M, the document to be identified is considered similar to the suspect material; when M < TH_M, the document to be identified is considered not similar to the suspect material.
Step S15: the sliding window comparison module judges whether any suspect material remains to be compared; if so, return to step S1; if not, go to step S16.
Step S16: the comparison report generation module generates and outputs a comparison report, which contains, for the document to be identified and all similar suspect materials, the value of the similar-window counter CT1, the value of the sliding-step counter CT2, the ratio of the two, and the specific positions and specific content of the similar portions of the document to be identified and the similar suspect materials.
Step S17: the comparison ends.
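Steps S1 to S14 for a single suspect material can be sketched as below. This is a hedged illustration rather than the patented implementation: the function names are hypothetical, documents are assumed to be lists of content-word IDs, windows are compared as multisets so that word order is ignored, and the handling of the case CT2 = 0 (a document that fits in one window) is an assumption, since the flow does not cover it. The defaults of 4 for the window size and 3 for TH_W follow the values this description suggests elsewhere.

```python
from collections import Counter


def windows(words, size, step):
    """Yield (start, window) pairs from the document start to the document end."""
    last_start = max(len(words) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, words[start:start + size]


def count_shared(win_a, win_b):
    """Number of identical content words in two windows, ignoring their order."""
    return sum((Counter(win_a) & Counter(win_b)).values())


def compare_documents(doc_words, material_words, th_m, size=4, step=1, th_w=3):
    ct1 = 0       # similar-window counter CT1
    ct2 = 0       # sliding-step counter CT2
    matches = []  # recorded (document position, material position) pairs
    for doc_start, doc_win in windows(doc_words, size, step):
        if doc_start > 0:
            ct2 += 1  # S12: the document window has moved one step
        for mat_start, mat_win in windows(material_words, size, step):
            if count_shared(doc_win, mat_win) >= th_w:  # S3 and S4
                ct1 += 1
                matches.append((doc_start, mat_start))
    # S13: ratio M; the fallback for ct2 == 0 is an assumption of this sketch.
    m = ct1 / ct2 if ct2 else float(ct1 > 0)
    return m >= th_m, m, matches  # S14


# Hypothetical content-word sequences (letters stand in for word IDs).
doc = ["a", "b", "c", "d", "e", "f"]
material = ["x", "a", "b", "c", "d", "y"]
similar, m, matches = compare_documents(doc, material, th_m=1.0)
```

Because CT1 counts every matching pair of windows while CT2 counts only the moves of the document window, M can exceed 1 when one document window matches several windows of the material.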
According to a specific embodiment of the present invention, in step S3 the sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspect material and counts the identical content words in them. In the ordinary plagiarism identification mode, identical content words are content words whose word IDs in the word library are the same. In the extended plagiarism identification mode, identical content words are content words whose word-group IDs in the word library are the same. In the multilingual plagiarism identification mode, identical content words are content words whose Chinese-foreign word-group IDs in the word library are the same.
According to a specific embodiment of the present invention, in step S16 the comparison report generation module outputs the comparison report, and the content of the report differs according to the identification mode. In the ordinary plagiarism identification mode, the comparison report contains the specific positions and specific content of the similar portions of the document to be identified and the similar suspect materials. In this case the document to be identified uses the same form of expression as the similar portion of the suspect material; that is, the wording is identical, possibly with only the order of individual words adjusted. If the document to be identified has rewritten the document it plagiarized, and the degree of rewriting is large, the ordinary plagiarism identification mode may fail to find the plagiarized document. In the extended plagiarism identification mode, the comparison report contains the specific positions and specific content of the similar portions of the document to be identified and the similar suspect materials. If the document to be identified has rewritten the plagiarized document using synonyms or near-synonyms, and the document structure has been changed only slightly, the extended plagiarism identification mode may still find the plagiarized document. In the multilingual plagiarism identification mode, the comparison report contains the specific positions and specific content of the similar portions of the document to be identified and the similar suspect materials. If the document to be identified has rewritten the plagiarized document by translation, and the document structure has been changed only slightly, the multilingual plagiarism identification mode may still find the plagiarized document.
According to a specific embodiment of the present invention, the sliding window being at the document start position means that the leftmost side of the sliding window coincides with the document start position; the sliding window being at the document end position means that the rightmost side of the sliding window coincides with the document end position.
According to prior operating tests of the system, a sliding window of four content words is a suitable size, although other sizes may be chosen as needed. During comparison, the sliding window slides by a step of one content word each time. When three or more content words in the sliding windows are identical during comparison (the order of the content words being disregarded), the current positions of the sliding windows in the document to be identified and in the suspect material, together with their content, are recorded.
The above is merely a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or adaptation made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the invention.

Claims (10)

1. A sliding-window document detection system, characterized by comprising:
a comparison database, for storing the materials that serve as comparison objects; the comparison database further comprises the following sub-libraries: a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library, and a poem library;
a word library, for storing words and their corresponding parts of speech; the word library assigns a unique number to each word, W_ID denoting the unique number of a given word in the word library; the part-of-speech categories stored in the word library are noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, and onomatopoeia;
a word segmentation module, for segmenting each material into words and saving the segmentation result to the comparison database; the word segmentation module compares the segmentation result with the parts of speech stored in the word library to determine the part of speech of each segmented word;
a word feature value generation module, which counts the number of times each word occurs in the corresponding material and generates, for each word, the word part-of-speech feature value WCCV = [W_ID, W_N, W_CHAR] and WCV = [W_ID, W_N], where W_ID denotes the unique number of the word in the word library, W_N denotes the total number of times the word occurs in the material, and W_CHAR denotes the part of speech of the word;
a word free vector dimension determining module, which determines the word free vector dimension WFV from the segmentation result of a material; the word free vector dimension WFV is equal to the number of distinct words obtained by segmenting the particular material;
a word reduced vector dimension generation module, for reducing the word free vector dimension WFV of each material to generate the word reduced vector dimension RWV;
a word feature vector generation module, for extracting from each material, according to the word reduced vector dimension RWV, the feature values corresponding to the word reduced vector dimension RWV to generate the word feature vector WVE_RWV;
WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]
where W_ID_i denotes the unique number of a word in the word library and W_N_i denotes the total number of times that word occurs in the material, this count serving as the feature value of the word;
A user access mode detection module, for prompting the user to upload the document to be identified;
A user detection mode judgment module, for judging the current user detection mode; when the mode is the ordinary plagiarism identification mode, a to-be-identified document word segmentation module segments the document to be identified to obtain segmentation results;
A to-be-identified document participle free vector dimension determination module, for determining the participle free vector dimension WFV_TBI from the segmentation results of the document to be identified;
A to-be-identified document participle reduced vector dimension generation module, for reducing the participle free vector dimension WFV_TBI of the document to be identified to generate the to-be-identified document participle reduced vector dimension RWV_TBI;
A to-be-identified document participle feature vector generation module, for extracting from each document to be identified, according to the reduced vector dimension RWV_TBI, the feature values corresponding to RWV_TBI and generating the to-be-identified document participle feature vector WVE_RWV_TBI, where
WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI]
where W_ID_i denotes the unique number of a participle in the participle library and W_N_i denotes the total number of times that participle occurs in the document to be identified, this count being used as the feature value of the participle;
When the user detection mode judgment module judges that the current user detection mode is the ordinary plagiarism identification mode and a similarity comparison is performed, the to-be-identified document participle feature vector generation module generates the participle feature vector WVE_RWV_TBI of the document to be identified, WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI], whose dimension is RWV_TBI; the participle feature vector generation module generates the participle feature vector WVE_RWV of a material in the comparison database, WVE_RWV=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]; the dimension RWV_TBI of the participle feature vector of the document to be identified equals the dimension RWV of the participle feature vector of the material;
A to-be-identified document feature vector adjustment module, for sorting the W_ID_i values of all feature values in the participle feature vector WVE_RWV_TBI in ascending or descending order and inserting the W_ID_i values that are missing relative to the numbering in the participle library, the feature value corresponding to each inserted participle number W_ID_i being 0, to obtain the expanded to-be-identified document participle feature vector WVE_RWV_TBI_EXT=[W_ID_TBI_EXT_1, W_N_TBI_EXT_1, ..., W_ID_TBI_EXT_i, W_N_TBI_EXT_i, ..., W_ID_TBI_EXT_RWV_TBI, W_N_TBI_EXT_RWV_TBI, ..., W_ID_W, W_N_W];
A material feature vector adjustment module, for sorting the W_ID_i values of all feature values in the participle feature vector WVE_RWV in ascending or descending order and inserting the W_ID_i values that are missing relative to the numbering in the participle library, the feature value corresponding to each inserted participle number W_ID_i being 0, to obtain the expanded participle feature vector WVE_RWV_EXT=[W_ID_EXT_1, W_N_EXT_1, ..., W_ID_EXT_i, W_N_EXT_i, ..., W_ID_EXT_RWV, W_N_EXT_RWV, ..., W_ID_W, W_N_W];
An ordinary plagiarism identification similarity calculation module, which calculates the similarity between the document to be identified and any material in the comparison database by the following formula:
$$\mathrm{Sim}(WVE\_RWV\_TBI,\ WVE\_RWV) = \mathrm{Sim}(WVE\_RWV\_TBI\_EXT,\ WVE\_RWV\_EXT) = \frac{2\sum_{i=1}^{w} W\_N_{TBI\_EXT\_i} \times W\_N_{EXT\_i}}{\sum_{i=1}^{w} W\_N_{TBI\_EXT\_i}^{2} + \sum_{i=1}^{w} W\_N_{EXT\_i}^{2} + \sum_{i=1}^{w} W\_N_{TBI\_EXT\_i}^{2} \times \sum_{i=1}^{w} W\_N_{EXT\_i}^{2}}$$
After the document to be identified has been compared with all materials, all suspect materials are extracted, and the document to be identified is further compared with each suspect material using a sliding window.
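The vector expansion performed by the two feature vector adjustment modules can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `expand`, the dict representation of a feature vector, the toy values, and the assumption that participle numbers run from 1 to W are all illustrative choices that do not come from the claims.

```python
def expand(features, lexicon_size):
    """Expand a sparse feature vector to one slot per participle number.

    features: dict mapping W_ID -> W_N for the participles kept after reduction.
    lexicon_size: number of entries W in the participle library; this sketch
        assumes W_IDs are numbered 1..W, which the claims do not state.
    Returns (W_ID, W_N) pairs in ascending W_ID order. W_IDs that are missing
    from `features` are inserted with a feature value of 0.
    """
    return [(w_id, features.get(w_id, 0)) for w_id in range(1, lexicon_size + 1)]


# Hypothetical reduced vectors, written as {W_ID: W_N}.
wve_rwv_tbi = {5: 3, 2: 1, 7: 4}   # document to be identified
wve_rwv = {2: 2, 3: 6, 7: 1}       # one material from the comparison database

tbi_ext = expand(wve_rwv_tbi, lexicon_size=8)   # WVE_RWV_TBI_EXT
mat_ext = expand(wve_rwv, lexicon_size=8)       # WVE_RWV_EXT

# After expansion, position i refers to the same W_ID in both vectors,
# so element-wise sums such as the product term of the formula line up.
dot = sum(a * b for (_, a), (_, b) in zip(tbi_ext, mat_ext))
print(tbi_ext)
print(dot)
```

The point of the expansion is alignment: the two reduced vectors generally keep different participles, and padding both to the full numbering with zeros lets the sums in the similarity formula run over a common index i.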
2. The sliding window document retrieval system according to claim 1, wherein the further comparison of the document to be identified with a suspect material using a sliding window comprises:
Step S0: start;
Step S1: the sliding window setting module initializes the similar window counter CT1=0 and the sliding step counter CT2=0;
Step S2: the sliding window setting module places the sliding windows of the document to be identified and of the suspect material at the start position of their respective documents;
Step S3: the sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspect material and counts the identical content-word participles in the two windows;
Step S4: the sliding window comparison module judges whether the number of identical content-word participles is greater than or equal to the threshold TH_W; if it is, the counter is incremented by one, i.e. CT1=CT1+1, and the current positions of the sliding windows of the document to be identified and of the suspect material, together with the content of the windows, are recorded;
Step S5: the sliding window setting module slides the sliding window of the suspect material by one sliding step;
Step S6: the sliding window setting module judges whether the sliding window of the suspect material is at the end position of the document; if it is not, the process returns to step S3; if it is, the process goes to step S11;
Step S11: the sliding window setting module judges whether the sliding window of the document to be identified is at the end position of the document; if it is not, the process goes to step S12; if it is, the process goes to step S13;
Step S12: the sliding window setting module returns the sliding window of the suspect material to the start position of the document; the sliding window of the document to be identified slides by one sliding step, CT2=CT2+1, and the process goes to step S3;
Step S13: the sliding window comparison module calculates the ratio M of the value of the similar window counter CT1 to the value of the sliding step counter CT2;
Step S14: the sliding window comparison module judges whether the ratio M is greater than or equal to the preset threshold TH_M; when M >= TH_M, the document to be identified is considered similar to the suspect material; when M < TH_M, the document to be identified is considered dissimilar to the suspect material;
Step S15: the sliding window comparison module judges whether any suspect material remains to be compared; if so, the process returns to step S1; if not, the process goes to step S16;
Step S16: the comparison report generation module generates and outputs a comparison report, which contains, for the document being identified and each similar suspect material, the value of the similar window counter CT1, the value of the sliding step counter CT2, the ratio of the two, and the specific positions and content of the similar portions of the document being identified and the similar suspect material;
Step S17: the comparison ends.
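The sliding window comparison of steps S1 to S14 can be sketched in Python as below. It is a sketch under several assumptions that are not in the claim: documents are given as lists of content-word participles (function words already removed), "identical content-word participles" is read as the set of words shared by the two windows, the jump-based flow is replaced by two nested loops over window positions, and the default window size, step and thresholds are arbitrary. The names `windows` and `compare` are hypothetical. The claim does not say what happens when CT2 is 0, so the `max(ct2, 1)` guard is this sketch's own choice.

```python
def windows(tokens, size, step):
    """Yield (start, window) pairs; always yields at least one window."""
    last_start = max(len(tokens) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, tokens[start:start + size]


def compare(query, material, size=4, step=2, th_w=2, th_m=0.5):
    ct1 = 0    # similar window counter CT1
    ct2 = 0    # sliding step counter CT2
    hits = []  # recorded window positions and shared content (step S4)
    for qi, (q_start, q_win) in enumerate(windows(query, size, step)):
        if qi > 0:
            ct2 += 1  # step S12: the query window slid by one step
        for m_start, m_win in windows(material, size, step):
            shared = set(q_win) & set(m_win)   # step S3
            if len(shared) >= th_w:            # step S4
                ct1 += 1
                hits.append((q_start, m_start, sorted(shared)))
    m = ct1 / max(ct2, 1)  # step S13; guard against CT2 == 0 is an assumption
    return {"CT1": ct1, "CT2": ct2, "M": m, "similar": m >= th_m, "hits": hits}


query = ["sliding", "window", "detects", "copied", "text", "quickly"]
material = ["copied", "text", "quickly", "appears", "here", "again"]
result = compare(query, material)
print(result)
```

In this toy run only the second query window overlaps the first material window by at least TH_W words, so CT1 and CT2 are both 1 and the ratio M is 1.0.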
3. The sliding window document retrieval system according to claim 1 or 2, wherein the document to be identified and the suspect material are compared in full text.
4. The sliding window document retrieval system according to any one of claims 1-3, wherein the participle reduced vector dimension generation module reduces the participle free vector dimension WFV by part-of-speech screening, the reduction proceeding as follows:
the feature values of the segmentation results are classified according to the part of speech of the corresponding participle into class A1 content-word feature values, class A2 content-word feature values, class B content-word feature values, class C content-word feature values, class D content-word feature values and class V function-word feature values;
the number of feature values in each class is counted: AMOUNT_A1 is the number of class A1 content-word feature values, AMOUNT_A2 is the number of class A2 content-word feature values, AMOUNT_B is the number of class B content-word feature values, AMOUNT_C is the number of class C content-word feature values, AMOUNT_D is the number of class D content-word feature values, and AMOUNT_V is the number of class V function-word feature values;
the value RWV_S_V = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is calculated from the participle reduced vector dimension RWV; if it is greater than 0, this reduction is exited; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_D = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D) is further calculated;
if RWV_S_D is greater than 0, a number of feature values equal to RWV_S_D is randomly extracted from the feature values counted by AMOUNT_V, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_C = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C) is further calculated;
if RWV_S_C is greater than 0, a number of feature values equal to RWV_S_C is randomly extracted from the feature values counted by AMOUNT_D, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_B = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B) is further calculated;
if RWV_S_B is greater than 0, a number of feature values equal to RWV_S_B is randomly extracted from the feature values counted by AMOUNT_C, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_A2 = RWV-(AMOUNT_A1+AMOUNT_A2) is further calculated;
if RWV_S_A2 is greater than 0, a number of feature values equal to RWV_S_A2 is randomly extracted from the feature values counted by AMOUNT_B, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_A1 = RWV-AMOUNT_A1 is further calculated;
if RWV_S_A1 is greater than 0, a number of feature values equal to RWV_S_A1 is randomly extracted from the feature values counted by AMOUNT_A2, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, a number of feature values equal to the reduced vector dimension RWV is randomly extracted from the feature values counted by AMOUNT_A1, and this reduction is complete.
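The part-of-speech screening above can be sketched in Python as a priority-ordered selection. This is one reading of the claim rather than a definitive implementation: it assumes that every class ranked above the one being sampled is kept in full, which the cascade of differences implies but does not spell out. The function name `reduce_features`, the dict layout and the toy data are hypothetical.

```python
import random

PRIORITY = ["A1", "A2", "B", "C", "D", "V"]  # highest priority first


def reduce_features(by_class, rwv, rng=None):
    """Pick exactly `rwv` feature values, preferring higher-priority classes.

    by_class: dict mapping class name -> list of (W_ID, W_N) feature values.
    Returns None when fewer than `rwv` feature values exist (RWV_S_V > 0),
    the case in which the claim exits the reduction.
    """
    rng = rng or random.Random()
    total = sum(len(by_class.get(c, [])) for c in PRIORITY)
    if rwv - total > 0:                  # RWV_S_V > 0
        return None
    kept = []
    for cls in PRIORITY:
        remaining = rwv - len(kept)
        if remaining == 0:
            break
        members = by_class.get(cls, [])
        if len(members) <= remaining:
            kept.extend(members)         # the whole class fits
        else:
            # Only part of this class fits: draw the remainder at random.
            kept.extend(rng.sample(members, remaining))
    return kept


by_class = {
    "A1": [(1, 4), (2, 3)],
    "A2": [(3, 2), (4, 2)],
    "B": [(5, 1), (6, 1), (7, 1)],
    "V": [(8, 9)],
}
kept = reduce_features(by_class, rwv=5, rng=random.Random(0))
print(kept)
```

With RWV=5 the sketch keeps both A1 and both A2 feature values and draws one of the three B feature values at random, which matches the branch RWV_S_A2 = 5-(2+2) = 1 > 0 of the claim.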
5. The sliding window document retrieval system according to claim 4, wherein, when the calculated value RWV_S_V = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the corresponding material is treated as a suspect material.
6. A sliding window document detection method, characterized by comprising:
A comparison database stores the materials that serve as comparison objects; the comparison database further comprises the following sub-libraries: a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library and a poem library;
A participle library stores participles and their parts of speech; the participle library assigns a unique number to each participle, and W_ID denotes the unique number of a given participle in the participle library; the part-of-speech categories stored in the participle library are noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, auxiliary word, interjection and onomatopoeia;
A word segmentation module segments each material and saves the segmentation results to the comparison database; the word segmentation module compares the segmentation results with the parts of speech stored in the participle library to determine the part of speech of each segmentation result;
A participle feature value generation module counts the number of times each participle occurs in the corresponding material and generates, for each participle, the part-of-speech feature value WCCV=[W_ID, W_N, W_CHAR] and the feature value WCV=[W_ID, W_N], where W_ID denotes the unique number of the participle in the participle library, W_N denotes the total number of times the participle occurs in the material, and W_CHAR denotes the part of speech of the participle;
A participle free vector dimension determination module determines the participle free vector dimension WFV from the segmentation results of the material; the participle free vector dimension WFV equals the number of distinct participles obtained after a given material is segmented;
A participle reduced vector dimension generation module reduces the participle free vector dimension WFV of each material to generate the participle reduced vector dimension RWV;
A participle feature vector generation module extracts from each material, according to the participle reduced vector dimension RWV, the feature values corresponding to RWV and generates the participle feature vector WVE_RWV;
WVE_RWV=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]
where W_ID_i denotes the unique number of a participle in the participle library and W_N_i denotes the total number of times that participle occurs in the material, this count being used as the feature value of the participle;
A user access mode detection module prompts the user to upload the document to be identified;
A user detection mode judgment module judges the current user detection mode; when the mode is the ordinary plagiarism identification mode, a to-be-identified document word segmentation module segments the document to be identified to obtain segmentation results;
A to-be-identified document participle free vector dimension determination module determines the participle free vector dimension WFV_TBI from the segmentation results of the document to be identified;
A to-be-identified document participle reduced vector dimension generation module reduces the participle free vector dimension WFV_TBI of the document to be identified to generate the to-be-identified document participle reduced vector dimension RWV_TBI;
A to-be-identified document participle feature vector generation module extracts from each document to be identified, according to the reduced vector dimension RWV_TBI, the feature values corresponding to RWV_TBI and generates the to-be-identified document participle feature vector WVE_RWV_TBI, where
WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI]
where W_ID_i denotes the unique number of a participle in the participle library and W_N_i denotes the total number of times that participle occurs in the document to be identified, this count being used as the feature value of the participle;
When the user detection mode judgment module judges that the current user detection mode is the ordinary plagiarism identification mode and a similarity comparison is performed, the to-be-identified document participle feature vector generation module generates the participle feature vector WVE_RWV_TBI of the document to be identified, WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI], whose dimension is RWV_TBI; the participle feature vector generation module generates the participle feature vector WVE_RWV of a material in the comparison database, WVE_RWV=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]; the dimension RWV_TBI of the participle feature vector of the document to be identified equals the dimension RWV of the participle feature vector of the material;
A to-be-identified document feature vector adjustment module sorts the W_ID_i values of all feature values in the participle feature vector WVE_RWV_TBI in ascending or descending order and inserts the W_ID_i values that are missing relative to the numbering in the participle library, the feature value corresponding to each inserted participle number W_ID_i being 0, to obtain the expanded to-be-identified document participle feature vector WVE_RWV_TBI_EXT=[W_ID_TBI_EXT_1, W_N_TBI_EXT_1, ..., W_ID_TBI_EXT_i, W_N_TBI_EXT_i, ..., W_ID_TBI_EXT_RWV_TBI, W_N_TBI_EXT_RWV_TBI, ..., W_ID_W, W_N_W];
A material feature vector adjustment module sorts the W_ID_i values of all feature values in the participle feature vector WVE_RWV in ascending or descending order and inserts the W_ID_i values that are missing relative to the numbering in the participle library, the feature value corresponding to each inserted participle number W_ID_i being 0, to obtain the expanded participle feature vector WVE_RWV_EXT=[W_ID_EXT_1, W_N_EXT_1, ..., W_ID_EXT_i, W_N_EXT_i, ..., W_ID_EXT_RWV, W_N_EXT_RWV, ..., W_ID_W, W_N_W];
An ordinary plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database by the following formula:
$$\mathrm{Sim}(WVE\_RWV\_TBI,\ WVE\_RWV) = \mathrm{Sim}(WVE\_RWV\_TBI\_EXT,\ WVE\_RWV\_EXT) = \frac{2\sum_{i=1}^{w} W\_N_{TBI\_EXT\_i} \times W\_N_{EXT\_i}}{\sum_{i=1}^{w} W\_N_{TBI\_EXT\_i}^{2} + \sum_{i=1}^{w} W\_N_{EXT\_i}^{2} + \sum_{i=1}^{w} W\_N_{TBI\_EXT\_i}^{2} \times \sum_{i=1}^{w} W\_N_{EXT\_i}^{2}}$$
After the document to be identified has been compared with all materials, all suspect materials are extracted, and the document to be identified is further compared with each suspect material using a sliding window.
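The feature value generation steps of this method (WCCV, WCV, WFV and the flattened feature vector) can be illustrated with a short Python sketch. Everything concrete here is assumed for illustration: the toy participle library `LEXICON`, its numbering, the English tokens, and the function names `feature_values` and `flatten`. The sketch also assumes every token is present in the library and skips the reduction step, so the flattened vector corresponds to the case RWV = WFV.

```python
from collections import Counter

# Hypothetical participle library: participle -> (W_ID, part of speech).
LEXICON = {
    "window": (1, "noun"),
    "slides": (2, "verb"),
    "over": (3, "preposition"),
    "text": (4, "noun"),
}


def feature_values(tokens, lexicon):
    """Return (WCCV list, WCV list, WFV) for one segmented material."""
    counts = Counter(tokens)  # W_N for each distinct participle
    # WCCV = [W_ID, W_N, W_CHAR], ordered by W_ID.
    wccv = sorted([lexicon[w][0], n, lexicon[w][1]] for w, n in counts.items())
    # WCV = [W_ID, W_N], the same values without the part of speech.
    wcv = [[w_id, n] for w_id, n, _ in wccv]
    wfv = len(counts)  # number of distinct participles
    return wccv, wcv, wfv


def flatten(wcv):
    """Lay out [W_ID_1, W_N_1, ..., W_ID_n, W_N_n] as in WVE_RWV."""
    return [x for pair in wcv for x in pair]


tokens = ["window", "slides", "over", "text", "window", "slides", "window"]
wccv, wcv, wfv = feature_values(tokens, LEXICON)
wve = flatten(wcv)
print(wccv)
print(wfv, wve)
```

For the seven tokens above there are four distinct participles, so WFV is 4 and the flattened vector alternates participle numbers and counts.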
7. The sliding window document detection method according to claim 6, wherein the further comparison of the document to be identified with a suspect material using a sliding window comprises:
Step S0: start;
Step S1: the sliding window setting module initializes the similar window counter CT1=0 and the sliding step counter CT2=0;
Step S2: the sliding window setting module places the sliding windows of the document to be identified and of the suspect material at the start position of their respective documents;
Step S3: the sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspect material and counts the identical content-word participles in the two windows;
Step S4: the sliding window comparison module judges whether the number of identical content-word participles is greater than or equal to the threshold TH_W; if it is, the counter is incremented by one, i.e. CT1=CT1+1, and the current positions of the sliding windows of the document to be identified and of the suspect material, together with the content of the windows, are recorded;
Step S5: the sliding window setting module slides the sliding window of the suspect material by one sliding step;
Step S6: the sliding window setting module judges whether the sliding window of the suspect material is at the end position of the document; if it is not, the process returns to step S3; if it is, the process goes to step S11;
Step S11: the sliding window setting module judges whether the sliding window of the document to be identified is at the end position of the document; if it is not, the process goes to step S12; if it is, the process goes to step S13;
Step S12: the sliding window setting module returns the sliding window of the suspect material to the start position of the document; the sliding window of the document to be identified slides by one sliding step, CT2=CT2+1, and the process goes to step S3;
Step S13: the sliding window comparison module calculates the ratio M of the value of the similar window counter CT1 to the value of the sliding step counter CT2;
Step S14: the sliding window comparison module judges whether the ratio M is greater than or equal to the preset threshold TH_M; when M >= TH_M, the document to be identified is considered similar to the suspect material; when M < TH_M, the document to be identified is considered dissimilar to the suspect material;
Step S15: the sliding window comparison module judges whether any suspect material remains to be compared; if so, the process returns to step S1; if not, the process goes to step S16;
Step S16: the comparison report generation module generates and outputs a comparison report, which contains, for the document being identified and each similar suspect material, the value of the similar window counter CT1, the value of the sliding step counter CT2, the ratio of the two, and the specific positions and content of the similar portions of the document being identified and the similar suspect material;
Step S17: the comparison ends.
8. The sliding window document detection method according to claim 6 or 7, wherein the document to be identified and the suspect material are compared in full text.
9. The sliding window document detection method according to any one of claims 6-8, wherein the participle reduced vector dimension generation module reduces the participle free vector dimension WFV by part-of-speech screening, the reduction proceeding as follows:
the feature values of the segmentation results are classified according to the part of speech of the corresponding participle into class A1 content-word feature values, class A2 content-word feature values, class B content-word feature values, class C content-word feature values, class D content-word feature values and class V function-word feature values;
the number of feature values in each class is counted: AMOUNT_A1 is the number of class A1 content-word feature values, AMOUNT_A2 is the number of class A2 content-word feature values, AMOUNT_B is the number of class B content-word feature values, AMOUNT_C is the number of class C content-word feature values, AMOUNT_D is the number of class D content-word feature values, and AMOUNT_V is the number of class V function-word feature values;
the value RWV_S_V = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is calculated from the participle reduced vector dimension RWV; if it is greater than 0, this reduction is exited; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_D = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D) is further calculated;
if RWV_S_D is greater than 0, a number of feature values equal to RWV_S_D is randomly extracted from the feature values counted by AMOUNT_V, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_C = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C) is further calculated;
if RWV_S_C is greater than 0, a number of feature values equal to RWV_S_C is randomly extracted from the feature values counted by AMOUNT_D, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_B = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B) is further calculated;
if RWV_S_B is greater than 0, a number of feature values equal to RWV_S_B is randomly extracted from the feature values counted by AMOUNT_C, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_A2 = RWV-(AMOUNT_A1+AMOUNT_A2) is further calculated;
if RWV_S_A2 is greater than 0, a number of feature values equal to RWV_S_A2 is randomly extracted from the feature values counted by AMOUNT_B, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, the value RWV_S_A1 = RWV-AMOUNT_A1 is further calculated;
if RWV_S_A1 is greater than 0, a number of feature values equal to RWV_S_A1 is randomly extracted from the feature values counted by AMOUNT_A2, and this reduction is complete; if it equals 0, this reduction is complete; if it is less than 0, a number of feature values equal to the reduced vector dimension RWV is randomly extracted from the feature values counted by AMOUNT_A1, and this reduction is complete.
10. The sliding window document detection method according to claim 9, wherein, when the calculated value RWV_S_V = RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the corresponding material is treated as a suspect material.
CN201610020696.3A 2016-01-13 2016-01-13 Sliding window document detection method and system Active CN105701086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610020696.3A CN105701086B (en) 2016-01-13 2016-01-13 Sliding window document detection method and system

Publications (2)

Publication Number Publication Date
CN105701086A true CN105701086A (en) 2016-06-22
CN105701086B CN105701086B (en) 2018-06-01

Family

ID=56226264

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103226546A (en) * 2013-04-15 2013-07-31 北京邮电大学 Suffix tree clustering method on basis of word segmentation and part-of-speech analysis
JP2014026455A (en) * 2012-07-26 2014-02-06 Nippon Telegr & Teleph Corp <Ntt> Media data analysis device, method and program
CN104899304A (en) * 2015-06-12 2015-09-09 北京京东尚科信息技术有限公司 Named entity identification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
卓可秋 等: "一种基于Spark的论文相似性快速检测方法", 《图书情报工作》 *
赵春燕 等: "论文抄袭检测技术研究", 《科教导刊》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034717A (en) * 2018-06-05 2018-12-18 王振 The method of mark string bid behavior is enclosed in a kind of identification bidding process
CN109344403A (en) * 2018-09-20 2019-02-15 中南大学 A kind of document representation method of enhancing semantic feature insertion
CN112417876A (en) * 2020-11-23 2021-02-26 北京乐学帮网络技术有限公司 Text processing method and device, computer equipment and storage medium
CN112949298A (en) * 2021-02-26 2021-06-11 维沃移动通信有限公司 Word segmentation method and device, electronic equipment and readable storage medium
CN112949298B (en) * 2021-02-26 2022-10-04 维沃移动通信有限公司 Word segmentation method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant