CN105677641B - Paper self-checking method and system - Google Patents

Info

Publication number
CN105677641B
Authority
CN
China
Prior art keywords
user
word
test
participle
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610021493.6A
Other languages
Chinese (zh)
Other versions
CN105677641A (en)
Inventor
夏峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201610021493.6A
Publication of CN105677641A
Application granted
Publication of CN105677641B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation

Abstract

The invention provides a paper self-checking method and system. A user writing-style similarity calculation module calculates the current user's writing-style similarity, and a user writing-style similarity judgment module compares the current user's writing-style similarity SimT(USER) with a self-audit threshold preset by the system. When SimT(USER) is higher than the self-audit threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when SimT(USER) is lower than the self-audit threshold, the pending document is considered consistent with the user's writing style.

Description

Paper self-checking method and system
Technical field
The invention belongs to the field of text detection, and in particular relates to a paper self-checking method and system.
Background
Paper plagiarism detection refers to judging whether a given paper is suspected of plagiarizing the text content of one or more other documents. Plagiarism is not simply copying, however: the text of other documents may also be plagiarized through semantic transformation, synonym substitution, translation of foreign-language documents, and other means.
At present there are mainly two paper plagiarism detection techniques: fingerprint-based detection, and detection based on word-frequency statistics over the paragraphs of a text. Fingerprint-based detection extracts data feature strings, called fingerprints, from the submitted source text and judges from the rate of identical fingerprints whether a document has plagiarized other documents. Paragraph word-frequency detection segments the submitted text into words, counts the frequency of occurrence within each paragraph of the text, and, after a threshold is set, compares each array of the text under test with each array of the query text; plagiarism is then judged from the resulting indicators. These prior-art methods suffer to some degree from low recognition rates and low efficiency.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a paper self-checking method and system.
A user writing-style similarity calculation module calculates the current user's writing-style similarity, and a user writing-style similarity judgment module compares the current user's writing-style similarity SimT(USER) with a self-audit threshold preset by the system. When SimT(USER) is higher than the self-audit threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when SimT(USER) is lower than the self-audit threshold, the pending document is considered consistent with the user's writing style.
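The threshold comparison described above can be sketched as follows. This is only an illustration of the stated rule: the function name and the handling of the case where the two values are equal (which the text does not specify) are assumptions.

```python
def is_style_consistent(sim_t_user: float, self_audit_threshold: float) -> bool:
    """Return True when the pending document is judged consistent with the user's style.

    Follows the rule as stated in the text: a value above the threshold means
    inconsistent, a value below it means consistent. The text does not say what
    happens at equality; treating equality as consistent is an assumption.
    """
    return sim_t_user <= self_audit_threshold


# Example values are made up for illustration.
print(is_style_consistent(0.2, 0.5))  # below the threshold: consistent
print(is_style_consistent(0.8, 0.5))  # above the threshold: inconsistent
```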
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and practiced according to the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 shows the block diagram of paper self-checking system according to an embodiment of the invention;
Fig. 2 shows sliding window detection method according to an embodiment of the invention.
Detailed description of the embodiments
To further explain the technical means adopted by the invention to achieve its intended purpose, and their effects, specific embodiments, features, and effects of the system and method proposed by the invention are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different occurrences of "an embodiment" do not necessarily refer to the same embodiment. In addition, particular features, structures, or characteristics of one or more embodiments may be combined in any suitable form.
As shown in Fig. 1, the paper self-checking system of the invention (hereinafter "the system") comprises a material subsystem, a user subsystem, a suspected-material extraction subsystem, and a comparison subsystem. The material subsystem prepares the material used for comparison in plagiarism detection. The user subsystem manages user login information and determines the user's writing style. The suspected-material extraction subsystem extracts from the comparison database the material suspected of matching the document to be identified. The comparison subsystem compares the suspected material with the document to be identified and generates a comparison report.
According to a specific embodiment of the invention, the material subsystem may further include one or more of: a comparison database; a segmentation lexicon, which includes a synonym and near-synonym library and a Chinese-foreign synonym library; a word segmentation module; a word-segment group module; a Chinese-foreign word-segment group module; a word-segment part-of-speech classification module; a word-segment group part-of-speech classification module; a Chinese-foreign word-segment group part-of-speech classification module; a word-segment feature value generation module; a word-segment group feature value generation module; a Chinese-foreign word-segment group feature value generation module; a word-segment tightness coefficient generation module; a word-segment group tightness coefficient generation module; a Chinese-foreign word-segment group tightness coefficient generation module; a word-segment tightness coefficient feature vector generation module; a word-segment group tightness coefficient feature vector generation module; a Chinese-foreign word-segment group tightness coefficient feature vector generation module; a word-segment free vector dimension determination module; a word-segment group free vector dimension determination module; a Chinese-foreign word-segment group free vector dimension determination module; a word-segment reduced vector dimension generation module; a word-segment group reduced vector dimension generation module; a Chinese-foreign word-segment group reduced vector dimension generation module; a word-segment feature vector generation module; a word-segment group feature vector generation module; and a Chinese-foreign word-segment group feature vector generation module.
According to a specific embodiment of the invention, the user subsystem may further include one or more of: a user access mode detection module; a user detection mode determination module; a user writing-style test module; a test picture text-description feature value generation module; a test article text-description feature value generation module; a test picture text-description feature vector generation module; a test article text-description feature vector generation module; a test picture reference feature vector generation module; a test article reference feature vector generation module; a user test picture text-description feature value generation module; a user test picture text-description feature vector generation module; a user picture writing-style feature vector generation module; a user test article text-description feature value generation module; a user test article text-description feature vector generation module; a user article writing-style feature vector generation module; a user writing-style feature vector generation module; a pending-document feature value generation module; a pending-document feature-value feature vector generation module; a user writing-style similarity calculation module; a user writing-style judgment module; and a user writing-style structural auxiliary word judgment module.
According to a specific embodiment of the invention, the suspected-material extraction subsystem may further include one or more of the following modules, where "document" means the document to be identified: a document word segmentation module; a document word-segment group module; a document Chinese-foreign word-segment group module; a document word-segment part-of-speech classification module; a document word-segment group part-of-speech classification module; a document Chinese-foreign word-segment group part-of-speech classification module; a document word-segment feature value generation module; a document word-segment group feature value generation module; a document Chinese-foreign word-segment group feature value generation module; a document word-segment tightness coefficient generation module; a document word-segment group tightness coefficient generation module; a document Chinese-foreign word-segment group tightness coefficient generation module; a document word-segment tightness coefficient feature vector generation module; a document word-segment group tightness coefficient feature vector generation module; a document Chinese-foreign word-segment group tightness coefficient feature vector generation module; a document word-segment free vector dimension determination module; a document word-segment group free vector dimension determination module; a document Chinese-foreign word-segment group free vector dimension determination module; a document word-segment reduced vector dimension generation module; a document word-segment group reduced vector dimension generation module; a document Chinese-foreign word-segment group reduced vector dimension generation module; a document word-segment feature vector generation module; a document word-segment group feature vector generation module; a document Chinese-foreign word-segment group feature vector generation module; a document feature vector adjustment module; a material feature vector adjustment module; an ordinary plagiarism identification similarity calculation module; an extended plagiarism identification similarity calculation module; a multilingual plagiarism identification similarity calculation module; a document tightness coefficient statistics module; a material tightness coefficient statistics module; a formula extraction module; a formula decomposition module; and a tightness-coefficient suspected-material extraction module.
According to a specific embodiment of the invention, the comparison subsystem may further include a sliding-window setting module, a sliding-window comparison module, and a comparison report generation module.
According to a specific embodiment of the invention, the system includes a comparison database, which holds the material used as comparison objects. The comparison database further comprises sub-libraries such as a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library, and a poetry library. The book library holds publicly published books; the paper library holds journal articles, conference papers, degree theses, and the like; the patent library holds published patent documents and the like. When material is included, its source must also be recorded: for a book, the publication date, publisher, author, book number, and so on; for a journal article, the publication date, journal name, issue, author, and so on; for a conference paper, the conference name, venue, date, author, and so on; for a degree thesis, the school, graduation date, degree level, author, and so on. From the recorded source information, a person skilled in the art can uniquely locate the material. Preferably, the material in the comparison database is not limited to Chinese material and also includes foreign-language material. Once established, the comparison database must be maintained, periodically or irregularly, by adding new books, journal articles, conference papers, degree theses, published patent documents, and the like. The proverb and common-saying library holds sentences, phrases, and similar material that circulate widely on the network or among the public. The famous-quotation library holds quotations from well-known figures, and the poetry library holds poems, ci, songs, fu, and similar material. The purpose of establishing these additional libraries in the comparison database is to extend the range of comparison material beyond traditional books, papers, and patent documents, making plagiarism detection more comprehensive. Those skilled in the art will appreciate that the comparison database may also include other kinds of material, which are not described further here.
Preferably, when material is included, the comparison database classifies it by technical field. According to a specific embodiment of the invention, the field identifier may use the classes of the Chinese Library Classification, which has 5 basic categories and 22 major classes and uses a mixed notation combining Hanyu Pinyin letters with Arabic numerals: one letter denotes a major class, the alphabetical order reflects the order of the major classes, and digits follow the letter. For example, A1 denotes the works of Marx and Engels, K6 denotes the history of Oceania, and TN denotes electronic technology and communication technology. To accommodate the development of industrial technology, the second-level classes of industrial technology use two letters. Those skilled in the art will appreciate that other classification systems may also be used to identify the field of a material.
Preferably, when material is included, the comparison database indexes it separately by title, author, abstract, and body text. An association is established among the title, author, abstract, and body of each material, so that the remaining parts of the same material can be obtained from any one of them.
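A minimal sketch of such per-field indexing is given below, assuming each material is a record with `title`, `author`, `abstract`, and `body` keys. The class name `MaterialIndex`, the material identifiers, and the exact-match lookup are illustrative assumptions; a real system would more likely use a full-text index for the body.

```python
from collections import defaultdict

FIELDS = ("title", "author", "abstract", "body")


class MaterialIndex:
    """Index materials separately by each field, keeping the parts associated."""

    def __init__(self):
        self.materials = {}  # material_id -> full record
        self.by_field = {f: defaultdict(set) for f in FIELDS}

    def add(self, material_id, record):
        # Store the whole record once, then index every field back to its id.
        self.materials[material_id] = record
        for f in FIELDS:
            self.by_field[f][record[f]].add(material_id)

    def lookup(self, field_name, value):
        # Any one part leads back to the complete record(s) it belongs to.
        ids = self.by_field[field_name].get(value, set())
        return [self.materials[i] for i in sorted(ids)]


index = MaterialIndex()
index.add("m1", {"title": "On Text Reuse", "author": "Li",
                 "abstract": "A short abstract.", "body": "Full text."})
print(index.lookup("author", "Li")[0]["title"])
```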
Preferably, when material is included, the comparison database extracts and copies the formulas that appear in the material and stores them separately in a formula library. Each formula in the formula library is associated with the material from which it was extracted, so that the full text of the corresponding material can be obtained from the formula. According to a specific embodiment of the invention, when a formula is included, its independent variable parameters, dependent variable parameters, and operators are extracted and saved separately. According to a specific embodiment of the invention, after the independent and dependent variable parameters of the formula are extracted, the specific meaning, dimension, and value range of each parameter are further extracted and saved separately. According to a specific embodiment of the invention, after the operators of the formula are extracted, each operator is further given a textual annotation in Chinese and in foreign languages. For each formula it holds, the formula library stores the symbolic expression of each independent and dependent variable parameter, the Chinese and foreign-language statements of the specific meaning, dimension, and value range of each independent and dependent variable, and the operators together with their Chinese and foreign-language textual annotations. The purpose of establishing a formula library in the comparison database is to extend the range of comparison material to formula comparison, making plagiarism detection more comprehensive. Those skilled in the art will appreciate that the comparison database may also extract other content from the material, such as chemical formulas and gene sequences, which is not described further here.
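One possible shape for a formula-library entry is sketched below. The class and field names (`Parameter`, `FormulaRecord`, `material_id`, and so on) are hypothetical and chosen only to mirror the items the text says are stored; the patent does not prescribe a storage layout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Parameter:
    symbol: str       # symbolic expression of the parameter, e.g. "F"
    meaning: str      # specific meaning (Chinese and/or foreign-language text)
    dimension: str    # physical dimension or unit
    value_range: str  # admissible range of values


@dataclass
class FormulaRecord:
    material_id: str               # link back to the source material
    independent: List[Parameter]   # independent variable parameters
    dependent: List[Parameter]     # dependent variable parameters
    operators: Dict[str, str]      # operator symbol -> textual annotation


# Illustrative entry for F = m * a; all values are made up for the example.
record = FormulaRecord(
    material_id="m1",
    independent=[Parameter("m", "mass", "kg", "m > 0"),
                 Parameter("a", "acceleration", "m/s^2", "any real")],
    dependent=[Parameter("F", "force", "N", "any real")],
    operators={"*": "multiplication", "=": "equals"},
)
print(record.material_id, [p.symbol for p in record.independent])
```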
According to a specific embodiment of the invention, the comparison database is stored at different sites in a distributed manner, and the site to access can be chosen according to the load on each site. Each site counts the amount of material extracted from the comparison database during the current unit time period, where the amount may be the number of materials or their byte count, and from this obtains its average load. Each site periodically reports its average load to the suspected-material extraction subsystem. When the suspected-material extraction subsystem needs to extract material from the comparison database in order to select suspected material, it accesses the site with the lowest average load according to the most recent reports. The unit time period is configured by the system and may be set to 5, 10, 30, or 60 minutes as actually required. According to a specific embodiment of the invention, different sub-libraries of the comparison database may be stored at different sites in a distributed manner, and the comparison database is then accessed at the site where each sub-library is stored. When the suspected-material extraction subsystem needs to extract material from the comparison database in order to select suspected material, it selects the sub-library to access according to the technical field or type of the material to be extracted.
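The site-selection rule above amounts to picking the minimum over the most recently reported average loads. The sketch below assumes a hypothetical `SiteReport` record and made-up site names; how reports are transported and refreshed is outside its scope.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteReport:
    site: str
    average_load: float  # materials (or bytes) extracted per unit time period


def pick_site(latest_reports: List[SiteReport]) -> str:
    """Return the site whose most recently reported average load is lowest."""
    if not latest_reports:
        raise ValueError("no site has reported its load yet")
    return min(latest_reports, key=lambda r: r.average_load).site


reports = [SiteReport("site-a", 120.0), SiteReport("site-b", 45.5),
           SiteReport("site-c", 80.0)]
print(pick_site(reports))
```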
According to a specific embodiment of the invention, the system includes a segmentation lexicon, which holds word segments and their parts of speech. The segmentation lexicon is preset by the system and maintained periodically, for example by adding new words. Preferably, each word segment in the lexicon is given a unique number; W_ID may denote the unique number of a word segment in the segmentation lexicon. The lexicon stores the part of speech of each word segment, such as noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, or onomatopoeia. According to a specific embodiment of the invention, segmentation results are divided by part of speech into notional words and function words: notional words comprise nouns, verbs, adjectives, numerals, measure words, and pronouns; function words comprise adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Preferably, the segmentation lexicon further includes a synonym and near-synonym library, in which word segments with the same or similar meaning form a group and numbering is done per group. Several word segments with the same or similar meaning thus correspond to one word-segment group number; WG_ID may denote the unique number of a word-segment group in the segmentation lexicon. Preferably, the segmentation lexicon further includes a Chinese-foreign synonym and near-synonym library, in which Chinese and foreign-language word segments with the same or similar meaning form a group and numbering is done per group. Several Chinese and foreign-language word segments with the same or similar meaning thus correspond to one Chinese-foreign word-segment group number; WFG_ID may denote the unique number of a Chinese-foreign word-segment group in the segmentation lexicon.
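As an illustration of the numbering scheme, the sketch below stores a W_ID, a part of speech, and a WG_ID for each word segment. The entries, the numeric values, and the tuple layout are invented for the example; only the identifiers W_ID and WG_ID and the notional/function split come from the text.

```python
NOTIONAL_POS = {"noun", "verb", "adjective", "numeral", "measure word", "pronoun"}
FUNCTION_POS = {"adverb", "preposition", "conjunction", "auxiliary word",
                "interjection", "onomatopoeia"}

# Hypothetical lexicon entries: word segment -> (W_ID, part of speech, WG_ID)
LEXICON = {
    "quick": (1, "adjective", 100),
    "fast": (2, "adjective", 100),   # same WG_ID: same or similar meaning
    "and": (3, "conjunction", 101),
}


def word_kind(word: str) -> str:
    """Classify a word segment as a notional word or a function word."""
    _, pos, _ = LEXICON[word]
    if pos in NOTIONAL_POS:
        return "notional"
    if pos in FUNCTION_POS:
        return "function"
    raise ValueError(f"unknown part of speech: {pos}")


def same_group(a: str, b: str) -> bool:
    """True when two word segments share a word-segment group number."""
    return LEXICON[a][2] == LEXICON[b][2]


print(word_kind("quick"), word_kind("and"), same_group("quick", "fast"))
```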
According to a specific embodiment of the invention, the system includes a word segmentation module, which segments each material and saves the segmentation results in the comparison database. Preferably, the word segmentation module compares the segmentation results with the parts of speech stored in the segmentation lexicon to determine the part of speech of each result. Preferably, the word-segment part-of-speech classification module classifies the segmentation results according to their parts of speech.
According to a specific embodiment of the invention, the system includes a word-segment group module, which segments each material and saves the word-segment group results in the comparison database. Preferably, the word-segment group module compares the segmentation results with the parts of speech stored in the segmentation lexicon to determine the part of speech of each word-segment group result. Preferably, the word-segment group part-of-speech classification module classifies the word-segment group results according to their parts of speech.
According to a specific embodiment of the invention, the system includes a Chinese-foreign word-segment group module, which segments each material and saves the Chinese-foreign word-segment group results in the comparison database. Preferably, the Chinese-foreign word-segment group module compares the Chinese-foreign segmentation results with the parts of speech stored in the segmentation lexicon to determine the part of speech of each Chinese-foreign word-segment group result. Preferably, the Chinese-foreign word-segment group part-of-speech classification module classifies the Chinese-foreign word-segment group results according to their parts of speech.
According to a specific embodiment of the invention, the word-segment part-of-speech classification module, the word-segment group part-of-speech classification module, and the Chinese-foreign word-segment group part-of-speech classification module divide the word segments, word-segment groups, and Chinese-foreign word-segment groups, respectively, by part of speech into class A notional words, class B notional words, class C notional words, class D notional words, and class V function words. Class A comprises nouns; class B comprises verbs and adjectives; class C comprises numerals and measure words; class D comprises pronouns; class V comprises adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Preferably, the segmentation lexicon further divides nouns into technical terms and common nouns. According to a specific embodiment of the invention, segmentation results are then divided by part of speech into class A1, A2, B, C, and D notional words and class V function words, where class A1 comprises technical-term nouns and class A2 comprises common nouns, and classes B, C, D, and V are as above. Those skilled in the art may choose a different classification scheme as actually required.
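The two classification schemes can be written as a small lookup, sketched below. The English part-of-speech labels and the `is_technical_term` and `split_nouns` parameters are assumptions made for the example; the class letters follow the text.

```python
POS_TO_CLASS = {
    "verb": "B", "adjective": "B",
    "numeral": "C", "measure word": "C",
    "pronoun": "D",
    "adverb": "V", "preposition": "V", "conjunction": "V",
    "auxiliary word": "V", "interjection": "V", "onomatopoeia": "V",
}


def classify(pos: str, is_technical_term: bool = False,
             split_nouns: bool = True) -> str:
    """Map a part of speech to class A/A1/A2/B/C/D (notional) or V (function).

    With split_nouns=False the first scheme is used (all nouns are class A);
    with split_nouns=True nouns are split into A1 (technical) and A2 (common).
    """
    if pos == "noun":
        if not split_nouns:
            return "A"
        return "A1" if is_technical_term else "A2"
    return POS_TO_CLASS[pos]


print(classify("noun", is_technical_term=True))  # A1
print(classify("noun", split_nouns=False))       # A
print(classify("measure word"))                  # C
```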
According to a specific embodiment of the invention, the word-segment feature value generation module counts how many times each word segment occurs in the corresponding material and generates for each word segment a feature value WCV=[W_ID, W_N], where W_ID is the unique number of the word segment in the segmentation lexicon and W_N is the total number of times the word segment occurs in the material. Preferably, taking the part of speech of each word segment into account, the word-segment feature value generation module generates a word-segment part-of-speech feature value WCCV=[W_ID, W_N, W_CHAR], where W_CHAR is the part of speech of the word segment.
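A minimal sketch of generating WCV and WCCV from an already segmented material follows. The function name, the lexicon layout, and the choice to sort the results by W_ID are assumptions; the text only defines the contents of each feature value.

```python
from collections import Counter


def feature_values(segments, lexicon):
    """Build WCV=[W_ID, W_N] and WCCV=[W_ID, W_N, W_CHAR] for one material.

    `segments` is the segmented material as a list of word segments;
    `lexicon` maps each word segment to a hypothetical (W_ID, W_CHAR) pair.
    """
    counts = Counter(segments)  # W_N for every distinct word segment
    wcv, wccv = [], []
    for word, w_n in counts.items():
        w_id, w_char = lexicon[word]
        wcv.append([w_id, w_n])
        wccv.append([w_id, w_n, w_char])
    # Sorting by W_ID is only for a stable, readable output.
    return sorted(wcv), sorted(wccv)


lexicon = {"cat": (7, "noun"), "runs": (3, "verb")}
wcv, wccv = feature_values(["cat", "runs", "cat"], lexicon)
print(wcv)
print(wccv)
```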
According to a specific embodiment of the invention, the word-segment group feature value generation module counts how many times each word-segment group occurs in the corresponding material and generates for each word-segment group a feature value WGCV=[WG_ID, WG_N], where WG_ID is the unique number of the word-segment group in the segmentation lexicon and WG_N is the total number of times the word-segment group occurs in the material. Preferably, taking the part of speech of each word-segment group into account, the word-segment group feature value generation module generates a word-segment group part-of-speech feature value WGCCV=[WG_ID, WG_N, WG_CHAR], where WG_CHAR is the part of speech of the word-segment group.
According to a specific embodiment of the invention, the Chinese-foreign word-segment group feature value generation module counts how many times each Chinese-foreign word-segment group occurs in the corresponding material and generates for each Chinese-foreign word-segment group a feature value WFGCV=[WFG_ID, WFG_N], where WFG_ID is the unique number of the Chinese-foreign word-segment group in the segmentation lexicon and WFG_N is the total number of times the Chinese-foreign word-segment group occurs in the material. Preferably, taking the part of speech of each Chinese-foreign word-segment group into account, the module generates a Chinese-foreign word-segment group part-of-speech feature value WFGCCV=[WFG_ID, WFG_N, WFG_CHAR], where WFG_CHAR is the part of speech of the Chinese-foreign word-segment group.
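Counting by group rather than by individual word segment means that synonyms, including synonyms in different languages, add to the same total. The sketch below illustrates this for WFGCV; the `group_of` mapping and its WFG_ID values are invented for the example.

```python
from collections import Counter


def group_feature_values(segments, group_of):
    """Build WFGCV=[WFG_ID, WFG_N] for every group that occurs in the material.

    `group_of` maps a word segment (Chinese or foreign-language) to its
    hypothetical WFG_ID, so cross-language synonyms are counted together.
    """
    counts = Counter(group_of[s] for s in segments)
    return sorted([wfg_id, wfg_n] for wfg_id, wfg_n in counts.items())


group_of = {"快速": 500, "fast": 500, "quick": 500, "算法": 501, "algorithm": 501}
print(group_feature_values(["快速", "fast", "算法", "quick"], group_of))
```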
According to a specific embodiment of the invention, the word-segment tightness coefficient generation module generates word-segment tightness coefficients. A word-segment tightness coefficient is the number of word segments that lie between two adjacent occurrences of the same word segment within the whole material. According to a specific embodiment of the invention, the tightness coefficients of each word segment are expressed as WGC=[G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of word segments between the first and second occurrences of the word segment in the material, G_W_ID_2 is the number between the second and third occurrences, and G_W_ID_(W_N-1) is the number between the (W_N-1)-th and W_N-th occurrences; G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the tightness coefficients of the word segment. According to a specific embodiment of the invention, the word-segment tightness coefficient feature vector generation module generates the word-segment tightness coefficient feature vector WGCVE=[W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the word segment in the segmentation lexicon, W_N is the total number of times the word segment occurs in the material, and W_CHAR is the part of speech of the word segment. The tightness coefficients reveal how a given word segment is distributed across the corresponding material.
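The definition above can be made concrete with a short sketch that records the positions of a word segment and takes the gaps between neighbouring positions. The function names and the sample material are assumptions made for illustration.

```python
def tightness_coefficients(segments, word):
    """Return WGC: word segments lying between adjacent occurrences of `word`."""
    positions = [i for i, s in enumerate(segments) if s == word]
    # Two occurrences at positions a < b have b - a - 1 segments between them.
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


def tightness_vector(segments, word, w_id, w_char):
    """Return WGCVE=[W_ID, W_N, W_CHAR, G_W_ID_1, ..., G_W_ID_(W_N-1)]."""
    gaps = tightness_coefficients(segments, word)
    w_n = segments.count(word)
    return [w_id, w_n, w_char, *gaps]


material = ["a", "b", "c", "a", "a", "b"]
print(tightness_coefficients(material, "a"))     # gaps between the three "a"s
print(tightness_vector(material, "a", 7, "noun"))
```

A word segment that occurs W_N times yields W_N-1 coefficients, so a segment that occurs only once contributes none.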
According to a specific embodiment of the invention, the word-segment group tightness coefficient generation module generates word-segment group tightness coefficients. A word-segment group tightness coefficient is the number of word segments that lie between two adjacent occurrences of the same word-segment group within the whole material. According to a specific embodiment of the invention, the tightness coefficients of each word-segment group are expressed as WGGC=[G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 is the number of word segments between the first and second occurrences of the word-segment group in the material, G_WG_ID_2 is the number between the second and third occurrences, and G_WG_ID_(WG_N-1) is the number between the (WG_N-1)-th and WG_N-th occurrences; G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the tightness coefficients of the word-segment group. According to a specific embodiment of the invention, the word-segment group tightness coefficient feature vector generation module generates the word-segment group tightness coefficient feature vector WGGCVE=[WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID is the unique number of the word-segment group in the segmentation lexicon, WG_N is the total number of times the word-segment group occurs in the material, and WG_CHAR is the part of speech of the word-segment group. The tightness coefficients reveal how a given word-segment group is distributed across the corresponding material.
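For groups, an occurrence of any member of the group counts as an occurrence of the group. The sketch below builds WGGCVE on that reading; the `group_of` mapping, the WG_ID value, and the sample material are invented for the example.

```python
def group_tightness_vector(segments, group_of, wg_id, wg_char):
    """Return WGGCVE=[WG_ID, WG_N, WG_CHAR, G_WG_ID_1, ..., G_WG_ID_(WG_N-1)].

    `group_of` maps a word segment to its hypothetical WG_ID; segments that
    belong to no group are simply absent from the mapping.
    """
    positions = [i for i, s in enumerate(segments) if group_of.get(s) == wg_id]
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    return [wg_id, len(positions), wg_char, *gaps]


group_of = {"quick": 100, "fast": 100}
material = ["quick", "x", "fast", "y", "z", "quick"]
print(group_tightness_vector(material, group_of, 100, "adjective"))
```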
According to a specific embodiment of the present invention, the middle foreign language participle group tightening coefficient generation module is used to generate middle foreign language participle group tightening coefficients. A middle foreign language participle group tightening coefficient is the number of participles that lie between two adjacent occurrences of the same middle foreign language participle group in the whole material. According to a specific embodiment of the present invention, the tightening coefficients corresponding to each middle foreign language participle group are expressed as WFGGC=[G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 is the number of participles between the first and second occurrences of the middle foreign language participle group in the material, G_WFG_ID_2 is the number of participles between its second and third occurrences, and G_WFG_ID_(WFG_N-1) is the number of participles between its (WFG_N-1)-th and WFG_N-th occurrences; G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the tightening coefficients corresponding to the middle foreign language participle group. According to a specific embodiment of the present invention, the middle foreign language participle group tightening coefficient feature vector generation module generates the middle foreign language participle group tightening coefficient feature vector WFGGCVE=[WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID is the unique number of the middle foreign language participle group in the participle library, WFG_N is the total number of times the middle foreign language participle group occurs in the material, and WFG_CHAR is the part of speech of the middle foreign language participle group. The middle foreign language participle group tightening coefficients reveal the overall distribution of a specific middle foreign language participle group in the corresponding material.
According to a specific embodiment of the present invention, the participle free vector dimension determining module determines the participle free vector dimension WFV from the word segmentation result of the material. The participle free vector dimension WFV is equal to the number of distinct participles obtained after segmenting a specific material. When the material is short or yields few participles, the resulting WFV is small; when the material is long or yields many participles, the resulting WFV is large.
According to a specific embodiment of the present invention, the participle group free vector dimension determining module determines the participle group free vector dimension WGFV from the word segmentation result of the material. The participle group free vector dimension WGFV is equal to the number of distinct participle groups obtained after segmenting a specific material. When the material is short or yields few participle groups, the resulting WGFV is small; when the material is long or yields many participle groups, the resulting WGFV is large.
According to a specific embodiment of the present invention, the middle foreign language participle group free vector dimension determining module determines the middle foreign language participle group free vector dimension WFGFV from the word segmentation result of the material. The middle foreign language participle group free vector dimension WFGFV is equal to the number of distinct middle foreign language participle groups obtained after segmenting a specific material. When the material is short or yields few middle foreign language participle groups, the resulting WFGFV is small; when the material is long or yields many middle foreign language participle groups, the resulting WFGFV is large.
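Since all three free vector dimensions are counts of distinct units, one helper can illustrate them. In this minimal sketch the function name is hypothetical, and representing a participle group as a tuple of participles is an assumption made only for the example.

```python
def free_vector_dimension(units):
    """Return the number of distinct units found in one material.

    `units` may hold participles (strings) or participle groups
    (assumed here to be tuples of participles).
    """
    return len(set(units))


# Hypothetical segmented material and its participle groups.
participles = ["paper", "self", "check", "paper", "check"]
groups = [("paper", "self"), ("self", "check"), ("paper", "self")]

wfv = free_vector_dimension(participles)   # participle free vector dimension
wgfv = free_vector_dimension(groups)       # participle group free vector dimension
```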
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module is used to simplify the participle free vector dimension WFV of each material and generate the participle simplified vector dimension RWV. The participle simplified vector dimension RWV is specified by the system. Preferably, the system specifies the participle simplified vector dimension RWV as 500; alternatively, it is preferably specified as 800 or as 1000.
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module simplifies the participle free vector dimension WFV by the equal-interval extraction method. The simplification process is as follows. Judge whether the participle free vector dimension WFV is greater than the participle simplified vector dimension RWV; if so, divide WFV by the system-specified RWV and round the quotient up to obtain the simplification coefficient REDU. Then, among the characteristic values corresponding to the participle free vector dimension WFV, extract one characteristic value at every interval of REDU-1 characteristic values. After all characteristic values have been traversed, judge whether the number of extracted characteristic values is equal to RWV. When the number of extracted characteristic values is equal to RWV, the simplification of WFV is complete. When the number of extracted characteristic values is less than RWV, calculate the difference between RWV and the number of extracted characteristic values, and randomly extract that many characteristic values from the characteristic values not yet extracted, completing the simplification of WFV.
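The equal-interval extraction steps can be sketched as follows. This is an illustrative reading, not the patented implementation: starting the extraction at the first characteristic value, returning the result in original order, and returning short inputs unchanged are all assumptions, and the function name is hypothetical.

```python
import math
import random


def reduce_equal_interval(features, rwv, rng=None):
    """Reduce `features` to `rwv` entries by equal-interval extraction."""
    rng = rng or random.Random()
    wfv = len(features)
    if wfv <= rwv:
        # Nothing to reduce; the text handles short material separately.
        return list(features)
    redu = math.ceil(wfv / rwv)           # simplification coefficient REDU
    picked = list(range(0, wfv, redu))    # skip REDU-1 values between picks
    shortfall = rwv - len(picked)
    if shortfall > 0:
        taken = set(picked)
        remaining = [i for i in range(wfv) if i not in taken]
        picked += rng.sample(remaining, shortfall)  # random top-up
    return [features[i] for i in sorted(picked)]
```

Because REDU is rounded up, the strided pass never yields more than RWV values, so only the top-up branch is needed to reach exactly RWV.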
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module simplifies the participle free vector dimension WFV by the part-of-speech screening method. The simplification process is as follows. The characteristic values of the word segmentation result are classified according to the part of speech of the corresponding participle. According to a specific embodiment of the present invention, the characteristic values are divided into A1 class notional word characteristic values, A2 class notional word characteristic values, B class notional word characteristic values, C class notional word characteristic values, D class notional word characteristic values and V class function word characteristic values. It is generally considered that the characteristic values corresponding to notional words play the greater role in similarity comparison, and that technical-term nouns reflect the effective content of the material better than common nouns. The number of characteristic values in each class is counted separately: AMOUNT_A1 (the number of A1 class notional word characteristic values), AMOUNT_A2 (the number of A2 class notional word characteristic values), AMOUNT_B (the number of B class notional word characteristic values), AMOUNT_C (the number of C class notional word characteristic values), AMOUNT_D (the number of D class notional word characteristic values) and AMOUNT_V (the number of V class function word characteristic values).
Calculate RWV_S_V, the value of the participle simplified vector dimension RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWV_S_D, the value of RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_V a number of characteristic values equal to the difference RWV_S_D, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWV_S_C, the value of RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_D a number of characteristic values equal to the difference RWV_S_C, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWV_S_B, the value of RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_C a number of characteristic values equal to the difference RWV_S_B, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWV_S_A2, the value of RWV-(AMOUNT_A1+AMOUNT_A2); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_B a number of characteristic values equal to the difference RWV_S_A2, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWV_S_A1, the value of RWV-AMOUNT_A1; if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_A2 a number of characteristic values equal to the difference RWV_S_A1, completing this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, randomly extract from the characteristic values corresponding to AMOUNT_A1 a number of characteristic values equal to the simplified vector dimension RWV, completing this simplification.
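The cascade of comparisons above amounts to filling the RWV slots class by class in the priority order A1, A2, B, C, D, V: whole classes are kept while they fit, and the first class that does not fit is sampled at random for the remaining slots. The sketch below restates it that way; the dictionary layout, the function name, and the `None` return for material that is too short are assumptions for illustration only.

```python
import random

# Priority order of the part-of-speech classes named in the text.
PRIORITY = ["A1", "A2", "B", "C", "D", "V"]


def reduce_by_part_of_speech(features_by_class, rwv, rng=None):
    """Keep `rwv` characteristic values, preferring higher-priority classes."""
    rng = rng or random.Random()
    total = sum(len(features_by_class.get(c, [])) for c in PRIORITY)
    if total < rwv:
        # RWV_S_V > 0: too little material, exit this simplification.
        return None
    kept = []
    for cls in PRIORITY:
        members = features_by_class.get(cls, [])
        room = rwv - len(kept)
        if room == 0:
            break
        if len(members) <= room:
            kept.extend(members)                    # whole class fits
        else:
            kept.extend(rng.sample(members, room))  # random fill, then stop
            break
    return kept
```

With RWV smaller than AMOUNT_A1 the loop samples RWV values from the A1 class alone, which matches the last branch of the cascade.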
The case in which RWV_S_V, the value of the participle simplified vector dimension RWV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V), is greater than 0 means that the material is short or carries little information, and the material is therefore unsuitable for comparison by characteristic values.
When the participle free vector dimension WFV is less than the participle simplified vector dimension RWV, the dimension of the material itself is small, and the values under the remaining dimensions are equivalent to 0. Such material needs to be marked directly in the system and recorded and processed separately. Examples are folk sayings, famous quotations and the like, which are used as an index for retrieval. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module is used to simplify the participle group free vector dimension WGFV of each material and generate the participle group simplified vector dimension RWGV. The participle group simplified vector dimension RWGV is specified by the system. Preferably, the system specifies the participle group simplified vector dimension RWGV as 500; alternatively, it is preferably specified as 800 or as 1000.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module simplifies the participle group free vector dimension WGFV by the equal-interval extraction method. The simplification process is as follows. Judge whether the participle group free vector dimension WGFV is greater than the participle group simplified vector dimension RWGV; if so, divide WGFV by the system-specified RWGV and round the quotient up to obtain the simplification coefficient REDU. Then, among the characteristic values corresponding to the participle group free vector dimension WGFV, extract one characteristic value at every interval of REDU-1 characteristic values. After all characteristic values have been traversed, judge whether the number of extracted characteristic values is equal to RWGV. When the number of extracted characteristic values is equal to RWGV, the simplification of WGFV is complete. When the number of extracted characteristic values is less than RWGV, calculate the difference between RWGV and the number of extracted characteristic values, and randomly extract that many characteristic values from the characteristic values not yet extracted, completing the simplification of WGFV.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module simplifies the participle group free vector dimension WGFV by the part-of-speech screening method. The simplification process is as follows. The characteristic values are classified according to the corresponding part of speech. According to a specific embodiment of the present invention, the characteristic values are divided into A1 class notional word characteristic values, A2 class notional word characteristic values, B class notional word characteristic values, C class notional word characteristic values, D class notional word characteristic values and V class function word characteristic values. It is generally considered that the characteristic values corresponding to notional words play the greater role in similarity comparison, and that technical-term nouns reflect the effective content of the material better than common nouns. The number of characteristic values in each class is counted separately: AMOUNT_A1 (the number of A1 class notional word characteristic values), AMOUNT_A2 (the number of A2 class notional word characteristic values), AMOUNT_B (the number of B class notional word characteristic values), AMOUNT_C (the number of C class notional word characteristic values), AMOUNT_D (the number of D class notional word characteristic values) and AMOUNT_V (the number of V class function word characteristic values).
Calculate RWGV_S_V, the value of the participle group simplified vector dimension RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWGV_S_D, the value of RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_V a number of characteristic values equal to the difference RWGV_S_D, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWGV_S_C, the value of RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_D a number of characteristic values equal to the difference RWGV_S_C, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWGV_S_B, the value of RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_C a number of characteristic values equal to the difference RWGV_S_B, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWGV_S_A2, the value of RWGV-(AMOUNT_A1+AMOUNT_A2); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_B a number of characteristic values equal to the difference RWGV_S_A2, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWGV_S_A1, the value of RWGV-AMOUNT_A1; if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_A2 a number of characteristic values equal to the difference RWGV_S_A1, completing this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, randomly extract from the characteristic values corresponding to AMOUNT_A1 a number of characteristic values equal to the simplified vector dimension RWGV, completing this simplification.
The case in which RWGV_S_V, the value of the participle group simplified vector dimension RWGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V), is greater than 0 means that the material is short or carries little information, and the material is therefore unsuitable for comparison by characteristic values.
When the participle group free vector dimension WGFV is less than the participle group simplified vector dimension RWGV, the dimension of the material itself is small, and the values under the remaining dimensions are equivalent to 0. Such material needs to be marked directly in the system and recorded and processed separately. Examples are folk sayings, famous quotations and the like, which are used as an index for retrieval. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the middle foreign language participle group simplified vector dimension generation module is used to simplify the middle foreign language participle group free vector dimension WFGFV of each material and generate the middle foreign language participle group simplified vector dimension RWFGV. The middle foreign language participle group simplified vector dimension RWFGV is specified by the system. Preferably, the system specifies the middle foreign language participle group simplified vector dimension RWFGV as 500; alternatively, it is preferably specified as 800 or as 1000.
According to a specific embodiment of the present invention, the middle foreign language participle group simplified vector dimension generation module simplifies the middle foreign language participle group free vector dimension WFGFV by the equal-interval extraction method. The simplification process is as follows. Judge whether the middle foreign language participle group free vector dimension WFGFV is greater than the middle foreign language participle group simplified vector dimension RWFGV; if so, divide WFGFV by the system-specified RWFGV and round the quotient up to obtain the simplification coefficient REDU. Then, among the characteristic values corresponding to the middle foreign language participle group free vector dimension WFGFV, extract one characteristic value at every interval of REDU-1 characteristic values. After all characteristic values have been traversed, judge whether the number of extracted characteristic values is equal to RWFGV. When the number of extracted characteristic values is equal to RWFGV, the simplification of WFGFV is complete. When the number of extracted characteristic values is less than RWFGV, calculate the difference between RWFGV and the number of extracted characteristic values, and randomly extract that many characteristic values from the characteristic values not yet extracted, completing the simplification of WFGFV.
According to a specific embodiment of the present invention, the middle foreign language participle group simplified vector dimension generation module simplifies the middle foreign language participle group free vector dimension WFGFV by the part-of-speech screening method. The simplification process is as follows. The characteristic values are classified according to the corresponding part of speech. According to a specific embodiment of the present invention, the characteristic values are divided into A1 class notional word characteristic values, A2 class notional word characteristic values, B class notional word characteristic values, C class notional word characteristic values, D class notional word characteristic values and V class function word characteristic values. It is generally considered that the characteristic values corresponding to notional words play the greater role in similarity comparison, and that technical-term nouns reflect the effective content of the material better than common nouns. The number of characteristic values in each class is counted separately: AMOUNT_A1 (the number of A1 class notional word characteristic values), AMOUNT_A2 (the number of A2 class notional word characteristic values), AMOUNT_B (the number of B class notional word characteristic values), AMOUNT_C (the number of C class notional word characteristic values), AMOUNT_D (the number of D class notional word characteristic values) and AMOUNT_V (the number of V class function word characteristic values).
Calculate RWFGV_S_V, the value of the middle foreign language participle group simplified vector dimension RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWFGV_S_D, the value of RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_V a number of characteristic values equal to the difference RWFGV_S_D, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWFGV_S_C, the value of RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_D a number of characteristic values equal to the difference RWFGV_S_C, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWFGV_S_B, the value of RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_C a number of characteristic values equal to the difference RWFGV_S_B, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWFGV_S_A2, the value of RWFGV-(AMOUNT_A1+AMOUNT_A2); if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_B a number of characteristic values equal to the difference RWFGV_S_A2, completing this simplification; if it is equal to 0, this simplification is complete. If it is less than 0, further calculate RWFGV_S_A1, the value of RWFGV-AMOUNT_A1; if it is greater than 0, randomly extract from the characteristic values corresponding to AMOUNT_A2 a number of characteristic values equal to the difference RWFGV_S_A1, completing this simplification; if it is equal to 0, this simplification is complete; if it is less than 0, randomly extract from the characteristic values corresponding to AMOUNT_A1 a number of characteristic values equal to the simplified vector dimension RWFGV, completing this simplification.
The case in which RWFGV_S_V, the value of the middle foreign language participle group simplified vector dimension RWFGV-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V), is greater than 0 means that the material is short or carries little information, and the material is therefore unsuitable for comparison by characteristic values.
When the middle foreign language participle group free vector dimension WFGFV is less than the middle foreign language participle group simplified vector dimension RWFGV, the dimension of the material itself is small, and the values under the remaining dimensions are equivalent to 0. Such material needs to be marked directly in the system and recorded and processed separately. Examples are folk sayings, famous quotations and the like, which are used as an index for retrieval. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the participle feature vector generation module extracts, for each material, the characteristic values corresponding to the participle simplified vector dimension RWV and generates the participle feature vector WVE_RWV:
WVE_RWV=[W_ID1, W_N1, ..., W_IDi, W_Ni, ..., W_IDRWV, W_NRWV]
where W_IDi is the unique number of the i-th participle in the participle library and W_Ni is the total number of times that participle occurs in the material; this number of occurrences is used as the characteristic value of the participle.
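As a minimal sketch of how such a vector of (number, count) pairs could be assembled, assuming the participles retained by the simplification step are already known: the function name, the `library_ids` mapping and the sample numbers are hypothetical.

```python
from collections import Counter


def participle_feature_vector(tokens, selected, library_ids):
    """Build [W_ID1, W_N1, ..., W_IDRWV, W_NRWV] for one material.

    tokens      -- segmented material (list of participles)
    selected    -- the RWV participles kept by the simplification step
    library_ids -- assumed mapping from participle to its library number
    """
    counts = Counter(tokens)
    vector = []
    for participle in selected:
        # The occurrence count serves as the characteristic value.
        vector.extend([library_ids[participle], counts[participle]])
    return vector


vec = participle_feature_vector(
    ["paper", "check", "paper", "self"],
    ["paper", "self"],
    {"paper": 11, "check": 12, "self": 13},
)
```

The same layout would apply to the participle group and middle foreign language participle group vectors, with group numbers and group counts in place of the participle ones.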
According to a specific embodiment of the present invention, the participle group feature vector generation module extracts, for each material, the characteristic values corresponding to the participle group simplified vector dimension RWGV and generates the participle group feature vector WVE_RWGV:
WVE_RWGV=[WG_ID1, WG_N1, ..., WG_IDi, WG_Ni, ..., WG_IDRWGV, WG_NRWGV]
where WG_IDi is the unique number of the i-th participle group in the participle library and WG_Ni is the total number of times that participle group occurs in the material; this number of occurrences is used as the characteristic value of the participle group.
According to a specific embodiment of the present invention, the middle foreign language participle group feature vector generation module extracts, for each material, the characteristic values corresponding to the middle foreign language participle group simplified vector dimension RWFGV and generates the middle foreign language participle group feature vector WVE_RWFGV:
WVE_RWFGV=[WFG_ID1, WFG_N1, ..., WFG_IDi, WFG_Ni, ..., WFG_IDRWFGV, WFG_NRWFGV]
where WFG_IDi is the unique number of the i-th middle foreign language participle group in the participle library and WFG_Ni is the total number of times that middle foreign language participle group occurs in the material; this number of occurrences is used as the characteristic value of the middle foreign language participle group.
According to a specific embodiment of the present invention, the system provides users with multiple access modes. When a user accesses the system, the user access mode detection module detects the access mode of the current user.
In a specific embodiment of the present invention, a user can access the system in trial mode; a user who accesses the system in trial mode is hereinafter called a trial user. When the user access mode detection module detects that a user is accessing in trial mode, it sends a prompt to the trial user stating that the current access mode is trial mode and informing the trial user of the corresponding access rights. According to one embodiment of the present invention, for a user accessing in trial mode, the system provides only a detection trial of a predetermined number of characters, the predetermined number being set in advance by the system. According to another embodiment of the present invention, for a user accessing in trial mode, the system provides the trial user with a database of partial or full scope for trial detection. According to yet another embodiment of the present invention, for a user accessing in trial mode, the plagiarism detection result provided to the trial user gives only the plagiarism rate and gives neither the specific plagiarized positions nor the comparison with the plagiarized documents. According to yet another embodiment of the present invention, for a user accessing in trial mode, the plagiarism detection result provided to the trial user gives the specific plagiarized positions but blurs the comparison with the plagiarized documents, so that the trial user can learn only the specific plagiarized positions in the document he or she provided and cannot identify specific information about the plagiarized documents.
According to a specific embodiment of the present invention, a user accesses the system in counting mode; a user who accesses the system in counting mode is hereinafter called a counting user. When the user access mode detection module detects that a user is accessing in counting mode, it sends a prompt to the counting user stating that the current access mode is counting mode and prompting the counting user to upload the document that requires plagiarism comparison. According to a specific embodiment of the present invention, the system counts the number of characters in the document uploaded by the counting user and calculates the fee for this plagiarism detection from the counted number of characters. According to another embodiment of the present invention, the system offers the counting user databases of partial or full scope to choose from and calculates the fee for this plagiarism detection according to the database scope the counting user selects.
According to a specific embodiment of the present invention, a user accesses the system in timing mode; a user who accesses the system in timing mode is hereinafter called a timing user. When the user access mode detection module detects that a user is accessing in timing mode, it sends a prompt to the timing user stating that the current access mode is timing mode and showing the timing user the remaining usage duration. According to another embodiment of the present invention, while a timing user is using the system, the system shows a real-time countdown of the remaining usage duration in the display interface. According to yet another embodiment of the present invention, the system offers the timing user databases of partial or full scope to choose from. According to a specific embodiment of the present invention, the system estimates the detection duration required for a document from the database scope selected by the timing user and the number of characters in the document uploaded for detection, and tells the timing user whether the remaining usage duration is sufficient to complete the current plagiarism detection.
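The text does not say how the detection duration is estimated. Purely as an illustration, the sketch below assumes a linear model in which each selected database scope costs a fixed number of seconds per uploaded character; the function names, the scope names and the rates are all hypothetical.

```python
def estimate_detection_seconds(char_count, selected_scopes, seconds_per_char):
    """Assumed linear estimate: characters times a per-scope rate, summed."""
    return sum(char_count * seconds_per_char[scope] for scope in selected_scopes)


def can_finish(char_count, selected_scopes, seconds_per_char, remaining_seconds):
    """Return (fits, needed_seconds) for the timing user's remaining duration."""
    needed = estimate_detection_seconds(char_count, selected_scopes, seconds_per_char)
    return needed <= remaining_seconds, needed


# Hypothetical per-character rates for two database scopes.
rates = {"journals": 0.5, "theses": 0.25}
fits, needed = can_finish(1000, ["journals", "theses"], rates, 600)
```

Any real system would calibrate such rates from measured detection times; the point here is only the comparison between the estimate and the remaining duration.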
According to a specific embodiment of the present invention, after a timing user logs in to the system, the user detection mode determining module determines the plagiarism detection mode. According to a specific embodiment of the present invention, the system offers a choice of self-check mode, ordinary plagiarism identification mode, extended plagiarism identification mode, multilingual plagiarism identification mode, and formula plagiarism identification mode.
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current detection mode is self-check mode, the user writing style test module presents the user with one or more test pictures, and the user writes, online and within a specified time, a description of each test picture of no fewer than a specified word count. Preferably, the user writing style test module also presents the user with one or more test articles, and the user writes, online and within the specified time, a commentary of no fewer than the specified word count. The test pictures and test articles are selected at random by the user writing style test module from a test picture library and a test article library. Whether test pictures or test articles are used, the user must write the description or commentary online. Because the specified time cannot be set too long, it is generally 30 minutes or 60 minutes, and the corresponding specified word count of the description or commentary is generally 400 words per 30 minutes or 800 words per 60 minutes. Those skilled in the art may set other specified times or word counts as needed. Experimental data indicate that the specified time should not be too long, so that a user who is short of time or has an unstable network connection can still complete the test; likewise, the ratio of the specified word count to the specified time should not be too low, otherwise the test may not faithfully reflect the user's writing habits. Because the specified time is limited, the length of the description or commentary is also limited, and the feature values and feature vectors extracted from the online test text alone may not truly reflect the user's writing habits. It is therefore necessary to additionally extract a test picture benchmark feature vector and a test article benchmark feature vector, which are used to correct the feature vector deviation caused by the short length of the description or commentary.
According to a specific embodiment of the present invention, every test picture in the test picture library has a test picture benchmark feature vector. To obtain it, a predetermined number of benchmark testers are selected at random from populations with different backgrounds, and each writes a description of the given test picture of no fewer than the specified word count. All the descriptions are collected, the test picture description feature values for that test picture are computed, feature vectors are calculated from those feature values, and the feature vectors are combined by a weighting operation to give the test picture benchmark feature vector of that test picture. The weights used in the weighting operation are set by the system. Likewise, every test article in the test article library has a test article benchmark feature vector. To obtain it, a predetermined number of benchmark testers are selected at random from populations with different backgrounds, and each writes a description of the given test article of no fewer than the specified word count. All the descriptions are collected, the test article description feature values for that test article are computed, feature vectors are calculated from those feature values, and the feature vectors are combined by a weighting operation to give the test article benchmark feature vector of that test article. The weights used in the weighting operation are set by the system.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers are selected at random from populations with different backgrounds, they may be selected by age bracket, preferably divided into an under-20 group, a 20-29 group, a 30-39 group, a 40-49 group, and a 50-and-over group. Descriptions of no fewer than the specified word count are thereby collected from each age group for the same test picture or the same test article.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers are selected at random from populations with different backgrounds, they may be selected by educational level, preferably divided into a below-undergraduate group, an undergraduate group, a master's student group, and a doctoral student group. Descriptions of no fewer than the specified word count are thereby collected from each educational group for the same test picture or the same test article.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers are selected at random from populations with different backgrounds, they may be selected by professional field (the fields may be divided according to the required test accuracy, which is not described further here). Descriptions of no fewer than the specified word count are thereby collected from each professional field group for the same test picture or the same test article.
According to a specific embodiment of the present invention, the test picture description feature value generation module obtains the test picture description text written by each benchmark tester and generates the test picture description feature values. The test picture description feature values include, but are not limited to: the number of Chinese characters, the number of foreign-language words, the total number of words, the number of notional words, the number of function words, the number of paragraphs, the paragraph length distribution, the number of sentences, the sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage, and part-of-speech usage. According to a specific embodiment of the present invention, these feature values are defined as follows. The number of Chinese characters is the number of Chinese characters, excluding punctuation marks, in each test picture description, each Chinese character counting as one character. The number of foreign-language words is the number of foreign-language characters, excluding punctuation marks, in each test picture description, each foreign-language word counting as one character. The total number of words is the number of words obtained by segmenting each test picture description; Chinese text may be segmented with the system's built-in segmentation dictionary, and foreign-language text may be segmented directly at the spaces between words, following foreign-language writing conventions. The number of notional words is the number of notional words in each test picture description, obtained after segmentation by comparing the segmentation result with the parts of speech recorded in the segmentation dictionary; it may be further divided into the number of Chinese notional words and the number of foreign-language notional words, whose sum equals the number of notional words. The number of function words is the number of function words in each test picture description, obtained after segmentation in the same way; it may be further divided into the number of Chinese function words and the number of foreign-language function words, whose sum equals the number of function words. The number of paragraphs is the number of paragraphs in each test picture description. The paragraph length distribution is the number of words and the number of sentences contained in each paragraph of each test picture description. The number of sentences is the number of sentences in each test picture description. The sentence length distribution is the number of words contained in each sentence of each test picture description. Synonym and near-synonym expansion is obtained by comparing the segmentation result of each test picture description with a synonym and near-synonym dictionary, grouping segmented words with the same or similar meaning into sets, and counting the words in each set. This reflects the author's habits in using synonyms and near-synonyms: the more words a synonym or near-synonym set contains, the more the author tends to vary wording with synonyms or near-synonyms; the fewer words it contains, the less the author tends to do so. Function word usage is the statistics of function word use in each test picture description, including but not limited to the frequency ranking of the function words used, the number of words between each pair of different function words, and the number of words between occurrences of the same function word. For example, the usage of the three structural particles "的", "地", and "得" may be counted separately, which reflects whether the author distinguishes among these three particles. Punctuation usage is the statistics of punctuation use in each test picture description, including but not limited to the frequency ranking of the punctuation marks used, the number of words between each pair of different punctuation marks, and the number of words between occurrences of the same punctuation mark. Part-of-speech usage is the statistics of the segmented words of each part of speech in each test picture description, obtained after segmentation by comparing the segmentation result with the parts of speech recorded in the segmentation dictionary; for example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words, and the ratio of the number of each part of speech to the total number of words in the test picture description.
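A few of the simpler feature values listed above can be sketched with regular expressions. This is an illustration under stated assumptions, not the patent's implementation: the CJK Unified Ideographs range stands in for "Chinese characters", foreign-language words are approximated as runs of Latin letters, paragraphs are assumed to be separated by newlines, and sentences are assumed to end at common Chinese or Western terminal punctuation. Word segmentation and part-of-speech lookup, which need a segmentation dictionary, are left out.

```python
import re

# Assumed approximations; the source does not specify how characters are classified.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
FOREIGN_WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")
SENTENCE_END = re.compile(r"[。！？!?.]+")


def basic_feature_values(text: str) -> dict:
    """Compute a few description feature values that need no segmentation dictionary."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return {
        "chinese_chars": len(CJK_CHAR.findall(text)),     # punctuation excluded
        "foreign_words": len(FOREIGN_WORD.findall(text)),  # each word counts once
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
    }
```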
According to a specific embodiment of the present invention, the test picture description feature value generation module generates a test picture description feature vector from the test picture description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the test picture description feature vector, as well as the content of each entry in the feature vector and the order of the entries. When the dimension of the test picture description feature vector is n, it can be expressed as TPCVE = [TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the first entry of the test picture description feature vector, TPC_m is the m-th entry, and TPC_n is the n-th entry.
Preferably, the test picture description feature vector includes one or more of the following: the ratio of the number of Chinese characters to the total number of words, the ratio of the number of foreign-language words to the total number of words, the ratio of the number of notional words to the total number of words, the ratio of the number of function words to the total number of words, the ratio of the total number of words to the number of paragraphs, the number of words in the longest paragraph, the ratio of the number of synonym and near-synonym expansions to the total number of words, the ratio of the number of punctuation marks used to the total number of words, and the ratios to the total number of words of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words.
According to a specific embodiment of the present invention, the test picture benchmark feature vector generation module collects the test picture description feature vectors for the same test picture and applies a weighting operation to them to obtain the benchmark feature vector of that test picture; the weights used in the weighting operation are set by the system. Preferably, the test picture benchmark feature vector generation module may, for each age group, educational group, and professional field group, collect a predetermined number of test picture description feature vectors and apply the weighting operation to each group separately, obtaining a test picture benchmark feature vector for each age group, each educational group, and each professional field group.
The benchmark feature vector of a specific test picture can be expressed as:
where TPCVE_ID denotes the test picture benchmark feature vector numbered ID; k is the number of benchmark testers; TPC_1_i is the first entry of the feature vector of the i-th benchmark tester; TPC_m_i is the m-th entry of the feature vector of the i-th benchmark tester; TPC_n_i is the n-th entry of the feature vector of the i-th benchmark tester; W_1,i is the weight coefficient of TPC_1_i; W_m,i is the weight coefficient of TPC_m_i; and W_n,i is the weight coefficient of TPC_n_i.
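A form consistent with the symbols defined above would be the following, assuming the weighting operation is a plain entry-wise weighted sum over the k benchmark testers with any normalization folded into the system-set weights. This is a reconstruction from the definitions, not a formula quoted from the source.

```latex
\mathrm{TPCVE\_ID} =
\left[
  \sum_{i=1}^{k} W_{1,i}\,\mathrm{TPC\_1}_{i},\;
  \ldots,\;
  \sum_{i=1}^{k} W_{m,i}\,\mathrm{TPC\_m}_{i},\;
  \ldots,\;
  \sum_{i=1}^{k} W_{n,i}\,\mathrm{TPC\_n}_{i}
\right]
```

Under this reading, choosing every weight equal to 1/k reduces each entry to the arithmetic mean of that entry over the benchmark testers.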
According to a specific embodiment of the present invention, the test article description feature value generation module obtains the test article description text written by each benchmark tester and generates the test article description feature values. The test article description feature values include, but are not limited to: the number of Chinese characters, the number of foreign-language words, the total number of words, the number of notional words, the number of function words, the number of paragraphs, the paragraph length distribution, the number of sentences, the sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage, and part-of-speech usage. According to a specific embodiment of the present invention, these feature values are defined as follows. The number of Chinese characters is the number of Chinese characters, excluding punctuation marks, in each test article description, each Chinese character counting as one character. The number of foreign-language words is the number of foreign-language characters, excluding punctuation marks, in each test article description, each foreign-language word counting as one character. The total number of words is the number of words obtained by segmenting each test article description; Chinese text may be segmented with the system's built-in segmentation dictionary, and foreign-language text may be segmented directly at the spaces between words, following foreign-language writing conventions. The number of notional words is the number of notional words in each test article description, obtained after segmentation by comparing the segmentation result with the parts of speech recorded in the segmentation dictionary; it may be further divided into the number of Chinese notional words and the number of foreign-language notional words, whose sum equals the number of notional words. The number of function words is the number of function words in each test article description, obtained after segmentation in the same way; it may be further divided into the number of Chinese function words and the number of foreign-language function words, whose sum equals the number of function words. The number of paragraphs is the number of paragraphs in each test article description. The paragraph length distribution is the number of words and the number of sentences contained in each paragraph of each test article description. The number of sentences is the number of sentences in each test article description. The sentence length distribution is the number of words contained in each sentence of each test article description. Synonym and near-synonym expansion is obtained by comparing the segmentation result of each test article description with a synonym and near-synonym dictionary, grouping segmented words with the same or similar meaning into sets, and counting the words in each set. This reflects the author's habits in using synonyms and near-synonyms: the more words a synonym or near-synonym set contains, the more the author tends to vary wording with synonyms or near-synonyms; the fewer words it contains, the less the author tends to do so. Function word usage is the statistics of function word use in each test article description, including but not limited to the frequency ranking of the function words used, the number of words between each pair of different function words, and the number of words between occurrences of the same function word. For example, the usage of the three structural particles "的", "地", and "得" may be counted separately, which reflects whether the author distinguishes among these three particles. Punctuation usage is the statistics of punctuation use in each test article description, including but not limited to the frequency ranking of the punctuation marks used, the number of words between each pair of different punctuation marks, and the number of words between occurrences of the same punctuation mark. Part-of-speech usage is the statistics of the segmented words of each part of speech in each test article description, obtained after segmentation by comparing the segmentation result with the parts of speech recorded in the segmentation dictionary; for example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words, and the ratio of the number of each part of speech to the total number of words in the test article description.
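The function word usage statistics can be sketched as follows, working on text that has already been segmented. The token list and the function word set passed in are assumed inputs (in practice they would come from the segmentation dictionary), and only two of the statistics are shown: the frequency ranking and the number of words between successive occurrences of the same function word.

```python
from collections import Counter, defaultdict


def function_word_stats(tokens, function_words):
    """Return (frequency ranking, gaps between repeats of the same function word).

    tokens: the segmented words of one description, in order (assumed input).
    function_words: the set of words treated as function words (assumed input).
    """
    usage = Counter(tok for tok in tokens if tok in function_words)
    last_seen = {}
    gaps = defaultdict(list)
    for idx, tok in enumerate(tokens):
        if tok in function_words:
            if tok in last_seen:
                # Number of words strictly between two occurrences of the same word.
                gaps[tok].append(idx - last_seen[tok] - 1)
            last_seen[tok] = idx
    return usage.most_common(), dict(gaps)
```

The same routine could be reused for punctuation usage by passing punctuation marks as the tracked set, since the source describes the two statistics in parallel terms.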
According to a specific embodiment of the present invention, the test article description feature value generation module generates a test article description feature vector from the test article description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the test article description feature vector, as well as the content of each entry in the feature vector and the order of the entries. When the dimension of the test article description feature vector is n, it can be expressed as TTCVE = [TTC_1, ..., TTC_m, ..., TTC_n], where TTC_1 is the first entry of the test article description feature vector, TTC_m is the m-th entry, and TTC_n is the n-th entry.
Preferably, the test article description feature vector includes one or more of the following: the ratio of the number of Chinese characters to the total number of words, the ratio of the number of foreign-language words to the total number of words, the ratio of the number of notional words to the total number of words, the ratio of the number of function words to the total number of words, the ratio of the total number of words to the number of paragraphs, the number of words in the longest paragraph, the ratio of the number of synonym and near-synonym expansions to the total number of words, the ratio of the number of punctuation marks used to the total number of words, and the ratios to the total number of words of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words.
According to a specific embodiment of the present invention, the test article benchmark feature vector generation module collects the test article description feature vectors for the same test article and applies a weighting operation to them to obtain the benchmark feature vector of that test article; the weights used in the weighting operation are set by the system. Preferably, the test article benchmark feature vector generation module may, for each age group, educational group, and professional field group, collect a predetermined number of test article description feature vectors and apply the weighting operation to each group separately, obtaining a test article benchmark feature vector for each age group, each educational group, and each professional field group.
The benchmark feature vector of a specific test article can be expressed as:
where TTCVE_ID denotes the test article benchmark feature vector numbered ID; k is the number of benchmark testers; TTC_1_i is the first entry of the feature vector of the i-th benchmark tester; TTC_m_i is the m-th entry of the feature vector of the i-th benchmark tester; TTC_n_i is the n-th entry of the feature vector of the i-th benchmark tester; W_1,i is the weight coefficient of TTC_1_i; W_m,i is the weight coefficient of TTC_m_i; and W_n,i is the weight coefficient of TTC_n_i.
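The weighting operation that produces a benchmark feature vector can be sketched in code. The sketch assumes an entry-wise weighted sum over the benchmark testers, with one weight per tester and per entry; the function name and the list-of-lists layout are illustrative choices, not taken from the source.

```python
def benchmark_vector(tester_vectors, weights):
    """Combine k tester feature vectors into one benchmark feature vector.

    tester_vectors[i][m] is entry m of the i-th benchmark tester's feature vector.
    weights[i][m] is the system-set weight for that entry (assumed layout).
    """
    k = len(tester_vectors)
    n = len(tester_vectors[0])
    if any(len(v) != n for v in tester_vectors) or len(weights) != k:
        raise ValueError("all vectors and weight rows must share the same shape")
    return [
        sum(weights[i][m] * tester_vectors[i][m] for i in range(k))
        for m in range(n)
    ]
```

Calling the function once per age group, educational group, or professional field group, with only that group's vectors, would give the per-group benchmark vectors the text describes.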
According to a specific embodiment of the present invention, the test picture description feature vector and the test article description feature vector have the same dimension, and the meaning and order of the feature values in the two vectors are the same. For example, in both the test picture description feature vector and the test article description feature vector, the feature values may be set as follows: (1) the ratio of the number of Chinese characters to the total number of words; (2) the ratio of the number of foreign-language words to the total number of words; (3) the ratio of the number of notional words to the total number of words; (4) the ratio of the number of function words to the total number of words; (5) the ratio of the total number of words to the number of paragraphs; (6) the number of words in the longest paragraph; (7) the ratio of the number of synonym and near-synonym expansions to the total number of words; (8) the ratio of the number of punctuation marks used to the total number of words; (9) the ratio of the number of nouns to the total number of words; (10) the ratio of the number of verbs to the total number of words; (11) the ratio of the number of adjectives to the total number of words; (12) the ratio of the number of numerals to the total number of words; (13) the ratio of the number of measure words to the total number of words; (14) the ratio of the number of pronouns to the total number of words; (15) the ratio of the number of adverbs to the total number of words; (16) the ratio of the number of prepositions to the total number of words; (17) the ratio of the number of conjunctions to the total number of words; (18) the ratio of the number of auxiliary words to the total number of words; (19) the ratio of the number of interjections to the total number of words; and (20) the ratio of the number of onomatopoeic words to the total number of words.
According to a specific embodiment of the present invention, feature values may be added to or removed from the test picture description feature vector and the test article description feature vector, but after any addition or removal the two vectors must still have the same dimension and the same meaning and order of feature values.
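One way to keep the picture and article vectors aligned is to build both from a single ordered table of entries, so that adding or removing a feature value changes both vectors together. The sketch below shows only the first six entries; the dictionary keys and function names are hypothetical.

```python
# One shared, ordered table of (name, formula) pairs; keys of `c` are hypothetical.
FEATURE_ORDER = [
    ("chinese_ratio", lambda c: c["chinese_chars"] / c["total_words"]),
    ("foreign_ratio", lambda c: c["foreign_words"] / c["total_words"]),
    ("notional_ratio", lambda c: c["notional_words"] / c["total_words"]),
    ("function_ratio", lambda c: c["function_words"] / c["total_words"]),
    ("words_per_paragraph", lambda c: c["total_words"] / c["paragraphs"]),
    ("longest_paragraph", lambda c: float(c["longest_paragraph_words"])),
]


def build_feature_vector(counts: dict) -> list:
    """Build a description feature vector; used for pictures and articles alike."""
    if counts["total_words"] == 0 or counts["paragraphs"] == 0:
        raise ValueError("an empty description has no defined feature vector")
    return [formula(counts) for _, formula in FEATURE_ORDER]
```

Because both kinds of description pass through the same table, the dimension and entry order cannot drift apart.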
According to a specific embodiment of the present invention, the user test picture description feature value generation module obtains the user's test picture description text and generates the user test picture description feature values. The user test picture description feature values cover the same content as the test picture description feature values, which is not repeated here. The user test picture description feature vector generation module calculates the user test picture description feature vector from the user test picture description feature values. When the dimension of the test picture description feature vector is n, the feature vector of the description written by the current user USER for the picture numbered ID can be expressed as TPCVE_ID_USER = [TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER], where TPC_1_USER is the first entry of the user test picture description feature vector of the current user USER, TPC_m_USER is the m-th entry, and TPC_n_USER is the n-th entry.
The user picture writing style feature vector generation module calculates the difference between the user test picture description feature vector TPCVE_ID_USER and the test picture benchmark feature vector TPCVE_ID of the same test picture, and uses this difference (TPCVE_ID_USER - TPCVE_ID) as the user picture writing style feature vector TPCVE_USER.
According to a specific embodiment of the present invention, the user test article description feature value generation module obtains the user's test article description text and generates the user test article description feature values. The user test article description feature values cover the same content as the test article description feature values, which is not repeated here. The user test article description feature vector generation module calculates the user test article description feature vector from the user test article description feature values. When the dimension of the test article description feature vector is n, the feature vector of the description written by the current user USER for the article numbered ID can be expressed as TTCVE_ID_USER = [TTC_1_USER, ..., TTC_m_USER, ..., TTC_n_USER], where TTC_1_USER is the first entry of the user test article description feature vector of the current user USER, TTC_m_USER is the m-th entry, and TTC_n_USER is the n-th entry.
The user article writing style feature vector generation module calculates the difference between the user test article description feature vector TTCVE_ID_USER and the test article benchmark feature vector TTCVE_ID of the same test article, and uses this difference (TTCVE_ID_USER - TTCVE_ID) as the user article writing style feature vector TTCVE_USER.
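The difference step is a plain entry-wise subtraction of the benchmark vector from the user's vector, and the same routine serves for pictures (TPCVE_ID_USER - TPCVE_ID) and articles (TTCVE_ID_USER - TTCVE_ID). The function name below is an illustrative choice.

```python
def style_vector(user_vec, benchmark_vec):
    """Return the entry-wise difference user_vec - benchmark_vec."""
    if len(user_vec) != len(benchmark_vec):
        raise ValueError("user and benchmark vectors must have the same dimension")
    return [u - b for u, b in zip(user_vec, benchmark_vec)]
```

A positive entry means the user exceeds the benchmark group on that feature, and a negative entry means the user falls below it.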
According to a specific embodiment of the present invention, when several test pictures or several test articles are used, or when one or more test pictures are used together with one or more test articles, the user test picture description feature value generation module and the user test article description feature value generation module generate user test picture and/or article description feature values from the user's description text for each test picture and each test article, respectively. The user test picture description feature vector generation module and the user test article description feature vector generation module then generate the user test picture and/or article description feature vectors from those feature values. The user picture writing style feature vector generation module and the user article writing style feature vector generation module calculate the difference between each user test picture and/or article description feature vector and the corresponding test picture and/or article benchmark feature vector, and apply a weighting operation to the differences to obtain the user's picture writing style feature vector TPCVE_USER and article writing style feature vector TTCVE_USER, respectively. The user writing style feature vector generation module applies a weighting operation to TPCVE_USER and TTCVE_USER to obtain the user writing style feature vector TVE_USER; the weights of the weighting operation may be chosen as actually needed.
TVE_USER=TPCVE_USER*WP+TTCVE_USER*WT
Here, WP is the weight coefficient of the user picture writing style feature vector TPCVE_USER, and WT is the weight coefficient of the user article writing style feature vector TTCVE_USER. When the user takes only the picture writing test or only the article writing test, the weight coefficient of the test taken can be set to 1 and the weight coefficient of the test not taken can be set to 0. Preferably, the two weights are chosen to be equal.
The user writing style feature vector can be expressed as TVE_USER=[TVE_1, ..., TVE_m, ..., TVE_n], where TVE_1 is the first entry of the user writing style feature vector, TVE_m is the m-th entry, and TVE_n is the n-th entry.
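The difference and weighting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, vectors are plain lists of floats, and the default weights of 0.5 each are one possible reading of "equal weights" (the text does not say that the weights must sum to 1).

```python
from typing import List, Sequence


def style_difference(user_vec: Sequence[float],
                     reference_vec: Sequence[float]) -> List[float]:
    """Entry-wise difference, e.g. TTCVE_ID_USER - TTCVE_ID."""
    if len(user_vec) != len(reference_vec):
        raise ValueError("vectors must have the same dimension")
    return [u - r for u, r in zip(user_vec, reference_vec)]


def combine_style_vectors(tpcve_user: Sequence[float],
                          ttcve_user: Sequence[float],
                          wp: float = 0.5,
                          wt: float = 0.5) -> List[float]:
    """TVE_USER = TPCVE_USER * WP + TTCVE_USER * WT (entry-wise).

    The 0.5 defaults are an assumption, not values given in the text.
    """
    if len(tpcve_user) != len(ttcve_user):
        raise ValueError("vectors must have the same dimension")
    return [wp * p + wt * t for p, t in zip(tpcve_user, ttcve_user)]
```

Setting `wp=1, wt=0` reproduces the case in which the user took only the picture writing test.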
According to a specific embodiment of the present invention, the user detection mode determination module further prompts the user to upload the document to be reviewed, and the to-be-reviewed document feature value generation module generates the feature values of that document. The feature values include but are not limited to: Chinese character count, foreign-language character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage, and part-of-speech usage. According to a specific embodiment of the present invention, these feature values are defined as follows.
The Chinese character count is the number of Chinese characters in each document to be reviewed, excluding punctuation marks; each Chinese character counts as one character.
The foreign-language character count is the number of foreign-language characters in each document to be reviewed, excluding punctuation marks; each foreign-language character counts as one character.
The word count is the total number of words obtained after segmenting each document to be reviewed. Chinese text can be segmented with the segmentation dictionary built into the system, and foreign-language text can be segmented according to foreign-language writing conventions, directly at the spaces between words.
The content word count is the number of content words in each document to be reviewed, obtained after segmentation by comparing the segmentation result with the parts of speech in the segmentation dictionary. It can be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count.
The function word count is the number of function words in each document to be reviewed, obtained after segmentation by comparing the segmentation result with the parts of speech in the segmentation dictionary. It can be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count.
The paragraph count is the number of paragraphs in each document to be reviewed. The paragraph length distribution is the number of words and the number of sentences in each paragraph.
The sentence count is the number of sentences in each document to be reviewed. The sentence length distribution is the number of words in each sentence.
Synonym and near-synonym expansion is obtained by comparing the segmentation result of each document to be reviewed with a synonym and near-synonym library, forming a set from words with the same or similar meaning, and counting the words in each set. This reflects the author's habits in using synonyms and near-synonyms: the more words a set contains, the more the author tends to vary wording with synonyms or near-synonyms; the fewer words a set contains, the less the author tends to do so.
Function word usage is the statistics of function word use in each document to be reviewed, including but not limited to the usage ranking of function words, the number of words between each pair of different function words, and the number of words between each pair of identical function words. For example, the use of the three structural auxiliary words "的", "地" and "得" can be counted further, which reflects whether the author of the document to be reviewed distinguishes between them.
Punctuation usage is the statistics of punctuation use in each document to be reviewed, including but not limited to the usage ranking of punctuation marks, the number of words between each pair of different punctuation marks, and the number of words between each pair of identical punctuation marks.
Part-of-speech usage is the statistics of words of each part of speech in each document to be reviewed, obtained after segmentation by comparing the segmentation result with the parts of speech in the segmentation dictionary. For example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words are obtained, together with the ratio of each to the total word count of the document.
According to a specific embodiment of the present invention, the to-be-reviewed document feature vector generation module generates the feature vector of the document to be reviewed from its feature values. According to a specific embodiment of the present invention, the system specifies the dimension of this feature vector as well as the content and order of its entries. The dimension, entry content and entry order must be consistent with the dimension of the test picture reference feature vector and the test article reference feature vector and with the meaning and order of the feature values in them. When the dimension of the feature vector of the document to be reviewed is n, it can be expressed as TDCVE_USER=[TDC_1, ..., TDC_m, ..., TDC_n], where TDC_1 is the first entry of the feature vector, TDC_m is the m-th entry, and TDC_n is the n-th entry.
Preferably, the feature vector of the document to be reviewed includes: the ratio of the Chinese character count to the total word count; the ratio of the foreign-language character count to the total word count; the ratio of the content word count to the total word count; the ratio of the function word count to the total word count; the ratio of the total word count to the paragraph count; the word count of the longest paragraph; the ratio of the synonym and near-synonym expansion count to the total word count; the ratio of the punctuation mark count to the total word count; and the ratios to the total word count of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words.
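As a rough illustration of how such a fixed-order feature vector might be assembled from raw counts, consider the following Python sketch. The key names, the reduced set of entries and the function name are all hypothetical; the text only requires that the dimension and the order of entries match the reference feature vectors.

```python
from typing import Dict, List

# Hypothetical, shortened entry order; a real system would list every
# ratio named in the text, in the order fixed by the system.
RATIO_KEYS = ["chinese_chars", "foreign_chars", "content_words",
              "function_words", "punctuation_marks"]


def build_feature_vector(counts: Dict[str, int],
                         paragraph_count: int,
                         longest_paragraph_words: int) -> List[float]:
    """Return [ratios to total word count..., words per paragraph, longest paragraph]."""
    total = counts["total_words"]
    if total <= 0 or paragraph_count <= 0:
        raise ValueError("document must contain at least one word and one paragraph")
    # Each listed count is divided by the total word count; missing counts are 0.
    vector = [counts.get(key, 0) / total for key in RATIO_KEYS]
    vector.append(total / paragraph_count)         # total word count / paragraph count
    vector.append(float(longest_paragraph_words))  # word count of the longest paragraph
    return vector
```

Keeping the entry order in a single constant is one simple way to guarantee that test vectors and to-be-reviewed vectors stay aligned.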
The user writing style similarity calculation module calculates the current user's writing style similarity, which can be calculated by the following formula:
The user writing style similarity judgment module compares the current user's writing style similarity SimT(USER) with the self-review threshold preset by the system. When SimT(USER) is higher than the self-review threshold, the document to be reviewed submitted by the current user is considered inconsistent with the user's writing style. When SimT(USER) is lower than the self-review threshold, the document is considered consistent with the user's writing style.
The self-review threshold is set in advance by the system. If the threshold is set too high, documents submitted by the current user are easily misjudged as inconsistent with the user's writing style; if it is set too low, they are easily misjudged as consistent. In general, the system selects and verifies the self-review threshold in advance through experiments and can adjust it at any time according to how the system is running.
According to a specific embodiment of the present invention, a first self-review threshold and a second self-review threshold can be set, with the first higher than the second. When SimT(USER) is higher than the first self-review threshold, the document to be reviewed submitted by the current user is considered inconsistent with the user's writing style. When SimT(USER) is lower than the second self-review threshold, the document is considered consistent with the user's writing style. When SimT(USER) is greater than or equal to the second self-review threshold and less than or equal to the first, the user's writing style is verified further.
The first and second self-review thresholds are set in advance by the system. If the first self-review threshold is set too high, documents submitted by the current user are easily misjudged as inconsistent with the user's writing style; if the second is set too low, they are easily misjudged as consistent. If the interval between the two thresholds is too wide, the user's writing style will be re-verified too often. In general, the system selects and verifies both thresholds in advance through experiments and can adjust them at any time according to how the system is running.
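The two-threshold rule can be written as a small decision function. The sketch below follows the comparison directions stated in the text, where a value above the first threshold means "inconsistent", so SimT(USER) behaves like a distance here. The enum and function names are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    INCONSISTENT = "inconsistent"
    CONSISTENT = "consistent"
    NEEDS_FURTHER_CHECK = "needs further check"


def classify_similarity(sim_t: float,
                        first_threshold: float,
                        second_threshold: float) -> Verdict:
    """Apply the first (upper) and second (lower) self-review thresholds."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be higher than the second")
    if sim_t > first_threshold:
        return Verdict.INCONSISTENT
    if sim_t < second_threshold:
        return Verdict.CONSISTENT
    # second_threshold <= sim_t <= first_threshold: verify the style further
    return Verdict.NEEDS_FURTHER_CHECK
```

Values exactly equal to either threshold fall into the middle band, matching the "greater than or equal to" and "less than or equal to" wording.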
According to a specific embodiment of the present invention, further verifying the user's writing style means that the user writing style structural auxiliary word judgment module judges how the three structural auxiliary words "的", "地" and "得" are used in the document to be reviewed and in the user's test picture description text and/or test article description text. This reflects how well the author of the document to be reviewed and the current user each distinguish between the three words. For the document to be reviewed, the module counts the occurrences of "的", "地" and "得" in the full text, denoted T1, T2 and T3 respectively. It then counts the number of times the word following "的" is a noun, denoted D1; the number of times the word following "地" is a verb, denoted D2; and the number of times the word following "得" is an adjective, denoted D3. It calculates the ratio D1/T1 of the noun count after "的" to the total occurrences of "的", the ratio D2/T2 of the verb count after "地" to the total occurrences of "地", and the ratio D3/T3 of the adjective count after "得" to the total occurrences of "得". It then calculates the differentiation coefficient DC_TD for "的", "地" and "得". The value of DC_TD is greater than or equal to 0 and less than or equal to 3.
For the user's test picture description text and/or test article description text, the module counts the occurrences of "的", "地" and "得" in the full text, denoted T1', T2' and T3' respectively. If the user was tested with several pictures and/or several articles, all description texts are merged and treated as the full text. The module then counts, in the full description text, the number of times the word following "的" is a noun, denoted D1'; the number of times the word following "地" is a verb, denoted D2'; and the number of times the word following "得" is an adjective, denoted D3'. It calculates the ratios D1'/T1', D2'/T2' and D3'/T3' in the same way, and then calculates the differentiation coefficient DC_TPT for "的", "地" and "得". The value of DC_TPT is greater than or equal to 0 and less than or equal to 3.
The user writing style structural auxiliary word judgment module calculates the offset degree DC_SC between the differentiation coefficients DC_TD and DC_TPT, that is, it normalizes the absolute value of the difference between DC_TD and DC_TPT.
When the value of DC_SC is less than or equal to the DC_SC judgment threshold, the user writing style structural auxiliary word judgment module judges that the author of the document to be reviewed and the user who wrote the test picture description text and/or test article description text have a consistent style in using the three structural auxiliary words "的", "地" and "得". When the value of DC_SC is greater than the DC_SC judgment threshold, the module judges that their styles in using the three words are inconsistent. The DC_SC judgment threshold is set in advance by the system and can be adjusted at any time according to actual needs. Experimental data from the early operation of the system indicate that a value of DC_SC less than or equal to 10% reflects well that the two styles are consistent, and that a value greater than 10% can be taken to mean that they are inconsistent.
The user writing style judgment module is used when the user writing style similarity SimT(USER) is greater than or equal to the second self-review threshold and less than or equal to the first self-review threshold. In that case it uses the offset degree DC_SC to judge further whether the document to be reviewed submitted by the current user is consistent with the user's writing style. When DC_SC is greater than the DC_SC judgment threshold, the document is considered inconsistent with the user's writing style; when DC_SC is less than or equal to the DC_SC judgment threshold, the document is considered consistent with it.
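The structural auxiliary word check can be sketched as follows. The text gives neither the formula for the differentiation coefficient nor the normalization, so this sketch makes three loudly labeled assumptions: the coefficient is taken to be the sum of the three ratios (which is consistent with the stated range of 0 to 3), the normalization divides the absolute difference by 3, and the part-of-speech tags "n", "v" and "a" are a hypothetical tag set. Input is a list of (word, part-of-speech) pairs from any segmenter.

```python
from typing import Dict, List, Tuple

# Assumed tag set: "n" noun, "v" verb, "a" adjective (hypothetical).
EXPECTED_NEXT_POS: Dict[str, str] = {"的": "n", "地": "v", "得": "a"}


def differentiation_coefficient(tagged: List[Tuple[str, str]]) -> float:
    """Sum of D1/T1, D2/T2 and D3/T3 (assumed formula; range 0 to 3)."""
    total = {word: 0 for word in EXPECTED_NEXT_POS}
    matched = {word: 0 for word in EXPECTED_NEXT_POS}
    for i, (word, _pos) in enumerate(tagged):
        if word not in EXPECTED_NEXT_POS:
            continue
        total[word] += 1
        # Count the occurrence if the following word has the expected part of speech.
        if i + 1 < len(tagged) and tagged[i + 1][1] == EXPECTED_NEXT_POS[word]:
            matched[word] += 1
    # Words that never occur contribute nothing (avoids division by zero).
    return sum(matched[w] / total[w] for w in EXPECTED_NEXT_POS if total[w] > 0)


def offset_degree(dc_td: float, dc_tpt: float) -> float:
    """DC_SC: absolute difference normalized by the maximum value 3 (assumption)."""
    return abs(dc_td - dc_tpt) / 3.0


def same_auxiliary_style(dc_td: float, dc_tpt: float, threshold: float = 0.10) -> bool:
    """True when DC_SC is within the judgment threshold (10% in the text)."""
    return offset_degree(dc_td, dc_tpt) <= threshold
```

A different aggregation, such as a weighted sum of the three ratios, would fit the same description; the sum is used here only because it is the simplest choice matching the stated range.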
According to a specific embodiment of the present invention, the user access mode detection module prompts the user to upload the document to be identified.
When the user detection mode determination module determines that the current user's detection mode is the common plagiarism identification mode, the to-be-identified document word segmentation module segments the document to be identified and obtains the segmentation result, that is, its word segments (participles). When segmenting the document to be identified, the same processing flow must be used as for the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document word-segment part-of-speech classification module further obtains the part of speech corresponding to each word segment. The part-of-speech classification is consistent with the classification used for the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document word-segment feature value generation module generates the word-segment feature values of the document to be identified. It counts the number of times each word segment occurs in the document and obtains, for each word segment, the feature value WCV_TBI=[W_ID, W_N], where W_ID is the unique number of the word segment in the segmentation dictionary and W_N is the total number of times the word segment occurs in the document to be identified. Preferably, the part of speech of each word segment is also taken into account, giving the word-segment part-of-speech feature value WCCV_TBI=[W_ID, W_N, W_CHAR], where W_ID and W_N are as above and W_CHAR is the part of speech of the word segment.
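A minimal Python sketch of the counting step is shown below. It assumes the document has already been segmented and tagged into (word, part-of-speech) pairs, and that a hypothetical `dictionary_ids` mapping stands in for the segmentation dictionary that supplies W_ID; neither name comes from the text.

```python
from collections import Counter
from typing import Dict, List, Tuple


def word_segment_feature_values(tagged: List[Tuple[str, str]],
                                dictionary_ids: Dict[str, int]) -> List[list]:
    """Return one [W_ID, W_N, W_CHAR] entry per distinct word segment."""
    counts = Counter(word for word, _pos in tagged)
    first_pos: Dict[str, str] = {}
    for word, pos in tagged:
        # Keep the first tag seen for each word (a simplification).
        first_pos.setdefault(word, pos)
    # Words missing from the dictionary are skipped in this sketch.
    return [[dictionary_ids[word], n, first_pos[word]]
            for word, n in counts.items() if word in dictionary_ids]
```

Dropping the last element of each entry gives the plain [W_ID, W_N] form of WCV_TBI.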
According to a specific embodiment of the present invention, the to-be-identified document word-segment tightening coefficient generation module generates the word-segment tightening coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightening coefficients of each word segment can be expressed as WGC_TBI=[G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of word segments between the first and second occurrences of the word segment in the document to be identified, G_W_ID_2 is the number of word segments between its second and third occurrences, and G_W_ID_(W_N-1) is the number of word segments between its (W_N-1)-th and W_N-th occurrences. G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the tightening coefficients of that word segment. According to a specific embodiment of the present invention, the tightening coefficients of each word segment can further be expressed in vector form as the word-segment tightening coefficient feature vector WGCVE_TBI=[W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the word segment in the segmentation dictionary, W_N is the total number of times the word segment occurs in the document to be identified, W_CHAR is its part of speech, and G_W_ID_1 through G_W_ID_(W_N-1) are the tightening coefficients defined above. From these tightening coefficients, the overall distribution of a specific word segment within the document to be identified can be determined. When the document is very long or its points are scattered, this avoids omitting key word-segment feature values when the feature vectors are screened by the total count W_N or by the ratio of W_N to the word-segment free vector dimension WFV. Preferably, a specific part of a document to be identified can also be extracted for comparison according to the tightening coefficients.
According to a specific embodiment of the present invention, the to-be-identified document word-segment free vector dimension determination module determines the word-segment free vector dimension WFV_TBI from the segmentation result of the document to be identified. When the document is short or contains few word segments, the resulting WFV_TBI is small; when the document is long or contains many word segments, the resulting WFV_TBI is large.
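The tightening coefficients amount to the gaps between consecutive occurrences of one word segment. The following Python sketch computes them from a list of word segments; the function name is hypothetical, and "the number of word segments between two occurrences" is read here as the count of segments strictly in between, which is an interpretation rather than something the text spells out.

```python
from typing import List


def tightening_coefficients(segments: List[str], target: str) -> List[int]:
    """Return [G_1, ..., G_(W_N-1)]: segments strictly between consecutive occurrences."""
    positions = [i for i, seg in enumerate(segments) if seg == target]
    # A word segment that occurs W_N times yields W_N - 1 gaps.
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]
```

Small gaps indicate that the word segment is concentrated in one part of the document, while large gaps indicate that it is spread out, which is the distribution information the text refers to.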
When the user detection mode determination module determines that the current user's detection mode is the extended plagiarism identification mode, the to-be-identified document word-segment group module segments the document to be identified and obtains the word-segment group results. Word segments with the same or similar meaning form one group and are numbered by group, so that multiple word segments with the same or similar meaning correspond to one word-segment group number. When segmenting the document to be identified, the same processing flow must be used as for the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group part-of-speech classification module further obtains the part of speech corresponding to each word-segment group. The part-of-speech classification of word-segment groups is consistent with the classification used for the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group feature value generation module generates the word-segment group feature values of the document to be identified. It counts the number of times each word-segment group occurs in the document and obtains, for each group, the feature value WGCV_TBI=[WG_ID, WG_N], where WG_ID is the unique number of the word-segment group in the segmentation dictionary and WG_N is the total number of times the group occurs in the document to be identified. Preferably, the part of speech of each word-segment group is also taken into account, giving the word-segment group part-of-speech feature value WGCCV_TBI=[WG_ID, WG_N, WG_CHAR], where WG_ID and WG_N are as above and WG_CHAR is the part of speech of the word-segment group.
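Group-level counting differs from word-level counting only in that each word segment is first mapped to its group number. The Python sketch below assumes a hypothetical `group_of` mapping from word segments to WG_ID values, standing in for the synonym and near-synonym grouping the text describes.

```python
from collections import Counter
from typing import Dict, List


def group_feature_values(segments: List[str],
                         group_of: Dict[str, int]) -> List[List[int]]:
    """Return [WG_ID, WG_N] pairs, sorted by group number."""
    # Segments that belong to no group are ignored in this sketch.
    counts = Counter(group_of[seg] for seg in segments if seg in group_of)
    return [[group_id, n] for group_id, n in sorted(counts.items())]
```

Because synonyms share one group number, a passage that was reworded by swapping in synonyms still produces the same group counts as the original.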
According to a specific embodiment of the present invention, the to-be-identified document word-segment group tightening coefficient generation module generates the word-segment group tightening coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightening coefficients of each word-segment group can be expressed as WGGC_TBI=[G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 is the number of word segments between the first and second occurrences of the word-segment group in the document to be identified, G_WG_ID_2 is the number of word segments between its second and third occurrences, and G_WG_ID_(WG_N-1) is the number of word segments between its (WG_N-1)-th and WG_N-th occurrences. G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the tightening coefficients of that word-segment group. According to a specific embodiment of the present invention, the tightening coefficients of each word-segment group can further be expressed in vector form as the word-segment group tightening coefficient feature vector WGGCVE_TBI=[WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID is the unique number of the word-segment group in the segmentation dictionary, WG_N is the total number of times the group occurs in the document to be identified, WG_CHAR is its part of speech, and G_WG_ID_1 through G_WG_ID_(WG_N-1) are the tightening coefficients defined above. From these tightening coefficients, the overall distribution of a specific word-segment group within the document to be identified can be determined. When the document is very long or its points are scattered, this avoids omitting key feature values when the feature vectors are screened by the total count or by the ratio of the total count to the free vector dimension. Preferably, a specific part of a document to be identified can also be extracted for comparison according to the tightening coefficients.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group free vector dimension determination module determines the word-segment group free vector dimension WGFV_TBI from the segmentation result of the document to be identified. When the document is short or contains few word segments, the resulting WGFV_TBI is small; when the document is long or contains many word segments, the resulting WGFV_TBI is large.
When the user detection mode determination module determines that the current user's detection mode is the multilingual plagiarism identification mode, the to-be-identified document Chinese-foreign word-segment group module segments the document to be identified and obtains the Chinese-foreign word-segment group results. Chinese and foreign-language word segments with the same or similar meaning form one group and are numbered by group, so that multiple Chinese and foreign-language word segments with the same or similar meaning correspond to one Chinese-foreign word-segment group number. When segmenting the document to be identified, the same processing flow must be used as for the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document word-segment group part-of-speech classification module further obtains the part of speech corresponding to each word-segment group. The part-of-speech classification of word-segment groups is consistent with the classification used for the material in the comparison database.
According to a specific embodiment of the present invention, the to-be-identified document Chinese-foreign word-segment group feature value generation module generates the Chinese-foreign word-segment group feature values of the document to be identified. It counts the number of times each Chinese-foreign word-segment group occurs in the document and obtains, for each group, the feature value WFGCV_TBI=[WFG_ID, WFG_N], where WFG_ID is the unique number of the Chinese-foreign word-segment group in the segmentation dictionary and WFG_N is the total number of times the group occurs in the document to be identified. Preferably, the part of speech of each Chinese-foreign word-segment group is also taken into account, giving the Chinese-foreign word-segment group part-of-speech feature value WFGCCV_TBI=[WFG_ID, WFG_N, WFG_CHAR], where WFG_ID and WFG_N are as above and WFG_CHAR is the part of speech of the group.
According to a specific embodiment of the present invention, the Chinese-foreign-language participle group tightening coefficient generation module of the document to be identified is used to generate the Chinese-foreign-language participle group tightening coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightening coefficients corresponding to each Chinese-foreign-language participle group can be expressed as WFGGC_TBI=[G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 represents the number of participles between the first and second occurrences of the Chinese-foreign-language participle group in the document to be identified, G_WFG_ID_2 represents the number of participles between its second and third occurrences, and G_WFG_ID_(WFG_N-1) represents the number of participles between its (WFG_N-1)-th and WFG_N-th occurrences; G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the tightening coefficients corresponding to that Chinese-foreign-language participle group. According to a specific embodiment of the present invention, the tightening coefficients corresponding to each Chinese-foreign-language participle group can further be expressed in vector form as the Chinese-foreign-language participle group tightening coefficient characteristic vector WFGGCVE_TBI=[WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID represents the unique number of the Chinese-foreign-language participle group in the participle library, WFG_N represents the total number of occurrences of that participle group in the document to be identified, WFG_CHAR represents the part of speech of the participle group, and G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the tightening coefficients defined above. From this characteristic vector, the overall distribution of a specific Chinese-foreign-language participle group in the corresponding document to be identified can be known.
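The tightening coefficients described above can be illustrated with a minimal sketch. It assumes the document is already a list of participles in which a participle group appears as a single item, and that the "number of participles between" two occurrences counts only the items strictly between them; the function names and the example part-of-speech tag are hypothetical and do not come from the patent.

```python
def occurrence_gaps(tokens, group):
    """Return (occurrence count, gaps between consecutive occurrences)."""
    positions = [i for i, tok in enumerate(tokens) if tok == group]
    # Assumption: a gap counts only the items strictly between two occurrences.
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    return len(positions), gaps


def tightening_feature_vector(tokens, group, group_id, pos_tag):
    """Build [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, ..., G_WFG_ID_(WFG_N-1)]."""
    count, gaps = occurrence_gaps(tokens, group)
    return [group_id, count, pos_tag, *gaps]


# Hypothetical example: group "a" occurs at positions 0, 3 and 5.
tokens = ["a", "x", "y", "a", "z", "a"]
vector = tightening_feature_vector(tokens, "a", group_id=7, pos_tag="n")
```

A group that occurs WFG_N times yields WFG_N-1 gaps, so the vector length varies from group to group, as in the text.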
According to a specific embodiment of the present invention, the Chinese-foreign-language participle group free vector dimension determining module of the document to be identified is used to determine the Chinese-foreign-language participle group free vector dimension WFGFV_TBI according to the word segmentation result of the document to be identified. When the document to be identified is short or yields few segmentation results, the resulting WFGFV_TBI is small; when the document is long or yields many segmentation results, the resulting WFGFV_TBI is large.
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module of the document to be identified is used to simplify the participle free vector dimension WFV_TBI of the document to be identified and to generate the participle simplified vector dimension RWV_TBI of the document to be identified. RWV_TBI is specified by the system. Preferably, the system specifies RWV_TBI as 500. Preferably, the system specifies RWV_TBI as 800. Preferably, the system specifies RWV_TBI as 1000.
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module of the document to be identified uses an equal-interval extraction method to simplify the participle free vector dimension WFV_TBI of the document to be identified. The simplification process is as follows. Judge whether WFV_TBI is greater than the participle simplified vector dimension RWV_TBI. If so, divide WFV_TBI by the system-specified RWV_TBI and round the quotient up to obtain the simplification coefficient REDU_TBI of the document to be identified. Then, among the characteristic values corresponding to WFV_TBI, extract one characteristic value at an interval of REDU_TBI-1 characteristic values. After all characteristic values have been processed, judge whether the number of extracted characteristic values equals RWV_TBI. If it does, the simplification of WFV_TBI is complete. If it is less than RWV_TBI, calculate the difference between RWV_TBI and the number of extracted characteristic values, and randomly extract that many characteristic values from those not yet extracted, which completes the simplification of WFV_TBI.
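The equal-interval extraction just described can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: extraction is assumed to start at the first characteristic value and take every REDU_TBI-th value after it, the result is returned in the original order, and the function name and the optional `rng` parameter are hypothetical.

```python
import math
import random


def reduce_equal_interval(features, target_dim, rng=None):
    """Reduce `features` to `target_dim` items by equal-interval extraction."""
    rng = rng or random.Random()
    n = len(features)                      # plays the role of WFV_TBI
    if n <= target_dim:                    # nothing to reduce; handled separately in the text
        return list(features)
    step = math.ceil(n / target_dim)       # plays the role of REDU_TBI
    picked = list(range(0, n, step))       # assumption: start at index 0, skip step-1 values
    shortfall = target_dim - len(picked)   # never negative, because step >= n / target_dim
    if shortfall > 0:
        picked_set = set(picked)
        remaining = [i for i in range(n) if i not in picked_set]
        picked += rng.sample(remaining, shortfall)  # random top-up from unextracted values
    return [features[i] for i in sorted(picked)]
```

For example, with 7 characteristic values and a target of 5, the coefficient is 2, four values are extracted at equal intervals, and one more is drawn at random from the remaining three.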
According to a specific embodiment of the present invention, the participle simplified vector dimension generation module of the document to be identified uses a part-of-speech screening method to simplify the participle free vector dimension WFV_TBI of the document to be identified. The simplification process is as follows. The characteristic values are classified according to the part of speech of the corresponding participle; according to a specific embodiment of the present invention, they are divided into A1-class notional word characteristic values, A2-class notional word characteristic values, B-class notional word characteristic values, C-class notional word characteristic values, D-class notional word characteristic values, and V-class function word characteristic values. It is generally considered that characteristic values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of characteristic values in each class is counted: AMOUNT_A1 (A1-class notional word characteristic values), AMOUNT_A2 (A2-class notional word characteristic values), AMOUNT_B (B-class notional word characteristic values), AMOUNT_C (C-class notional word characteristic values), AMOUNT_D (D-class notional word characteristic values), and AMOUNT_V (V-class function word characteristic values). Calculate the value RWV_TBI_S_V of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_D of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWV_TBI_S_D is greater than 0, randomly extract RWV_TBI_S_D characteristic values from the characteristic values corresponding to AMOUNT_V to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_C of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWV_TBI_S_C is greater than 0, randomly extract RWV_TBI_S_C characteristic values from the characteristic values corresponding to AMOUNT_D to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_B of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWV_TBI_S_B is greater than 0, randomly extract RWV_TBI_S_B characteristic values from the characteristic values corresponding to AMOUNT_C to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_A2 of RWV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWV_TBI_S_A2 is greater than 0, randomly extract RWV_TBI_S_A2 characteristic values from the characteristic values corresponding to AMOUNT_B to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWV_TBI_S_A1 of RWV_TBI-AMOUNT_A1. If RWV_TBI_S_A1 is greater than 0, randomly extract RWV_TBI_S_A1 characteristic values from the characteristic values corresponding to AMOUNT_A2 to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, randomly extract RWV_TBI characteristic values from the characteristic values corresponding to AMOUNT_A1 to complete this simplification.
When the value RWV_TBI_S_V of RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or carries little information and is therefore not suitable for comparison using characteristic values.
When the participle free vector dimension WFV_TBI of the document to be identified is less than the participle simplified vector dimension RWV_TBI, the dimension itself is small; the values under the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded for separate processing.
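The part-of-speech screening cascade can be restated as a single loop over the classes in priority order. The sketch below assumes that, at each step, the characteristic values of all higher-priority classes are kept in full and only the last class that does not fit is sampled at random; the text implies this but does not state it outright. The function name, the dictionary layout and the `rng` parameter are hypothetical.

```python
import random

# Classes from most to least informative, as listed in the text.
PRIORITY = ["A1", "A2", "B", "C", "D", "V"]


def reduce_by_part_of_speech(features_by_class, target_dim, rng=None):
    """Return `target_dim` features chosen by class priority, or None on exit."""
    rng = rng or random.Random()
    total = sum(len(features_by_class.get(c, [])) for c in PRIORITY)
    if total < target_dim:
        # Corresponds to RWV_TBI_S_V > 0: document too short, exit the simplification.
        return None
    kept = []
    for cls in PRIORITY:
        room = target_dim - len(kept)          # the running difference RWV_TBI_S_*
        if room == 0:
            break
        candidates = features_by_class.get(cls, [])
        if len(candidates) <= room:
            kept.extend(candidates)            # the whole class fits; keep all of it
        else:
            kept.extend(rng.sample(candidates, room))  # random draw from this class
            break
    return kept
```

With a target of 4 and classes of sizes A1=2, A2=1, B=2, the loop keeps all A1 and A2 values and draws one B value at random, which matches the branch where RWV_TBI_S_A2 is greater than 0.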
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module of the document to be identified is used to simplify the participle group free vector dimension WGFV_TBI of the document to be identified and to generate the participle group simplified vector dimension RWGV_TBI of the document to be identified. RWGV_TBI is specified by the system. Preferably, the system specifies RWGV_TBI as 500. Preferably, the system specifies RWGV_TBI as 800. Preferably, the system specifies RWGV_TBI as 1000.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module of the document to be identified uses the equal-interval extraction method to simplify the participle group free vector dimension WGFV_TBI of the document to be identified. The simplification process is as follows. Judge whether WGFV_TBI is greater than the participle group simplified vector dimension RWGV_TBI. If so, divide WGFV_TBI by the system-specified RWGV_TBI and round the quotient up to obtain the simplification coefficient REDU_TBI. Then, among the characteristic values corresponding to WGFV_TBI, extract one characteristic value at an interval of REDU_TBI-1 characteristic values. After all characteristic values have been processed, judge whether the number of extracted characteristic values equals RWGV_TBI. If it does, the simplification of WGFV_TBI is complete. If it is less than RWGV_TBI, calculate the difference between RWGV_TBI and the number of extracted characteristic values, and randomly extract that many characteristic values from those not yet extracted, which completes the simplification of WGFV_TBI.
According to a specific embodiment of the present invention, the participle group simplified vector dimension generation module of the document to be identified uses the part-of-speech screening method to simplify the participle group free vector dimension WGFV_TBI of the document to be identified. The simplification process is as follows. The characteristic values are classified according to the part of speech of the corresponding participle group; according to a specific embodiment of the present invention, they are divided into A1-class notional word characteristic values, A2-class notional word characteristic values, B-class notional word characteristic values, C-class notional word characteristic values, D-class notional word characteristic values, and V-class function word characteristic values. It is generally considered that characteristic values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of characteristic values in each class is counted: AMOUNT_A1 (A1-class notional word characteristic values), AMOUNT_A2 (A2-class notional word characteristic values), AMOUNT_B (B-class notional word characteristic values), AMOUNT_C (C-class notional word characteristic values), AMOUNT_D (D-class notional word characteristic values), and AMOUNT_V (V-class function word characteristic values). Calculate the value RWGV_TBI_S_V of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_D of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWGV_TBI_S_D is greater than 0, randomly extract RWGV_TBI_S_D characteristic values from the characteristic values corresponding to AMOUNT_V to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_C of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWGV_TBI_S_C is greater than 0, randomly extract RWGV_TBI_S_C characteristic values from the characteristic values corresponding to AMOUNT_D to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_B of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWGV_TBI_S_B is greater than 0, randomly extract RWGV_TBI_S_B characteristic values from the characteristic values corresponding to AMOUNT_C to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_A2 of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWGV_TBI_S_A2 is greater than 0, randomly extract RWGV_TBI_S_A2 characteristic values from the characteristic values corresponding to AMOUNT_B to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWGV_TBI_S_A1 of RWGV_TBI-AMOUNT_A1. If RWGV_TBI_S_A1 is greater than 0, randomly extract RWGV_TBI_S_A1 characteristic values from the characteristic values corresponding to AMOUNT_A2 to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, randomly extract RWGV_TBI characteristic values from the characteristic values corresponding to AMOUNT_A1 to complete this simplification.
When the value RWGV_TBI_S_V of RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or carries little information and is therefore not suitable for comparison using characteristic values.
When the participle group free vector dimension WGFV_TBI of the document to be identified is less than the participle group simplified vector dimension RWGV_TBI, the dimension itself is small; the values under the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded for separate processing.
According to a specific embodiment of the present invention, the Chinese-foreign-language participle group simplified vector dimension generation module of the document to be identified is used to simplify the Chinese-foreign-language participle group free vector dimension WFGFV_TBI of the document to be identified and to generate the Chinese-foreign-language participle group simplified vector dimension RWFGV_TBI of the document to be identified. RWFGV_TBI is specified by the system. Preferably, the system specifies RWFGV_TBI as 500. Preferably, the system specifies RWFGV_TBI as 800. Preferably, the system specifies RWFGV_TBI as 1000.
According to a specific embodiment of the present invention, the Chinese-foreign-language participle group simplified vector dimension generation module of the document to be identified uses the equal-interval extraction method to simplify the Chinese-foreign-language participle group free vector dimension WFGFV_TBI of the document to be identified. The simplification process is as follows. Judge whether WFGFV_TBI is greater than the Chinese-foreign-language participle group simplified vector dimension RWFGV_TBI. If so, divide WFGFV_TBI by the system-specified RWFGV_TBI and round the quotient up to obtain the simplification coefficient REDU_TBI. Then, among the characteristic values corresponding to WFGFV_TBI, extract one characteristic value at an interval of REDU_TBI-1 characteristic values. After all characteristic values have been processed, judge whether the number of extracted characteristic values equals RWFGV_TBI. If it does, the simplification of WFGFV_TBI is complete. If it is less than RWFGV_TBI, calculate the difference between RWFGV_TBI and the number of extracted characteristic values, and randomly extract that many characteristic values from those not yet extracted, which completes the simplification of WFGFV_TBI.
According to a specific embodiment of the present invention, the Chinese-foreign-language participle group simplified vector dimension generation module of the document to be identified uses the part-of-speech screening method to simplify the Chinese-foreign-language participle group free vector dimension WFGFV_TBI of the document to be identified. The simplification process is as follows. The characteristic values are classified according to the part of speech of the corresponding Chinese-foreign-language participle group; according to a specific embodiment of the present invention, they are divided into A1-class notional word characteristic values, A2-class notional word characteristic values, B-class notional word characteristic values, C-class notional word characteristic values, D-class notional word characteristic values, and V-class function word characteristic values. It is generally considered that characteristic values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of characteristic values in each class is counted: AMOUNT_A1 (A1-class notional word characteristic values), AMOUNT_A2 (A2-class notional word characteristic values), AMOUNT_B (B-class notional word characteristic values), AMOUNT_C (C-class notional word characteristic values), AMOUNT_D (D-class notional word characteristic values), and AMOUNT_V (V-class function word characteristic values). Calculate the value RWFGV_TBI_S_V of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_D of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWFGV_TBI_S_D is greater than 0, randomly extract RWFGV_TBI_S_D characteristic values from the characteristic values corresponding to AMOUNT_V to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_C of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWFGV_TBI_S_C is greater than 0, randomly extract RWFGV_TBI_S_C characteristic values from the characteristic values corresponding to AMOUNT_D to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_B of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWFGV_TBI_S_B is greater than 0, randomly extract RWFGV_TBI_S_B characteristic values from the characteristic values corresponding to AMOUNT_C to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_A2 of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWFGV_TBI_S_A2 is greater than 0, randomly extract RWFGV_TBI_S_A2 characteristic values from the characteristic values corresponding to AMOUNT_B to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, further calculate the value RWFGV_TBI_S_A1 of RWFGV_TBI-AMOUNT_A1. If RWFGV_TBI_S_A1 is greater than 0, randomly extract RWFGV_TBI_S_A1 characteristic values from the characteristic values corresponding to AMOUNT_A2 to complete this simplification; if it equals 0, this simplification is complete; if it is less than 0, randomly extract RWFGV_TBI characteristic values from the characteristic values corresponding to AMOUNT_A1 to complete this simplification.
When the value RWFGV_TBI_S_V of RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or carries little information and is therefore not suitable for comparison using characteristic values.
When the Chinese-foreign-language participle group free vector dimension WFGFV_TBI of the document to be identified is less than the Chinese-foreign-language participle group simplified vector dimension RWFGV_TBI, the dimension itself is small; the values under the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded for separate processing.
Preferably, to facilitate similarity comparison, the material participle simplified vector dimension RWV selected in the system should be equal to the participle simplified vector dimension RWV_TBI of the document to be identified; the material participle group simplified vector dimension RWGV should be equal to the participle group simplified vector dimension RWGV_TBI of the document to be identified; and the material Chinese-foreign-language participle group simplified vector dimension RWFGV should be equal to the Chinese-foreign-language participle group simplified vector dimension RWFGV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the participle characteristic vector generation module of the document to be identified extracts, from each document to be identified, the characteristic values corresponding to the participle simplified vector dimension RWV_TBI and generates the participle characteristic vector WVE_RWV_TBI of the document to be identified, where
WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_(RWV_TBI), W_N_(RWV_TBI)]
Here W_ID_i represents the unique number of a participle in the participle library, and W_N_i represents the total number of times that participle occurs in the document to be identified; this count is used as the characteristic value of the participle.
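A participle characteristic vector of this shape can be assembled from occurrence counts. The sketch below assumes the document has already been converted to a list of participle numbers and that the retained numbers (the output of the simplification step) are supplied by the caller; both the function name and these inputs are hypothetical.

```python
from collections import Counter


def participle_feature_vector(token_ids, selected_ids):
    """Build [W_ID_1, W_N_1, ..., W_ID_k, W_N_k] for the selected participle numbers."""
    counts = Counter(token_ids)        # total occurrences of each participle number
    vector = []
    for wid in selected_ids:
        vector += [wid, counts[wid]]   # number followed by its occurrence count
    return vector


# Hypothetical document in which participle 5 occurs three times and 9 once.
vector = participle_feature_vector([5, 2, 5, 9, 5], selected_ids=[5, 9])
```

Here `vector` interleaves numbers and counts in the same order as WVE_RWV_TBI.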
According to a specific embodiment of the present invention, when the user detection mode determining module judges that the current user detection mode is the ordinary plagiarism identification mode, similarity comparison proceeds as follows. The participle characteristic vector generation module of the document to be identified generates the participle characteristic vector of the document to be identified, WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_(RWV_TBI), W_N_(RWV_TBI)], whose dimension is RWV_TBI. The participle characteristic vector generation module generates the participle characteristic vector of the material in the comparison database, WVE_RWV=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_(RWV), W_N_(RWV)]. The dimension RWV_TBI of the participle characteristic vector of the document to be identified is equal to the dimension RWV of the participle characteristic vector of the material.
It should be noted that, although both WVE_RWV_TBI and WVE_RWV use W_ID_i to represent the unique number of a participle in the participle library and W_N_i to represent the total number of times that participle occurs in the corresponding document, with that count serving as the characteristic value of the participle, the W_ID_i values in WVE_RWV_TBI are very likely to differ from the W_ID_i values in WVE_RWV. Therefore, before similarity comparison, the dimensions of the two participle characteristic vectors must be adjusted to be consistent.
According to a specific embodiment of the present invention, the characteristic vector adjusting module of the document to be identified is used to arrange the W_ID_i values corresponding to all characteristic values in the participle characteristic vector WVE_RWV_TBI in ascending or descending order of their numbers in the participle library, and to insert the missing W_ID_i values; the characteristic value corresponding to each inserted participle number W_ID_i is 0. Assuming that the total number of participle numbers in the participle library is W, the number of participle numbers to be inserted is W-RWV_TBI, which yields the extended participle characteristic vector of the document to be identified, WVE_RWV_TBI_EXT=[W_ID_(TBI_EXT_1), W_N_(TBI_EXT_1), ..., W_ID_(TBI_EXT_i), W_N_(TBI_EXT_i), ..., W_ID_(TBI_EXT_RWV_TBI), W_N_(TBI_EXT_RWV_TBI), ..., W_ID_W, W_N_W].
According to a specific embodiment of the present invention, the material characteristic vector adjusting module is used to arrange the W_ID_i values corresponding to all characteristic values in the participle characteristic vector WVE_RWV in ascending or descending order of their numbers in the participle library, and to insert the missing W_ID_i values; the characteristic value corresponding to each inserted participle number W_ID_i is 0. Assuming that the total number of participle numbers in the participle library is W, the number of participle numbers to be inserted is W-RWV, which yields the extended participle characteristic vector WVE_RWV_EXT=[W_ID_(EXT_1), W_N_(EXT_1), ..., W_ID_(EXT_i), W_N_(EXT_i), ..., W_ID_(EXT_RWV), W_N_(EXT_RWV), ..., W_ID_W, W_N_W].
In this way, the participle characteristic vectors of both the document to be identified and the material in the comparison database are extended to dimension W and uniformly arranged in ascending or descending order of the numbers in the participle library, so that the characteristic values of the two participle characteristic vectors correspond dimension by dimension.
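The extension step can be sketched as below, assuming the participle numbers in the library run from 1 to W and that each characteristic vector is supplied as (number, count) pairs rather than as a flat list; the function name and this pair layout are hypothetical conveniences, not part of the patent.

```python
def extend_feature_vector(pairs, library_size):
    """Extend (number, count) pairs to every number 1..library_size, ascending."""
    counts = dict(pairs)
    # Missing participle numbers are inserted with characteristic value 0.
    return [(wid, counts.get(wid, 0)) for wid in range(1, library_size + 1)]


# Hypothetical vectors over a participle library with W = 4 entries.
document_ext = extend_feature_vector([(3, 2), (1, 5)], library_size=4)
material_ext = extend_feature_vector([(2, 1), (3, 4)], library_size=4)
```

After extension both vectors list the same participle numbers in the same order, so their characteristic values can be compared position by position.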
The ordinary plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database, using the following formula:
According to a specific embodiment of the present invention, when the user detection mode determining module judges that the current user detection mode is the extended plagiarism identification mode, similarity comparison proceeds as follows. The participle group characteristic vector generation module of the document to be identified generates the participle group characteristic vector of the document to be identified, WVE_RWGV_TBI=[WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_(RWGV_TBI), WG_N_(RWGV_TBI)], whose dimension is RWGV_TBI. The participle group characteristic vector generation module generates the participle group characteristic vector of the material in the comparison database, WVE_RWGV=[WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_(RWGV), WG_N_(RWGV)]. Here WG_ID_i represents the unique number of a participle group in the participle library, and WG_N_i represents the total number of times that participle group occurs in the corresponding document; this count is used as the characteristic value of the participle group. The dimension RWGV_TBI of the participle group characteristic vector of the document to be identified is equal to the dimension RWGV of the participle group characteristic vector of the material.
Similar to the processing in the common plagiarism identification mode, according to a specific embodiment of the present invention, in the extended plagiarism identification mode the to-be-identified document feature vector adjustment module adjusts the vector to obtain the extended word-segment group feature vector of the document to be identified, WVE_RWGV_TBI_EXT = [WG_ID_TBI_EXT_1, WG_N_TBI_EXT_1, ..., WG_ID_TBI_EXT_i, WG_N_TBI_EXT_i, ..., WG_ID_TBI_EXT_RWGV_TBI, WG_N_TBI_EXT_RWGV_TBI, ..., WG_ID_W, WG_N_W]. The material feature vector adjustment module adjusts the vector to obtain the extended word-segment group feature vector of the material, WVE_RWGV_EXT = [WG_ID_EXT_1, WG_N_EXT_1, ..., WG_ID_EXT_i, WG_N_EXT_i, ..., WG_ID_EXT_RWGV, WG_N_EXT_RWGV, ..., WG_ID_W, WG_N_W].
In this way, the word-segment group feature vectors of the document to be identified and of the material in the comparison library are both extended to dimension W and arranged uniformly in ascending or descending order of their numbering in the word-segment library, so that the feature values at each position of the two vectors refer to the same word-segment group.
The extended plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison library by the following formula:
According to a specific embodiment of the present invention, when the user detection mode determination module determines that the current user detection mode is the multilingual plagiarism identification mode, similarity comparison proceeds as follows. The to-be-identified document Chinese-foreign word-segment group feature vector generation module generates the Chinese-foreign word-segment group feature vector of the document to be identified, WVE_RWFGV_TBI = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV_TBI, WFG_N_RWFGV_TBI], whose dimension is RWFGV_TBI. The word-segment group feature vector generation module generates the Chinese-foreign word-segment group feature vector of the material in the comparison library, WVE_RWFGV = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV, WFG_N_RWFGV]. Here WFG_ID_i is the unique number of the Chinese-foreign word-segment group in the word-segment library, and WFG_N_i is the total number of times that Chinese-foreign word-segment group occurs in the document to be identified; this count is used as the feature value of the Chinese-foreign word-segment group. The dimension RWFGV_TBI of the Chinese-foreign word-segment group feature vector of the document to be identified equals the dimension RWFGV of the Chinese-foreign word-segment group feature vector of the material.
Similar to the processing in the common plagiarism identification mode, according to a specific embodiment of the present invention, in the multilingual plagiarism identification mode the to-be-identified document feature vector adjustment module adjusts the vector to obtain the extended Chinese-foreign word-segment group feature vector of the document to be identified, WVE_RWFGV_TBI_EXT = [WFG_ID_TBI_EXT_1, WFG_N_TBI_EXT_1, ..., WFG_ID_TBI_EXT_i, WFG_N_TBI_EXT_i, ..., WFG_ID_TBI_EXT_RWFGV_TBI, WFG_N_TBI_EXT_RWFGV_TBI, ..., WFG_ID_W, WFG_N_W]. The material feature vector adjustment module adjusts the vector to obtain the extended Chinese-foreign word-segment group feature vector of the material, WVE_RWFGV_EXT = [WFG_ID_EXT_1, WFG_N_EXT_1, ..., WFG_ID_EXT_i, WFG_N_EXT_i, ..., WFG_ID_EXT_RWFGV, WFG_N_EXT_RWFGV, ..., WFG_ID_W, WFG_N_W].
In this way, the Chinese-foreign word-segment group feature vectors of the document to be identified and of the material in the comparison library are both extended to dimension W and arranged uniformly in ascending or descending order of their numbering in the word-segment library, so that the feature values at each position of the two vectors refer to the same Chinese-foreign word-segment group.
The multilingual plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison library by the following formula:
According to a specific embodiment of the present invention, to keep the extended dimension from becoming too large, the extension may instead be based on the union of two ID sets. All word-segment IDs in the word-segment feature vector WVE_RWV_TBI form one set and the word-segment IDs in WVE_RWV form the other; or all word-segment group IDs in the word-segment group feature vector WVE_RWGV_TBI form one set and those in WVE_RWGV form the other; or all Chinese-foreign word-segment group IDs in the Chinese-foreign word-segment group feature vector WVE_RWFGV_TBI form one set and those in WVE_RWFGV form the other. The union of the two sets gives the total ID set. The feature vectors of the document to be identified and of the material in the comparison library are each extended according to this total ID set: the IDs corresponding to all feature values are sorted in ascending or descending order of their numbering in the word-segment library, and every W_ID_i value that is contained in the total word-segment ID set but not in the vector's own original set is inserted with a feature value of 0. Likewise, every WG_ID_i value contained in the total word-segment group ID set but not in the vector's own original set is inserted with a feature value of 0, and every WFG_ID_i value contained in the total Chinese-foreign word-segment group ID set but not in the vector's own original set is inserted with a feature value of 0.
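The union-based alternative can be sketched in the same way. This is a minimal illustration under the assumption that each sparse vector is a dictionary from ID to count; the name `align_on_union` is hypothetical.

```python
def align_on_union(doc_sparse, material_sparse, descending=False):
    """Align two sparse {ID: count} vectors on the union of their IDs.

    Only IDs that occur in at least one of the two vectors are kept, so the
    aligned dimension is at most len(doc) + len(material) instead of W.
    IDs absent from one vector get feature value 0 in that vector.
    """
    ids = sorted(set(doc_sparse) | set(material_sparse), reverse=descending)
    doc_vec = [doc_sparse.get(i, 0) for i in ids]
    material_vec = [material_sparse.get(i, 0) for i in ids]
    return ids, doc_vec, material_vec


# Hypothetical sample data.
ids, doc_vec, material_vec = align_on_union({2: 3, 5: 1}, {1: 4, 5: 2})
```

The trade-off is that the aligned dimension now depends on the pair being compared, so the alignment has to be recomputed for every material.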
Depending on the user's access mode, materials from different sub-libraries of the comparison library are provided for similarity comparison. The comparison is performed by traversal: the feature vectors of all materials within the selected scope are extracted and compared for similarity with the document to be identified. Each calculated similarity value is compared with a predetermined threshold, and when the similarity value is higher than the threshold, the corresponding material is recorded as a suspect material for later use.
After the document to be identified has been compared with all materials, all suspect materials are extracted and the document to be identified is compared further with each of them.
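The traversal and threshold screening described above might be sketched as follows. The similarity formula itself is not reproduced in this text, so the sketch takes the similarity function as a parameter; cosine similarity is used as the default purely as an assumption for illustration, and the names `cosine` and `screen_suspects` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two aligned dense vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def screen_suspects(doc_vec, materials, threshold, similarity=cosine):
    """Traverse all materials and keep those scoring above the threshold.

    `materials` maps a material identifier to its aligned feature vector.
    Returns a list of (identifier, similarity) pairs for suspect materials.
    """
    suspects = []
    for material_id, vec in materials.items():
        score = similarity(doc_vec, vec)
        if score > threshold:
            suspects.append((material_id, score))
    return suspects


# Hypothetical sample data.
suspects = screen_suspects([1, 0, 1], {"a": [1, 0, 1], "b": [0, 1, 0]}, 0.8)
```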
According to a preferred embodiment of the invention, all materials in the proverb and common saying library, the famous quotation library, and the poetry library may be selected as suspect materials.
According to a preferred embodiment of the invention, materials whose word-segment free vector dimension WFV is smaller than the word-segment simplified vector dimension RWV may be selected as suspect materials.
According to a preferred embodiment of the invention, materials whose word-segment group free vector dimension WGFV is smaller than the word-segment group simplified vector dimension RWGV may be selected as suspect materials.
According to a preferred embodiment of the invention, materials whose Chinese-foreign word-segment group free vector dimension WFGFV is smaller than the Chinese-foreign word-segment group simplified vector dimension RWFGV may be selected as suspect materials.
According to a preferred embodiment of the invention, suspect materials may further be selected by means of the word-segment tightness coefficient.
According to a specific embodiment of the present invention, in the common plagiarism identification mode suspect materials can be screened according to the word-segment tightness coefficient of the document to be identified and the word-segment tightness coefficient of the material. The to-be-identified document tightness coefficient statistics module extracts high-density word segments and their positions according to the word-segment tightness coefficient feature vector corresponding to each word segment in the document to be identified, WGCVE_TBI = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)]. Using the part of speech W_CHAR in the tightness coefficient feature vector, the module selects the word segments that are content words and counts the total number of intervening word segments over a predetermined number n of adjacent occurrences. When this total is less than the predetermined tightness threshold TH_G, the ID of the word segment and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the extended plagiarism identification mode suspect materials can be screened according to the word-segment group tightness coefficient of the document to be identified and the word-segment group tightness coefficient of the material. The to-be-identified document tightness coefficient statistics module extracts high-density word-segment groups and their positions according to the word-segment group tightness coefficient feature vector corresponding to each word-segment group in the document to be identified, WGGCVE_TBI = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)]. Using the part of speech WG_CHAR in the tightness coefficient feature vector, the module selects the word-segment groups that are content words and counts the total number of intervening word segments over a predetermined number n of adjacent occurrences of the group. When this total is less than the predetermined tightness threshold TH_G, the ID of the word-segment group and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the multilingual plagiarism identification mode suspect materials can be screened according to the Chinese-foreign word-segment group tightness coefficient of the document to be identified and the Chinese-foreign word-segment group tightness coefficient of the material. The to-be-identified document tightness coefficient statistics module extracts high-density Chinese-foreign word-segment groups and their positions according to the tightness coefficient feature vector corresponding to each Chinese-foreign word-segment group in the document to be identified, WFGGCVE_TBI = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)]. Using the part of speech WFG_CHAR in the tightness coefficient feature vector, the module selects the Chinese-foreign word-segment groups that are content words and counts the total number of intervening word segments over a predetermined number n of adjacent occurrences of the group. When this total is less than the predetermined tightness threshold TH_G, the ID of the Chinese-foreign word-segment group and the corresponding position are recorded.
The predetermined adjacent number n and the tightness threshold TH_G are preset by the system and can be adjusted as needed. When the total number of intervening word segments over the predetermined number of adjacent occurrences is less than TH_G, the content word can be considered to occur densely at that position, possibly because a particular viewpoint is being elaborated there, so the position deserves close attention.
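The tightness check can be illustrated with a short Python sketch. The summation formula is not reproduced in this text, so the sketch assumes that G_W_ID_i is the number of word segments between the i-th and (i+1)-th occurrence of one content word and that the statistic is the sum of n consecutive gaps; the function name `dense_positions` and the sample gaps are hypothetical.

```python
def dense_positions(gaps, n, th_g):
    """Return the start indices of every run of n adjacent gaps whose
    total is below the tightness threshold th_g.

    `gaps` is the list [G_1, ..., G_(N-1)] of interval counts between
    consecutive occurrences of one content word (assumed interpretation).
    """
    hits = []
    for start in range(len(gaps) - n + 1):
        total = sum(gaps[start:start + n])
        if total < th_g:
            hits.append(start)
    return hits


# Hypothetical sample: the word clusters around its 2nd to 5th occurrence.
hits = dense_positions([10, 2, 1, 3, 40], n=3, th_g=10)
```

A non-empty result means the word occurs densely somewhere in the text even if its overall frequency is low.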
In the common plagiarism identification mode, the tightness coefficient suspect material extraction module takes the word-segment IDs that were recorded when the total number of intervening word segments over the predetermined number of adjacent occurrences was less than the predetermined tightness threshold TH_G, and extracts all materials in the comparison library that contain those word-segment IDs. For each such material it calculates the word-segment tightness coefficient feature vector corresponding to the word-segment ID, WGCVE = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)], and counts the total number of intervening word segments over the predetermined number n of adjacent occurrences. When this total is less than the predetermined tightness threshold TH_G, the material is selected as a suspect material. One or more word-segment IDs may be recorded, and accordingly one or more materials containing those word-segment IDs may be extracted.
In the extended plagiarism identification mode, the tightness coefficient suspect material extraction module takes the word-segment group IDs that were recorded when the total number of intervening word segments over the predetermined number of adjacent occurrences was less than the predetermined tightness threshold TH_G, and extracts all materials in the comparison library that contain those word-segment group IDs. For each such material it calculates the word-segment group tightness coefficient feature vector corresponding to the word-segment group ID, WGGCVE = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)], and counts the total number of intervening word segments over the predetermined number n of adjacent occurrences of the group. When this total is less than the predetermined tightness threshold TH_G, the material is selected as a suspect material. One or more word-segment group IDs may be recorded, and accordingly one or more materials containing those word-segment group IDs may be extracted.
In the multilingual plagiarism identification mode, the tightness coefficient suspect material extraction module takes the Chinese-foreign word-segment group IDs that were recorded when the total number of intervening word segments over the predetermined number of adjacent occurrences was less than the predetermined tightness threshold TH_G, and extracts all materials in the comparison library that contain those Chinese-foreign word-segment group IDs. For each such material it calculates the Chinese-foreign word-segment group tightness coefficient feature vector corresponding to the group ID, WFGGCVE = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)], and counts the total number of intervening word segments over the predetermined number n of adjacent occurrences of the group. When this total is less than the predetermined tightness threshold TH_G, the material is selected as a suspect material. One or more Chinese-foreign word-segment group IDs may be recorded, and accordingly one or more materials containing those group IDs may be extracted.
With this extraction method, content words that do not occur often in the document to be identified overall, but that may be concentrated at certain positions, can be extracted together with their positions for further comparison.
According to a specific embodiment of the present invention, the formula plagiarism identification mode uses three modules. The formula extraction module extracts the formulas in the document to be identified. The formula decomposition module extracts, for each formula, its independent variable parameters and dependent variable parameters, its operators, and the specific meaning, dimension, and value range of each parameter. The formula comparison module compares these items, extracted from each formula in the document to be identified, one by one with the independent variable parameters, dependent variable parameters, operators, and the specific meaning, dimension, and value range of each parameter of the formulas stored in the formula library. When the degree of coincidence between the independent variable parameters, dependent variable parameters, operators, dimensions, and value ranges of a formula in the document to be identified and those of a formula stored in the formula library exceeds the formula comparison threshold TH_MATH, the material associated with the formula currently being compared in the formula library is taken as a suspect material. The degree of coincidence is the ratio of the total number of identical independent variable parameters, dependent variable parameters, operators, and dimensions shared by the formula in the document to be identified and the formula in the formula library, to the total number of independent variable parameters, dependent variable parameters, operators, and dimensions of the current formula in the document to be identified.
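The degree of coincidence defined above can be sketched as follows. The sketch assumes that each decomposed formula is held as a dictionary of lists and that repeated items, such as an operator used twice, are counted with multiplicity; the field names and the function name `coincidence` are hypothetical.

```python
from collections import Counter

# Hypothetical field names for the parts of a decomposed formula.
FIELDS = ("independent", "dependent", "operators", "dimensions")


def coincidence(doc_formula, library_formula):
    """Ratio of items shared with the library formula to the number of
    items in the document formula, summed over all four fields."""
    shared = 0
    total = 0
    for field in FIELDS:
        doc_items = Counter(doc_formula[field])
        lib_items = Counter(library_formula[field])
        shared += sum((doc_items & lib_items).values())  # multiset overlap
        total += sum(doc_items.values())
    return shared / total if total else 0.0


# Hypothetical sample formulas: F = m * a versus p = m * v.
doc = {"independent": ["m", "a"], "dependent": ["F"],
       "operators": ["=", "*"], "dimensions": ["kg", "m/s^2", "N"]}
lib = {"independent": ["m", "v"], "dependent": ["p"],
       "operators": ["=", "*"], "dimensions": ["kg", "m/s", "kg*m/s"]}
ratio = coincidence(doc, lib)
```

The resulting ratio would then be compared with TH_MATH to decide whether the material associated with the library formula becomes a suspect material.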
According to a specific embodiment of the present invention, the document to be identified and a suspect material can be compared in full using sliding windows. The size of the sliding window can be configured by the system. It directly affects the comparison result: a window that is too small easily causes false positives, and a window that is too large easily causes misses. The sliding step of the window is also preset by the system. As shown in Fig. 2, the comparison proceeds as follows:
Step S0: Start.
Step S1: The sliding window setup module initializes the similar-window counter CT1 = 0 and the sliding-step counter CT2 = 0.
Step S2: The sliding window setup module places the sliding windows of the document to be identified and of the suspect material at their respective document start positions.
Step S3: The sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspect material and counts the identical content-word segments they contain.
Step S4: The sliding window comparison module judges whether the number of identical content-word segments is greater than or equal to the threshold TH_W. If it is, the counter is incremented, that is, CT1 = CT1 + 1, and the current positions of the two sliding windows and the content inside them are recorded.
Step S5: The sliding window setup module slides the window of the suspect material by one sliding step.
Step S6: The sliding window setup module judges whether the window of the suspect material is at the document end position. If not, return to step S3; if so, go to step S11.
Step S11: The sliding window setup module judges whether the sliding window of the document to be identified is at the document end position. If not, go to step S12; if so, go to step S13.
Step S12: The sliding window setup module returns the sliding window of the suspect material to the document start position, slides the window of the document to be identified by one sliding step, sets CT2 = CT2 + 1, and goes to step S3.
Step S13: The sliding window comparison module calculates the ratio M of the similar-window counter CT1 to the sliding-step counter CT2.
Step S14: The sliding window comparison module judges whether the ratio M is greater than or equal to the predetermined threshold TH_M. When M >= TH_M, the document to be identified is considered similar to the suspect material; when M < TH_M, the two are considered dissimilar.
Step S15: The sliding window comparison module judges whether any suspect material remains to be compared. If so, return to step S1; if not, go to step S16.
Step S16: The comparison report generation module generates and outputs a comparison report. For the identified document and every similar suspect material, the report contains the value of the similar-window counter CT1, the value of the sliding-step counter CT2, and their ratio, together with the specific positions and specific content of the similar parts.
Step S17: The comparison ends.
According to a specific embodiment of the present invention, in step S3 the sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspect material and counts the identical content-word segments they contain. In the common plagiarism identification mode, two content-word segments are identical when they have the same ID in the word-segment library. In the extended plagiarism identification mode, they are identical when the content-word segment groups have the same ID in the word-segment library. In the multilingual plagiarism identification mode, they are identical when the content-word Chinese-foreign word-segment groups have the same ID in the word-segment library.
According to a specific embodiment of the present invention, in step S16 the content of the comparison report output by the comparison report generation module differs with the identification mode. In the common plagiarism identification mode, the comparison report contains the specific positions and specific content of the parts of the document to be identified that are similar to the similar suspect material. Here the document to be identified uses the same form of expression as the similar part of the suspect material and the same wording, possibly with only the order of individual words adjusted. If the document under identification rewrites the document it plagiarizes, and the rewriting is extensive, the common plagiarism identification mode may fail to find the plagiarized document. In the extended plagiarism identification mode, the comparison report likewise contains the specific positions and specific content of the similar parts. If the document under identification rewrites the plagiarized document with synonyms or near-synonyms but changes the document structure only slightly, the extended plagiarism identification mode may still find the plagiarized document. In the multilingual plagiarism identification mode, the comparison report again contains the specific positions and specific content of the similar parts. If the document under identification is a translated rewrite of the plagiarized document and the document structure has changed only slightly, the multilingual plagiarism identification mode may still find the plagiarized document.
According to a specific embodiment of the present invention, a sliding window is at the document start position when the leftmost edge of the window coincides with the start of the document, and at the document end position when the rightmost edge of the window coincides with the end of the document.
According to preliminary test runs of the system, a sliding window of four content-word segments is a suitable size, although other sizes can be selected as needed. During comparison the sliding window advances by one content-word segment at a time. When three or more content-word segments in the two sliding windows are identical (regardless of the order in which they occur), the current position and content of the sliding windows in the document to be identified and in the suspect material are recorded.
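The sliding-window comparison can be sketched in Python with the defaults mentioned above (window of four content-word segments, step of one, at least three identical segments). The sketch is a simplification under stated assumptions: both inputs are lists already reduced to content-word IDs, every window position of the suspect material is compared, and CT2 is clamped to 1 when the document has only a single window position so that the ratio is defined. The function names are hypothetical.

```python
from collections import Counter


def window_starts(length, size, step):
    """Start offsets of every window from the document start to its end."""
    last = max(length - size, 0)
    return range(0, last + 1, step)


def sliding_window_compare(doc_words, mat_words, size=4, step=1, th_w=3):
    """Return (CT1, CT2, M, hits) for one document / suspect-material pair."""
    ct1 = 0  # similar-window counter CT1
    ct2 = 0  # sliding-step counter CT2 (slides of the document window)
    hits = []  # (document offset, material offset) of each similar pair
    for n, i in enumerate(window_starts(len(doc_words), size, step)):
        if n > 0:
            ct2 += 1  # step S12: the document window moved one step
        doc_win = Counter(doc_words[i:i + size])
        for j in window_starts(len(mat_words), size, step):
            mat_win = Counter(mat_words[j:j + size])
            # Identical segments regardless of order: multiset overlap.
            shared = sum((doc_win & mat_win).values())
            if shared >= th_w:
                ct1 += 1
                hits.append((i, j))
    ratio = ct1 / max(ct2, 1)  # assumed guard against CT2 == 0
    return ct1, ct2, ratio, hits


# Hypothetical sample data: letters stand in for content-word IDs.
ct1, ct2, ratio, hits = sliding_window_compare(
    ["a", "b", "c", "d", "e"], ["x", "b", "c", "d", "y"])
```

The returned ratio corresponds to M in step S13 and would be compared with TH_M, while `hits` gives the window positions to list in the comparison report.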
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been disclosed above by way of a preferred embodiment, that embodiment is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or modification made to the above embodiment according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.

Claims (10)

  1. A paper self-checking system, characterized by comprising a user detection mode determination module and a user writing style test module, wherein
    the user detection mode determination module is used to determine that the current user detection mode is the self-audit mode;
    the user writing style test module provides the user with one or more test pictures, and the user, within a specified time, writes online a text description of no fewer than a prescribed number of words for each test picture; each test picture has a test picture reference feature vector;
    the test picture reference feature vector is obtained as follows: a predetermined number of benchmark testers are randomly selected from populations of different backgrounds; each of them describes a specific test picture in no fewer than the prescribed number of words; all text descriptions are collected; the test picture text description feature values for the same test picture are computed; feature vectors are calculated from the test picture text description feature values; and the feature vectors are weighted to obtain the test picture reference feature vector of the specific test picture; the weights used in the weighting operation are set by the system;
    the test picture text description feature value generation module obtains the test picture description texts of the benchmark testers and generates the test picture text description feature values;
    the test picture text description feature vector generation module generates the test picture text description feature vector from the test picture text description feature values; when the dimension of the test picture text description feature vector is n, it is expressed as TPCVE = [TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the first item value in the test picture text description feature vector, TPC_m is the m-th item value, and TPC_n is the n-th item value;
    the test picture reference feature vector generation module collects the test picture text description feature vectors for the same test picture and weights them to obtain the reference feature vector of the specific test picture; the weights used in the weighting operation are set by the system;
    the reference feature vector of the specific test picture is expressed as:
    where TPCVE_ID denotes the test picture reference feature vector numbered ID; k is the number of benchmark testers; TPC_1_i denotes the first item value of the feature vector of the i-th benchmark tester; TPC_m_i denotes the m-th item value of the feature vector of the i-th benchmark tester; TPC_n_i denotes the n-th item value of the feature vector of the i-th benchmark tester; W_1,i is the weight coefficient of TPC_1_i; W_m,i is the weight coefficient of TPC_m_i; W_n,i is the weight coefficient of TPC_n_i;
    the user test picture text description feature value generation module obtains the user's test picture description text and generates the user test picture text description feature values;
    the user test picture text description feature vector generation module calculates the user test picture text description feature vector from the user test picture text description feature values; when the dimension of the test picture text description feature vector is n, the feature vector of the text description by the current user USER of the picture numbered ID is expressed as TPCVE_ID_USER = [TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER]; the user picture writing style feature vector generation module calculates the difference between the user test picture text description feature vector TPCVE_ID_USER and the test picture reference feature vector TPCVE_ID corresponding to that test picture, and takes the difference TPCVE_ID_USER - TPCVE_ID as the user picture writing style feature vector TPCVE_USER.
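For illustration, the weighting and difference steps of claim 1 might be sketched as follows. The reference-vector formula is not reproduced in this text, so the sketch assumes a plain weighted sum over the k benchmark testers for each component, with weights chosen by the system; any normalization is an open assumption, and the function names are hypothetical.

```python
def reference_vector(tester_vectors, weights):
    """Weighted combination of the k benchmark testers' feature vectors.

    tester_vectors[i][m] plays the role of TPC_m_i and weights[i][m] the
    role of W_m,i. Assumption: component m is sum_i W_m,i * TPC_m_i.
    """
    n = len(tester_vectors[0])
    return [
        sum(w[m] * v[m] for v, w in zip(tester_vectors, weights))
        for m in range(n)
    ]


def style_vector(user_vector, ref_vector):
    """User picture writing style vector: TPCVE_ID_USER - TPCVE_ID."""
    return [u - r for u, r in zip(user_vector, ref_vector)]


# Hypothetical sample data: two testers, two features, equal weights.
ref = reference_vector([[0.5, 0.25], [0.25, 0.75]], [[0.5, 0.5], [0.5, 0.5]])
style = style_vector([0.5, 0.25], ref)
```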
  2. The paper self-checking system according to claim 1,
    wherein TPC_1_USER is the first item value in the user test picture text description feature vector of the current user USER, TPC_m_USER is the m-th item value in that vector, and TPC_n_USER is the n-th item value in that vector.
  3. The paper self-checking system according to claim 1 or 2, wherein the test picture text description feature vector includes one or more of the following items: the ratio of the number of Chinese words to the total number of words, the ratio of the number of foreign-language words to the total number of words, the ratio of the number of content words to the total number of words, the ratio of the number of function words to the total number of words, the ratio of the total number of words to the number of paragraphs, the number of words in the longest paragraph, the ratio of the number of synonym and near-synonym expansions to the total number of words, the ratio of the number of punctuation marks used to the total number of words, the ratio of the number of nouns to the total number of words, the ratio of the number of verbs to the total number of words, the ratio of the number of adjectives to the total number of words, the ratio of the number of numerals to the total number of words, the ratio of the number of measure words to the total number of words, the ratio of the number of pronouns to the total number of words, the ratio of the number of adverbs to the total number of words, the ratio of the number of prepositions to the total number of words, the ratio of the number of conjunctions to the total number of words, the ratio of the number of auxiliary words to the total number of words, the ratio of the number of interjections to the total number of words, and the ratio of the number of onomatopoeic words to the total number of words.
  4. The paper self-checking system according to claim 3, wherein the user detection mode determination module is further configured to prompt the user to upload a document under review; a pending-document feature value generation module generates the feature values of the document under review; a pending-document feature vector generation module generates the pending-document feature vector from those feature values; and the dimension of the pending-document feature vector, together with the meaning and order of each of its entries, must be consistent with the dimension and the meaning and order of the entries of the test picture reference feature vector and the test article reference feature vector.
  5. The paper self-checking system according to claim 4, wherein a user writing-style similarity calculation module computes the current user's writing-style similarity by the following formula:
    A user writing-style similarity judgment module compares the current user's writing-style similarity SimT(USER) with the self-audit threshold preset by the system. When SimT(USER) is higher than the self-audit threshold, the document under review submitted by the current user is deemed inconsistent with the user's writing style; when SimT(USER) is lower than the self-audit threshold, the document under review is deemed consistent with the user's writing style.
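The comparison described in claim 5 can be sketched as follows. The formula for SimT(USER) is not reproduced in this text, so the sketch takes the similarity as an already computed number. The function name is hypothetical, and the handling of exact equality with the threshold is an assumption, since the claim only covers the "higher" and "lower" cases.

```python
def is_style_consistent(similarity: float, self_audit_threshold: float) -> bool:
    """Return True when the document is treated as consistent with the user's style.

    Follows the rule as stated in the claim: a value above the threshold
    means "inconsistent", a value below it means "consistent".
    Assumption of this sketch: equality is treated as consistent.
    """
    return similarity <= self_audit_threshold


# Example with made-up numbers (0.5 is an illustrative threshold, not a
# value taken from the patent).
print(is_style_consistent(0.2, 0.5))  # below the threshold: consistent
print(is_style_consistent(0.8, 0.5))  # above the threshold: inconsistent
```

Note that, read literally, a larger SimT(USER) signals a larger departure from the user's style, so the value behaves like a distance in this rule.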
  6. A paper self-checking method, characterized by comprising:
    a user detection mode determination module determines that the current user's detection mode is the self-audit mode;
    a user writing-style test module provides the user with one or more test pictures, and the user writes online, within a specified time, a text description of no fewer than a specified number of words for each test picture; wherein every test picture has a test picture reference feature vector;
    the test picture reference feature vector is obtained as follows: a predetermined number of benchmark testers are randomly selected from populations with different backgrounds; each tester writes a description of no fewer than the specified number of words for a specific test picture; all the text descriptions are collected; the test picture text-description feature values for the same test picture are computed; feature vectors are calculated from those feature values; and the feature vectors are weighted to obtain the test picture reference feature vector of the specific test picture; the weights used in the weighting operation are set by the system;
    a test picture text-description feature value generation module obtains the benchmark testers' descriptive text for the test picture and generates the test picture text-description feature values;
    the test picture text-description feature value generation module generates the test picture text-description feature vector from the test picture text-description feature values; when the dimension of the test picture text-description feature vector is n, it is expressed as TPCVE = [TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the first entry of the test picture text-description feature vector, TPC_m is its m-th entry, and TPC_n is its n-th entry;
    a test picture reference feature vector generation module collects the test picture text-description feature vectors for the same test picture and weights them to obtain the reference feature vector of the specific test picture; the weights used in the weighting operation are set by the system;
    the reference feature vector of a specific test picture is expressed as:
    where TPCVE_ID denotes the test picture reference feature vector numbered ID; k is the number of benchmark testers; TPC_1_i denotes the first entry of the feature vector of the i-th benchmark tester; TPC_m_i denotes the m-th entry of the feature vector of the i-th benchmark tester; TPC_n_i denotes the n-th entry of the feature vector of the i-th benchmark tester; W_1,i is the weighting coefficient of TPC_1_i; W_m,i is the weighting coefficient of TPC_m_i; W_n,i is the weighting coefficient of TPC_n_i;
    a user test picture text-description feature value generation module obtains the user's descriptive text for the test picture and generates the user test picture text-description feature values;
    a user test picture text-description feature vector generation module computes the user test picture text-description feature vector from the user test picture text-description feature values; when the dimension of the test picture text-description feature vector is n, the feature vector of the current user USER's text description of the picture numbered ID is expressed as TPCVE_ID_USER = [TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER]; a user picture writing-style feature vector generation module computes the difference between the user test picture text-description feature vector TPCVE_ID_USER and the test picture reference feature vector TPCVE_ID corresponding to that test picture, and takes the difference TPCVE_ID_USER - TPCVE_ID as the user picture writing-style feature vector TPCVE_USER.
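The weighting and differencing steps of claim 6 can be illustrated with a minimal sketch. The formula for TPCVE_ID is not reproduced in this text, so the sketch assumes that each entry is a plain weighted sum over the k benchmark testers (entry m is the sum over i of the weight for tester i times that tester's m-th entry); whether the original also normalizes the sum is not stated. The function names `reference_vector` and `style_vector` are hypothetical.

```python
from typing import List, Sequence


def reference_vector(tester_vectors: Sequence[Sequence[float]],
                     weights: Sequence[Sequence[float]]) -> List[float]:
    """Combine k tester vectors of dimension n into one reference vector.

    weights[i][m] is the coefficient applied to entry m of tester i's vector.
    Assumption: entries are combined as a weighted sum over testers.
    """
    if not tester_vectors:
        raise ValueError("at least one benchmark tester vector is required")
    if len(weights) != len(tester_vectors):
        raise ValueError("one weight row is required per tester")
    n = len(tester_vectors[0])
    for vec, row in zip(tester_vectors, weights):
        if len(vec) != n or len(row) != n:
            raise ValueError("all vectors and weight rows must have dimension n")
    return [sum(row[m] * vec[m] for vec, row in zip(tester_vectors, weights))
            for m in range(n)]


def style_vector(user_vector: Sequence[float],
                 reference: Sequence[float]) -> List[float]:
    """Entry-wise difference TPCVE_ID_USER - TPCVE_ID."""
    if len(user_vector) != len(reference):
        raise ValueError("user and reference vectors must have the same dimension")
    return [u - r for u, r in zip(user_vector, reference)]


# Two testers, two features, equal weights (all numbers are made up).
ref = reference_vector([[0.25, 0.5], [0.75, 0.5]], [[0.5, 0.5], [0.5, 0.5]])
print(ref)                              # [0.5, 0.5]
print(style_vector([0.75, 0.25], ref))  # [0.25, -0.25]
```

The dimension checks mirror the requirement that user and reference vectors share the same dimension n and entry order.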
  7. The paper self-checking method according to claim 6,
    wherein TPC_1_USER is the first entry of the current user USER's user test picture text-description feature vector, TPC_m_USER is the m-th entry of that vector, and TPC_n_USER is the n-th entry of that vector.
  8. The paper self-checking method according to claim 6 or 7, wherein
    the test picture text-description feature vector includes one or more of the following: the ratio of the Chinese character count to the total word count; the ratio of the foreign-language word count to the total word count; the ratio of the content word count to the total word count; the ratio of the function word count to the total word count; the ratio of the total word count to the number of paragraphs; the word count of the longest paragraph; the ratio of the number of synonym and near-synonym expansions to the total word count; the ratio of the number of punctuation marks used to the total word count; and the ratios to the total word count of the number of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words, respectively.
  9. The paper self-checking method according to claim 8, wherein
    the user detection mode determination module is further configured to prompt the user to upload a document under review; a pending-document feature value generation module generates the feature values of the document under review; a pending-document feature vector generation module generates the pending-document feature vector from those feature values; and the dimension of the pending-document feature vector, together with the meaning and order of each of its entries, must be consistent with the dimension and the meaning and order of the entries of the test picture reference feature vector and the test article reference feature vector.
  10. The paper self-checking method according to claim 9, wherein a user writing-style similarity calculation module computes the current user's writing-style similarity by the following formula:
    A user writing-style similarity judgment module compares the current user's writing-style similarity SimT(USER) with the self-audit threshold preset by the system. When SimT(USER) is higher than the self-audit threshold, the document under review submitted by the current user is deemed inconsistent with the user's writing style; when SimT(USER) is lower than the self-audit threshold, the document under review is deemed consistent with the user's writing style.
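A few of the ratio features listed in claim 8 could be computed along the following lines. This is only a sketch under stated assumptions: the input is taken to be already segmented and part-of-speech tagged, the tag names are hypothetical, punctuation is assumed not to count toward the total word count, and only a handful of the listed features are included. Building every vector from one fixed feature order is one way to meet the requirement in claim 9 that dimension, meaning, and order of entries stay consistent.

```python
from typing import List, Sequence, Tuple

# Hypothetical tag names; a real segmenter or tagger would define its own.
POS_FEATURES = ("noun", "verb", "adjective", "adverb")
PUNCT_TAG = "punct"


def feature_vector(tokens: Sequence[Tuple[str, str]],
                   paragraph_count: int) -> List[float]:
    """Build a small style feature vector from (word, tag) pairs.

    Entry order: one ratio per tag in POS_FEATURES, then the punctuation
    ratio, then the ratio of total word count to paragraph count.
    """
    words = [tok for tok in tokens if tok[1] != PUNCT_TAG]
    total = len(words)
    if total == 0 or paragraph_count <= 0:
        raise ValueError("need at least one word and one paragraph")
    punct_count = len(tokens) - total
    vec = [sum(1 for _, tag in words if tag == name) / total
           for name in POS_FEATURES]
    vec.append(punct_count / total)      # punctuation marks per word
    vec.append(total / paragraph_count)  # words per paragraph
    return vec


tokens = [("cat", "noun"), ("sits", "verb"), ("quietly", "adverb"),
          ("mat", "noun"), (".", PUNCT_TAG)]
print(feature_vector(tokens, paragraph_count=1))
# [0.5, 0.25, 0.0, 0.25, 0.25, 4.0]
```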
CN201610021493.6A 2016-01-13 2016-01-13 A kind of paper self checking method and system Active CN105677641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610021493.6A CN105677641B (en) 2016-01-13 2016-01-13 A kind of paper self checking method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610021493.6A CN105677641B (en) 2016-01-13 2016-01-13 A kind of paper self checking method and system

Publications (2)

Publication Number Publication Date
CN105677641A CN105677641A (en) 2016-06-15
CN105677641B true CN105677641B (en) 2018-03-16

Family

ID=56300443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610021493.6A Active CN105677641B (en) 2016-01-13 2016-01-13 A kind of paper self checking method and system

Country Status (1)

Country Link
CN (1) CN105677641B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250491A (en) * 2016-08-01 2016-12-21 北京金和网络股份有限公司 The method of article automatization examination & verification and system thereof
CN110008333A (en) * 2019-04-16 2019-07-12 中国农业科学院农田灌溉研究所 A kind of paper preliminary inquiry evaluation method
CN110472228B (en) * 2019-07-10 2023-04-07 哈尔滨工程大学 Crack detection method based on author writing style

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103257957A (en) * 2012-02-15 2013-08-21 深圳市腾讯计算机系统有限公司 Chinese word segmentation based text similarity identifying method and device
CN104239285A (en) * 2013-06-06 2014-12-24 腾讯科技(深圳)有限公司 New article chapter detecting method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040054520A1 (en) * 2002-07-05 2004-03-18 Dehlinger Peter J. Text-searching code, system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
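Application of semantic analysis in the detection of similar Chinese documents; Tan Wenrong et al.; Journal of Sichuan Normal University (Natural Science); 2010-07-31; Vol. 33, No. 4; pp. 554-558 *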

Also Published As

Publication number Publication date
CN105677641A (en) 2016-06-15

Similar Documents

Publication Publication Date Title
CN105701076A (en) Thesis plagiarism detection method and system
Jayakodi et al. An automatic classifier for exam questions in Engineering: A process for Bloom's taxonomy
Solovyev et al. Prediction of reading difficulty in Russian academic texts
BRPI0913815B1 (en) computer equipment and method for extracting terms from document data including text segments
Sheehan et al. A two-stage approach for generating unbiased estimates of text complexity
CN105701085A (en) Network duplicate checking method and system
CN105677641B (en) A kind of paper self checking method and system
Chamberlain et al. Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference.
Ronan et al. Determining light verb constructions in contemporary British and Irish English
CN110472203A (en) A kind of duplicate checking detection method, device, equipment and the storage medium of article
Argamon Computational forensic authorship analysis: Promises and pitfalls
CN105701086A (en) Method and system for detecting literature through sliding window
Wadud et al. Text coherence analysis based on misspelling oblivious word embeddings and deep neural network
Rahman et al. NLP-based automatic answer script evaluation
Curtotti et al. Machine learning for readability of legislative sentences
Yan et al. On the robustness of reading comprehension models to entity renaming
CN105550172B (en) A kind of distributed text detection method and system
CN105701213B (en) A kind of document control methods and system
Taerungruang et al. Constructing an Academic Thai Plagiarism Corpus for Benchmarking Plagiarism Detection Systems.
Bian et al. Detecting spam game reviews on steam with a semi-supervised approach
Wieling et al. Hierarchical spectral partitioning of bipartite graphs to cluster dialects and identify distinguishing features
CN105701206B (en) A kind of document detection method and system based on sampling
Chaturvedi et al. Detecting fake news using machine learning algorithms
CN105701077A (en) Multi-language literature detection method and system
Shrestha Detecting Fake News with Sentiment Analysis and Network Metadata

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant