CN105677641B - Paper self-checking method and system - Google Patents
- Publication number
- CN105677641B (application CN201610021493.6A)
- Authority
- CN
- China
- Prior art keywords
- user
- word
- test
- participle
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
Abstract
The invention provides a paper self-checking method and system. A user writing style similarity calculation module calculates the writing style similarity of the current user, and a user writing style similarity judgment module compares the current user's writing style similarity SimT(USER) with a self-check threshold preset by the system. When SimT(USER) is higher than the self-check threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when SimT(USER) is lower than the self-check threshold, the pending document submitted by the current user is considered consistent with the user's writing style.
Description
Technical field
The invention belongs to the field of text detection, and in particular relates to a paper self-checking method and system.
Background
Paper plagiarism detection means judging whether a given paper is suspected of plagiarizing the text content of one or more other documents. Plagiarism is not strictly the same as copying, however: the content of other documents may also be plagiarized by various means such as semantic transformation, synonym replacement or translation of foreign-language documents.
At present there are two main paper plagiarism detection techniques: fingerprint recognition detection, and detection based on paragraph word-frequency statistics within the text. Fingerprint recognition extracts data feature strings, called fingerprints, from the submitted source text and judges from the rate of identical fingerprints whether a document plagiarizes other documents. Paragraph word-frequency statistics detection segments the submitted text into words, counts occurrence frequencies for each paragraph of the text, sets a threshold, compares each array of the text under test with each array of the query text, and finally judges from the resulting indicators whether plagiarism has occurred. These prior-art methods suffer to some degree from problems such as a low recognition rate and low efficiency.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a paper self-checking method and system.
A user writing style similarity calculation module calculates the writing style similarity of the current user, and a user writing style similarity judgment module compares the current user's writing style similarity SimT(USER) with a self-check threshold preset by the system. When SimT(USER) is higher than the self-check threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when SimT(USER) is lower than the self-check threshold, the pending document submitted by the current user is considered consistent with the user's writing style.
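The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_consistent_with_style` and the numeric values are hypothetical, and the text does not say how a value exactly equal to the threshold is treated, so the sketch assumes equality counts as consistent.

```python
def is_consistent_with_style(sim_t_user: float, self_check_threshold: float) -> bool:
    """Apply the rule as stated in the text.

    SimT(USER) above the threshold  -> pending document is inconsistent with the user's style.
    SimT(USER) below the threshold  -> pending document is consistent with the user's style.
    Equality is not covered by the text; treating it as consistent is an assumption.
    """
    return sim_t_user <= self_check_threshold


# Hypothetical values, for illustration only.
print(is_consistent_with_style(0.2, 0.5))
print(is_consistent_with_style(0.8, 0.5))
```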
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and practiced according to the content of the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 shows a block diagram of a paper self-checking system according to an embodiment of the invention;
Fig. 2 shows a sliding window detection method according to an embodiment of the invention.
Detailed description of the embodiments
To further explain the technical means adopted by the invention to achieve its intended purpose, and their effects, the embodiments, features and effects of the system and method proposed by the invention are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different references to "an embodiment" or "one embodiment" do not necessarily refer to the same embodiment. In addition, particular features, structures or characteristics of one or more embodiments may be combined in any suitable form.
As shown in Fig. 1, the paper self-checking system of the invention (hereinafter "the system") includes a material subsystem, a user subsystem, a suspect material extraction subsystem and a comparison subsystem. The material subsystem prepares the material used for comparison in plagiarism detection. The user subsystem manages user login information and determines the user's writing style. The suspect material extraction subsystem extracts from the comparison database the suspect material for the document to be identified. The comparison subsystem compares the suspect material with the document to be identified and generates a comparison report.
According to a specific embodiment of the invention, the material subsystem may further include one or more of the following: a comparison database; a word-segmentation library, which includes a synonym and near-synonym library and a Chinese-foreign thesaurus; a word segmentation module; a word-segment group module; a Chinese-foreign word-segment group module; a word-segment part-of-speech classification module; a word-segment group part-of-speech classification module; a Chinese-foreign word-segment group part-of-speech classification module; a word-segment feature value generation module; a word-segment group feature value generation module; a Chinese-foreign word-segment group feature value generation module; a word-segment tightness coefficient generation module; a word-segment group tightness coefficient generation module; a Chinese-foreign word-segment group tightness coefficient generation module; a word-segment tightness coefficient feature vector generation module; a word-segment group tightness coefficient feature vector generation module; a Chinese-foreign word-segment group tightness coefficient feature vector generation module; a word-segment free vector dimension determining module; a word-segment group free vector dimension determining module; a Chinese-foreign word-segment group free vector dimension determining module; a word-segment simplified vector dimension generation module; a word-segment group simplified vector dimension generation module; a Chinese-foreign word-segment group simplified vector dimension generation module; a word-segment feature vector generation module; a word-segment group feature vector generation module; and a Chinese-foreign word-segment group feature vector generation module.
According to a specific embodiment of the invention, the user subsystem may further include one or more of the following: a user access mode detection module; a user detection mode determining module; a user writing style test module; a test picture text description feature value generation module; a test article text description feature value generation module; a test picture text description feature vector generation module; a test article text description feature vector generation module; a test picture reference feature vector generation module; a test article reference feature vector generation module; a user test picture text description feature value generation module; a user test picture text description feature vector generation module; a user picture writing style feature vector generation module; a user test article text description feature value generation module; a user test article text description feature vector generation module; a user article writing style feature vector generation module; a user writing style feature vector generation module; a pending document feature value generation module; a pending document feature value feature vector generation module; a user writing style similarity calculation module; a user writing style judgment module; and a user writing style structural auxiliary word judgment module.
According to a specific embodiment of the invention, the suspect material extraction subsystem may further include one or more of the following, where the modules named for the document to be identified operate on that document: a word segmentation module; a word-segment group module; a Chinese-foreign word-segment group module; a word-segment part-of-speech classification module; a word-segment group part-of-speech classification module; a Chinese-foreign word-segment group part-of-speech classification module; a word-segment feature value generation module; a word-segment group feature value generation module; a Chinese-foreign word-segment group feature value generation module; a word-segment tightness coefficient generation module; a word-segment group tightness coefficient generation module; a Chinese-foreign word-segment group tightness coefficient generation module; a word-segment tightness coefficient feature vector generation module; a word-segment group tightness coefficient feature vector generation module; a Chinese-foreign word-segment group tightness coefficient feature vector generation module; a word-segment free vector dimension determining module; a word-segment group free vector dimension determining module; a Chinese-foreign word-segment group free vector dimension determining module; a word-segment simplified vector dimension generation module; a word-segment group simplified vector dimension generation module; a Chinese-foreign word-segment group simplified vector dimension generation module; a word-segment feature vector generation module; a word-segment group feature vector generation module; a Chinese-foreign word-segment group feature vector generation module (all of the foregoing for the document to be identified); a feature vector adjusting module for the document to be identified; a material feature vector adjusting module; an ordinary plagiarism identification similarity calculation module; an extended plagiarism identification similarity calculation module; a multilingual plagiarism identification similarity calculation module; a tightness coefficient statistics module for the document to be identified; a material tightness coefficient statistics module; a formula extraction module; a formula decomposition module; and a tightness coefficient suspect material extraction module.
According to a specific embodiment of the invention, the comparison subsystem may further include a sliding window setting module, a sliding window comparison module and a comparison report generation module.
According to a specific embodiment of the invention, the system includes a comparison database, which collects the material used as comparison objects. The comparison database further includes sub-libraries such as a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library and a poetry library. The book library collects publicly published books; the paper library collects journal papers, conference papers, dissertations and the like; the patent library collects patent disclosures and the like. When material is collected, its source must also be saved: for a book, the publication date, publisher, author, book number and so on; for a journal paper, the publication date, the name and issue of the journal, the author and so on; for a conference paper, the conference name, venue, date, author and so on; for a dissertation, the school, graduation date, degree level, author and so on. From the saved source information, a person skilled in the art can uniquely obtain the material. Preferably, the material collected in the comparison database is not limited to Chinese material and also includes foreign-language material. After the comparison database is established it must also be maintained, regularly or irregularly, to add new books, journal papers, conference papers, dissertations, patent disclosures and so on. The proverb and common-saying library collects sentences, phrases and similar material that circulate widely on the internet or among the public. The famous-quotation library collects famous quotations, and the poetry library collects poems, ci, songs, fu and similar material. The purpose of further establishing the proverb and common-saying library, the famous-quotation library, the poetry library and so on in the comparison database is to extend the range of comparison material beyond traditional books, papers and patent documents, making plagiarism detection more comprehensive. A person skilled in the art will appreciate that the comparison database may also include other types of material, which are not described further here.
Preferably, when collecting material, the comparison database classifies it according to the technical field of the material. According to a specific embodiment of the invention, the field identifier may use the classes of the Chinese Library Classification, which has 5 basic categories and 22 main classes and uses a mixed notation combining Chinese pinyin letters with Arabic numerals: one letter denotes a main class, the alphabetical order reflects the order of the main classes, and digits follow the letter. For example, A1 denotes the works of Marx and Engels, K6 denotes the history of Oceania, and TN denotes electronic technology and communication technology. To accommodate the development of industrial technology, the second-level classes of industrial technology use two letters. A person skilled in the art will appreciate that other classification systems may also be used to label the field of the material.
Preferably, when collecting material, the comparison database indexes each item separately by title, author, abstract and body text. An association is established among the title, author, abstract and body text of each item, so that the remaining parts of the same item can be obtained from any one of them.
Preferably, when collecting material, the comparison database extracts and copies the formulas present in the material and builds a formula library in which they are stored separately. Each formula in the formula library is associated with the material from which it was extracted, so the full text of the corresponding material can be obtained from a formula in the formula library. According to a specific embodiment of the invention, when a formula is collected, its independent variable parameters, dependent variable parameters and operators are extracted and saved separately. According to a specific embodiment of the invention, after the independent and dependent variable parameters of a formula are extracted, the concrete meaning, dimension and value range of each parameter are further extracted and saved separately. According to a specific embodiment of the invention, after the operators of a formula are extracted, each operator is further annotated with Chinese and foreign-language text. In the formula library, each collected formula therefore stores the symbolic expression of its independent and dependent variable parameters; the Chinese and foreign-language statements of the concrete meaning, dimension and value range of each independent and dependent variable; and its operators together with their Chinese and foreign-language text annotations. The purpose of further establishing the formula library in the comparison database is to extend the range of comparison material to formula comparison, making plagiarism detection more comprehensive. A person skilled in the art will appreciate that the comparison database may also extract other content from the material, such as chemical formulas and gene sequences, which are not described further here.
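A formula library entry of the kind described above might be modelled as follows. This is a minimal sketch under stated assumptions: the class names `Parameter` and `FormulaRecord`, the field names, the identifiers `F-0001` and `book-42`, and the example formula are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Parameter:
    symbol: str        # symbolic expression of the parameter
    meaning: str       # concrete meaning (the text stores Chinese and foreign-language statements)
    dimension: str     # physical dimension / unit
    value_range: str   # admissible range of values


@dataclass
class FormulaRecord:
    material_id: str               # link back to the material the formula was extracted from
    independent: List[Parameter]   # independent variable parameters
    dependent: List[Parameter]     # dependent variable parameters
    operators: Dict[str, str]      # operator symbol -> text annotation


# Hypothetical in-memory formula library keyed by a formula identifier.
formula_library: Dict[str, FormulaRecord] = {}

formula_library["F-0001"] = FormulaRecord(
    material_id="book-42",
    independent=[
        Parameter("m", "mass", "kg", "m > 0"),
        Parameter("a", "acceleration", "m/s^2", "any real value"),
    ],
    dependent=[Parameter("F", "force", "N", "any real value")],
    operators={"=": "equals", "*": "multiplied by"},
)


def material_of(formula_id: str) -> str:
    """Return the identifier of the material a stored formula was extracted from."""
    return formula_library[formula_id].material_id


print(material_of("F-0001"))
```

Keeping `material_id` on every record is what allows the full text of the source material to be retrieved from a matching formula.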
According to a specific embodiment of the invention, the comparison database is stored at different sites in a distributed manner, and when the comparison database is accessed a particular site can be chosen according to the load of the different sites. Each site counts the amount of material extracted from the comparison database during the current unit period, where the amount may be the number of items of material or the number of bytes of material, and from this obtains the average load of the site. Each site periodically reports its average load to the suspect material extraction subsystem. When the suspect material extraction subsystem needs to extract material from the comparison database in order to select suspect material, it accesses the site with the smallest average load according to the most recently reported average loads. The unit period is configured by the system and can be set to 5, 10, 30 or 60 minutes as needed. According to a specific embodiment of the invention, different sub-libraries of the comparison database may be stored at different sites in a distributed manner, and each sub-library is accessed at the site where it is stored. When the suspect material extraction subsystem needs to extract material from the comparison database in order to select suspect material, it selects the comparison sub-library to access according to the technical field or type of the material to be extracted.
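The load-based site selection can be illustrated with a short sketch. The text does not specify how the average load is computed from the per-period counts, so averaging over the most recent unit periods is an assumption here; the function names and site names are hypothetical.

```python
from typing import Dict, List


def average_load(amounts_per_period: List[float]) -> float:
    """Average amount of material extracted per unit period at one site.

    Each entry is the amount (item count or byte count) extracted during one
    unit period. Averaging over several recent periods is an assumption.
    """
    return sum(amounts_per_period) / len(amounts_per_period)


def pick_site(latest_reports: Dict[str, float]) -> str:
    """Choose the site whose most recently reported average load is smallest."""
    return min(latest_reports, key=latest_reports.get)


# Hypothetical reports received by the suspect material extraction subsystem.
reports = {
    "site-a": average_load([100, 140]),   # 120.0
    "site-b": average_load([40, 51]),     # 45.5
    "site-c": average_load([80, 80]),     # 80.0
}
print(pick_site(reports))
```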
According to a specific embodiment of the invention, the system contains a word-segmentation library, which records word segments and their parts of speech. The word-segmentation library is preset by the system and maintained regularly, for example by adding new words. Preferably, each word segment in the library is given a unique number; W_ID denotes the unique number of a word segment in the word-segmentation library. The library stores the part of speech of each word segment, such as noun, verb, adjective, numeral, measure word, pronoun, adverb, preposition, conjunction, auxiliary word, interjection and onomatopoeia. According to a specific embodiment of the invention, segmentation results are divided by part of speech into content words and function words: content words include nouns, verbs, adjectives, numerals, measure words and pronouns; function words include adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia. Preferably, the word-segmentation library further includes a synonym and near-synonym library, in which word segments with the same or similar meaning form a group and numbering is done per group. Multiple word segments with the same or similar meaning correspond to one word-segment group number; WG_ID denotes the unique number of a word-segment group in the word-segmentation library. Preferably, the word-segmentation library further includes a Chinese-foreign synonym and near-synonym library, in which Chinese and foreign-language word segments with the same or similar meaning form a group and numbering is done per group. Multiple Chinese and foreign-language word segments with the same or similar meaning correspond to one Chinese-foreign word-segment group number; WFG_ID denotes the unique number of a Chinese-foreign word-segment group in the word-segmentation library.
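The three kinds of identifiers can be pictured with a small lookup table. This is a sketch only: the class `LexiconEntry`, the sample words and all numeric identifiers are hypothetical, and a real word-segmentation library would be far larger and stored persistently.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LexiconEntry:
    w_id: int                     # W_ID: unique number of the word segment
    pos: str                      # part of speech
    wg_id: Optional[int] = None   # WG_ID: synonym / near-synonym group, if any
    wfg_id: Optional[int] = None  # WFG_ID: Chinese-foreign synonym group, if any


# Hypothetical entries: two Chinese near-synonyms and one English equivalent.
lexicon: Dict[str, LexiconEntry] = {
    "快速": LexiconEntry(w_id=1, pos="adjective", wg_id=10, wfg_id=100),
    "迅速": LexiconEntry(w_id=2, pos="adjective", wg_id=10, wfg_id=100),
    "fast": LexiconEntry(w_id=3, pos="adjective", wfg_id=100),
}

# Near-synonyms share a WG_ID; cross-language equivalents share a WFG_ID.
print(lexicon["快速"].wg_id == lexicon["迅速"].wg_id)
print(lexicon["fast"].wfg_id == lexicon["快速"].wfg_id)
```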
According to a specific embodiment of the invention, the system includes a word segmentation module, which segments each item of material into words and saves the segmentation results to the comparison database. Preferably, the word segmentation module compares the segmentation results with the parts of speech stored in the word-segmentation library to determine the part of speech of each segmentation result. Preferably, the word-segment part-of-speech classification module classifies the segmentation results according to their parts of speech.
According to a specific embodiment of the invention, the system includes a word-segment group module, which segments each item of material into words and saves the word-segment group results to the comparison database. Preferably, the word-segment group module compares the segmentation results with the parts of speech stored in the word-segmentation library to determine the part of speech of each word-segment group result. Preferably, the word-segment group part-of-speech classification module classifies the word-segment group results according to their parts of speech.
According to a specific embodiment of the invention, the system includes a Chinese-foreign word-segment group module, which segments each item of material into words and saves the Chinese-foreign word-segment group results to the comparison database. Preferably, the Chinese-foreign word-segment group module compares the Chinese and foreign-language segmentation results with the parts of speech stored in the word-segmentation library to determine the part of speech of each Chinese-foreign word-segment group result. Preferably, the Chinese-foreign word-segment group part-of-speech classification module classifies the Chinese-foreign word-segment group results according to their parts of speech.
According to a specific embodiment of the invention, the word-segment part-of-speech classification module, the word-segment group part-of-speech classification module and the Chinese-foreign word-segment group part-of-speech classification module divide the word segments, word-segment groups and Chinese-foreign word-segment groups, respectively, by part of speech into class A content words, class B content words, class C content words, class D content words and class V function words. Class A content words include nouns; class B content words include verbs and adjectives; class C content words include numerals and measure words; class D content words include pronouns; class V function words include adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia. Preferably, the word-segmentation library further divides nouns into technical terms and common nouns. According to a specific embodiment of the invention, segmentation results are then divided by part of speech into class A1 content words, class A2 content words, class B content words, class C content words, class D content words and class V function words, where class A1 includes technical-term nouns; class A2 includes common nouns; class B includes verbs and adjectives; class C includes numerals and measure words; class D includes pronouns; and class V includes adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia. A person skilled in the art may choose a different classification scheme as needed.
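The two classification schemes above amount to a fixed mapping from part of speech to class label, which can be sketched as follows. The English part-of-speech strings and the function name `classify` are illustrative choices, not names from the patent.

```python
from typing import Optional

# Part of speech -> class, following the A/B/C/D/V scheme in the text.
POS_CLASS = {
    "noun": "A",
    "verb": "B", "adjective": "B",
    "numeral": "C", "measure word": "C",
    "pronoun": "D",
    "adverb": "V", "preposition": "V", "conjunction": "V",
    "auxiliary word": "V", "interjection": "V", "onomatopoeia": "V",
}


def classify(pos: str, technical_term: Optional[bool] = None) -> str:
    """Return the class label for a part of speech.

    If technical_term is given for a noun, the refined scheme is used:
    A1 for technical terms, A2 for common nouns.
    """
    label = POS_CLASS[pos]
    if label == "A" and technical_term is not None:
        return "A1" if technical_term else "A2"
    return label


print(classify("noun"), classify("noun", True), classify("noun", False), classify("adverb"))
```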
According to a specific embodiment of the invention, the word-segment feature value generation module counts the number of times each word segment occurs in the corresponding material and generates, for each word segment, the word-segment feature value WCV = [W_ID, W_N], where W_ID is the unique number of the word segment in the word-segmentation library and W_N is the total number of times the word segment occurs in the material. Preferably, taking the part of speech of each word segment into account, the word-segment feature value generation module generates the word-segment part-of-speech feature value WCCV = [W_ID, W_N, W_CHAR], where W_CHAR is the part of speech of the word segment.
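As an illustration, WCV and WCCV can be computed from an already segmented item of material with a simple count. The sketch assumes segmentation has been done elsewhere; the function name, the sample words and the lexicon contents are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def word_feature_values(segmented: List[str], lexicon: Dict[str, Tuple[int, str]]):
    """Build WCV = [W_ID, W_N] and WCCV = [W_ID, W_N, W_CHAR] for each word segment.

    `lexicon` maps a word segment to (W_ID, part of speech). Results are listed
    in order of first occurrence in the material.
    """
    counts = Counter(segmented)
    wcv = [[lexicon[w][0], n] for w, n in counts.items()]
    wccv = [[lexicon[w][0], n, lexicon[w][1]] for w, n in counts.items()]
    return wcv, wccv


# Hypothetical segmented material and lexicon.
material = ["model", "trains", "model", "fast", "model"]
lex = {"model": (7, "noun"), "trains": (12, "verb"), "fast": (30, "adverb")}

wcv, wccv = word_feature_values(material, lex)
print(wcv)   # [[7, 3], [12, 1], [30, 1]]
print(wccv)  # [[7, 3, 'noun'], [12, 1, 'verb'], [30, 1, 'adverb']]
```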
According to a specific embodiment of the invention, the word-segment group feature value generation module counts the number of times each word-segment group occurs in the corresponding material and generates, for each word-segment group, the word-segment group feature value WGCV = [WG_ID, WG_N], where WG_ID is the unique number of the word-segment group in the word-segmentation library and WG_N is the total number of times the word-segment group occurs in the material. Preferably, taking the part of speech of each word-segment group into account, the word-segment group feature value generation module generates the word-segment group part-of-speech feature value WGCCV = [WG_ID, WG_N, WG_CHAR], where WG_CHAR is the part of speech of the word-segment group.
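The group-level counts differ from the word-level counts only in that every member of a synonym group contributes to the same WG_N. The following sketch assumes each word segment belongs to at most one group and that words without a group are skipped; both are assumptions, and all names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def group_feature_values(segmented: List[str],
                         group_of: Dict[str, int],
                         group_pos: Dict[int, str]):
    """Build WGCV = [WG_ID, WG_N] and WGCCV = [WG_ID, WG_N, WG_CHAR] per group.

    `group_of` maps a word segment to its WG_ID; `group_pos` maps a WG_ID to
    the group's part of speech. Results are sorted by WG_ID.
    """
    counts = Counter(group_of[w] for w in segmented if w in group_of)
    wgcv = [[g, n] for g, n in sorted(counts.items())]
    wgccv = [[g, n, group_pos[g]] for g, n in sorted(counts.items())]
    return wgcv, wgccv


# "quick" and "rapid" are treated as one synonym group (WG_ID 10).
material = ["quick", "rapid", "car", "quick"]
wgcv, wgccv = group_feature_values(
    material,
    group_of={"quick": 10, "rapid": 10, "car": 20},
    group_pos={10: "adjective", 20: "noun"},
)
print(wgcv)   # [[10, 3], [20, 1]]
print(wgccv)  # [[10, 3, 'adjective'], [20, 1, 'noun']]
```

The same counting applies to Chinese-foreign word-segment groups if `group_of` maps words to WFG_ID values instead.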
According to a specific embodiment of the invention, the Chinese-foreign word-segment group feature value generation module counts the number of times each Chinese-foreign word-segment group occurs in the corresponding material and generates, for each Chinese-foreign word-segment group, the feature value WFGCV = [WFG_ID, WFG_N], where WFG_ID is the unique number of the Chinese-foreign word-segment group in the word-segmentation library and WFG_N is the total number of times the Chinese-foreign word-segment group occurs in the material. Preferably, taking the part of speech of each Chinese-foreign word-segment group into account, the Chinese-foreign word-segment group feature value generation module generates the Chinese-foreign word-segment group part-of-speech feature value WFGCCV = [WFG_ID, WFG_N, WFG_CHAR], where WFG_CHAR is the part of speech of the Chinese-foreign word-segment group.
According to a specific embodiment of the invention, the word-segment tightness coefficient generation module generates word-segment tightness coefficients. A word-segment tightness coefficient is the number of word segments that lie between two adjacent occurrences of the same word segment in the whole material. According to a specific embodiment of the invention, the tightness coefficients of each word segment are expressed as WGC = [G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of word segments between the first and second occurrences of the word segment in the material, G_W_ID_2 is the number of word segments between its second and third occurrences, and G_W_ID_(W_N-1) is the number of word segments between its (W_N-1)-th and W_N-th occurrences; G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the tightness coefficients of the word segment. According to a specific embodiment of the invention, the word-segment tightness coefficient feature vector generation module generates the word-segment tightness coefficient feature vector WGCVE = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the word segment in the word-segmentation library, W_N is the total number of times the word segment occurs in the material, and W_CHAR is the part of speech of the word segment. The tightness coefficients show how a particular word segment is distributed over the corresponding material as a whole.
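A short sketch may help make the tightness coefficients concrete. It reads "the number of word segments between two adjacent occurrences" as the count of segments lying strictly between them, which is an interpretation of the text; the function names and sample data are hypothetical.

```python
from typing import List


def tightness_coefficients(segmented: List[str], word: str) -> List[int]:
    """Return [G_1, ..., G_(N-1)]: segments strictly between adjacent occurrences of `word`."""
    positions = [i for i, w in enumerate(segmented) if w == word]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


def tightness_feature_vector(segmented: List[str], word: str, w_id: int, w_char: str) -> list:
    """Return WGCVE = [W_ID, W_N, W_CHAR, G_1, ..., G_(N-1)] for one word segment."""
    gaps = tightness_coefficients(segmented, word)
    w_n = segmented.count(word)
    return [w_id, w_n, w_char, *gaps]


# "model" occurs at positions 0, 2 and 5, so the gaps are 1 and 2.
material = ["model", "is", "model", "of", "a", "model"]
print(tightness_coefficients(material, "model"))               # [1, 2]
print(tightness_feature_vector(material, "model", 7, "noun"))  # [7, 3, 'noun', 1, 2]
```

A word that occurs W_N times yields W_N-1 coefficients, so a word occurring once contributes an empty gap list.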
According to a specific embodiment of the invention, the word-segment group tightness coefficient generation module generates word-segment group tightness coefficients. A word-segment group tightness coefficient is the number of word segments that lie between two adjacent occurrences of the same word-segment group in the whole material. According to a specific embodiment of the invention, the tightness coefficients of each word-segment group are expressed as WGGC = [G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 is the number of word segments between the first and second occurrences of the word-segment group in the material, G_WG_ID_2 is the number of word segments between its second and third occurrences, and G_WG_ID_(WG_N-1) is the number of word segments between its (WG_N-1)-th and WG_N-th occurrences; G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the tightness coefficients of the word-segment group. According to a specific embodiment of the invention, the word-segment group tightness coefficient feature vector generation module generates the word-segment group tightness coefficient feature vector WGGCVE = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID is the unique number of the word-segment group in the word-segmentation library, WG_N is the total number of times the word-segment group occurs in the material, and WG_CHAR is the part of speech of the word-segment group. The word-segment group tightness coefficients show how a particular word-segment group is distributed over the corresponding material as a whole.
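At group level, any member of a synonym group counts as an occurrence of the group, so the gaps are measured between positions where any member appears. The sketch below follows that reading and again counts the segments strictly between adjacent occurrences; the function name, group mapping and sample data are hypothetical.

```python
from typing import Dict, List


def group_tightness_feature_vector(segmented: List[str],
                                   group_of: Dict[str, int],
                                   wg_id: int,
                                   wg_char: str) -> list:
    """Return WGGCVE = [WG_ID, WG_N, WG_CHAR, G_1, ..., G_(WG_N-1)] for one group.

    A position counts as an occurrence of the group when the word at that
    position maps to `wg_id` in `group_of`.
    """
    positions = [i for i, w in enumerate(segmented) if group_of.get(w) == wg_id]
    gaps = [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]
    return [wg_id, len(positions), wg_char, *gaps]


# Group 10 ("quick"/"rapid") occurs at positions 0, 2 and 5.
material = ["quick", "car", "rapid", "red", "car", "quick"]
groups = {"quick": 10, "rapid": 10, "car": 20}
print(group_tightness_feature_vector(material, groups, 10, "adjective"))  # [10, 3, 'adjective', 1, 2]
```

Supplying a mapping from words to WFG_ID values instead would give the corresponding Chinese-foreign word-segment group vector.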
According to a specific embodiment of the invention, the Chinese-foreign word-segment group tightness coefficient generation module generates Chinese-foreign word-segment group tightness coefficients. A Chinese-foreign word-segment group tightness coefficient is the number of word segments that lie between two adjacent occurrences of the same Chinese-foreign word-segment group in the whole material. According to a specific embodiment of the invention, the tightness coefficients of each Chinese-foreign word-segment group are expressed as WFGGC = [G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 is the number of word segments between the first and second occurrences of the Chinese-foreign word-segment group in the material, G_WFG_ID_2 is the number of word segments between its second and third occurrences, and G_WFG_ID_(WFG_N-1) is the number of word segments between its (WFG_N-1)-th and WFG_N-th occurrences; G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the tightness coefficients of the Chinese-foreign word-segment group. According to a specific embodiment of the invention, the Chinese-foreign word-segment group tightness coefficient feature vector generation module generates the Chinese-foreign word-segment group tightness coefficient feature vector WFGGCVE = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID is the unique number of the Chinese-foreign word-segment group in the word-segmentation library, WFG_N is the total number of times the Chinese-foreign word-segment group occurs in the material, and WFG_CHAR is the part of speech of the Chinese-foreign word-segment group. The Chinese-foreign word-segment group tightness coefficients show how a particular Chinese-foreign word-segment group is distributed over the corresponding material as a whole.
According to a specific embodiment of the present invention, the word free vector dimension determining module determines the word free vector dimension WFV from the segmentation result of the material. WFV equals the number of distinct segmented words obtained by segmenting the given material. When the material is short or yields few segmented words, the resulting WFV is small; when the material is long or yields many segmented words, the resulting WFV is large.
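As a small illustration, WFV can be computed as the count of distinct entries in the segmentation result. The sketch below assumes the material has already been segmented into a list of strings; the function name is illustrative.

```python
from typing import Sequence


def free_vector_dimension(tokens: Sequence[str]) -> int:
    """WFV: the number of distinct segmented words in one material."""
    return len(set(tokens))
```

The same count applied to a list of word groups, or of Chinese-foreign word groups, would give the corresponding free vector dimension for those units.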
According to a specific embodiment of the present invention, the word group free vector dimension determining module determines the word group free vector dimension WGFV from the segmentation result of the material. WGFV equals the number of distinct word groups obtained by segmenting the given material. When the material is short or yields few word groups, the resulting WGFV is small; when the material is long or yields many word groups, the resulting WGFV is large.
According to a specific embodiment of the present invention, the Chinese-foreign word group free vector dimension determining module determines the Chinese-foreign word group free vector dimension WFGFV from the segmentation result of the material. WFGFV equals the number of distinct Chinese-foreign word groups obtained by segmenting the given material. When the material is short or yields few Chinese-foreign word groups, the resulting WFGFV is small; when the material is long or yields many Chinese-foreign word groups, the resulting WFGFV is large.
According to a specific embodiment of the present invention, the word reduced vector dimension generation module reduces the word free vector dimension WFV of each material and generates the word reduced vector dimension RWV. RWV is specified by the system. Preferably, the system sets RWV to 500, to 800, or to 1000.
According to a specific embodiment of the present invention, the word reduced vector dimension generation module reduces the word free vector dimension WFV by equal-interval extraction. The reduction proceeds as follows. Determine whether WFV is greater than the word reduced vector dimension RWV; if so, divide WFV by the system-specified RWV and round the quotient up to obtain the reduction coefficient REDU. Then, among the feature values corresponding to WFV, extract one feature value after every REDU-1 feature values. Once all feature values have been traversed, determine whether the number of extracted feature values equals RWV. If it does, the reduction of WFV is complete. If it is smaller than RWV, compute the difference between RWV and the number of extracted feature values, randomly extract that many feature values from those not yet extracted, and the reduction of WFV is complete.
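A minimal Python sketch of equal-interval extraction follows. It reads "one feature value at an interval of REDU-1" as taking every REDU-th value starting from the first, which is an interpretation rather than something the text states; the function name, the `rng` parameter, and returning the values in their original order are also assumptions. The case WFV <= RWV is simply passed through here, since the text treats it separately.

```python
import math
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def reduce_equal_interval(features: Sequence[T], rwv: int,
                          rng: Optional[random.Random] = None) -> List[T]:
    """Reduce `features` (length WFV) to RWV values by equal-interval extraction."""
    rng = rng or random.Random()
    wfv = len(features)
    if wfv <= rwv:
        return list(features)  # nothing to reduce; handled elsewhere in the text
    redu = math.ceil(wfv / rwv)            # reduction coefficient REDU
    picked = list(range(0, wfv, redu))     # skip REDU-1 values between picks
    shortfall = rwv - len(picked)          # never negative, because REDU >= WFV/RWV
    if shortfall > 0:
        taken = set(picked)
        remaining = [i for i in range(wfv) if i not in taken]
        picked += rng.sample(remaining, shortfall)  # random top-up from unpicked values
    return [features[i] for i in sorted(picked)]
```

With 10 feature values and RWV = 4, REDU is 3 and the regular picks already number four, so no random top-up occurs; with 7 values and RWV = 5, REDU is 2, four values are picked regularly and one more is drawn at random.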
According to a specific embodiment of the present invention, the word reduced vector dimension generation module reduces the word free vector dimension WFV by part-of-speech screening. The reduction proceeds as follows. The feature values of the segmentation result are classified by the part of speech of the corresponding segmented word; according to a specific embodiment of the present invention, they are divided into class A1 content-word feature values, class A2 content-word feature values, class B content-word feature values, class C content-word feature values, class D content-word feature values, and class V function-word feature values. Feature values of content words are generally considered to contribute more to similarity comparison, and among them technical-term nouns reflect the effective content of the material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (class A1 content-word feature values), AMOUNT_A2 (class A2 content-word feature values), AMOUNT_B (class B content-word feature values), AMOUNT_C (class C content-word feature values), AMOUNT_D (class D content-word feature values), and AMOUNT_V (class V function-word feature values).
Compute RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V). If RWV_S_V is greater than 0, exit this reduction; if it equals 0, the reduction is complete; if it is less than 0, compute RWV_S_D = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D).
If RWV_S_D is greater than 0, randomly extract RWV_S_D feature values from the feature values counted by AMOUNT_V, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWV_S_C = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C).
If RWV_S_C is greater than 0, randomly extract RWV_S_C feature values from the feature values counted by AMOUNT_D, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWV_S_B = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B).
If RWV_S_B is greater than 0, randomly extract RWV_S_B feature values from the feature values counted by AMOUNT_C, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWV_S_A2 = RWV - (AMOUNT_A1 + AMOUNT_A2).
If RWV_S_A2 is greater than 0, randomly extract RWV_S_A2 feature values from the feature values counted by AMOUNT_B, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWV_S_A1 = RWV - AMOUNT_A1.
If RWV_S_A1 is greater than 0, randomly extract RWV_S_A1 feature values from the feature values counted by AMOUNT_A2, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, randomly extract RWV feature values from the feature values counted by AMOUNT_A1, and the reduction is complete.
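Read as a whole, the cascade above appears to fill the RWV positions in priority order A1, A2, B, C, D, V: each class is kept in full while it still fits, and the first class that does not fit is sampled at random to fill the remaining positions. The Python sketch below implements that reading in compact form. The dictionary layout, the function name, the `rng` parameter, and the use of `None` for the "exit" case are assumptions made for illustration, and the sketch assumes that the higher-priority classes are retained in full, which the text implies but does not state.

```python
import random
from typing import Dict, List, Optional

PRIORITY = ["A1", "A2", "B", "C", "D", "V"]  # content-word classes first, function words last


def reduce_by_pos(by_class: Dict[str, List], rwv: int,
                  rng: Optional[random.Random] = None) -> Optional[List]:
    """Part-of-speech screening: return RWV feature values, or None if too few exist."""
    rng = rng or random.Random()
    total = sum(len(by_class.get(c, [])) for c in PRIORITY)
    if rwv - total > 0:
        return None  # RWV_S_V > 0: material too short, exit this reduction
    kept: List = []
    for cls in PRIORITY:
        room = rwv - len(kept)          # positions still unfilled
        if room == 0:
            break                       # the "equals 0" branches: reduction complete
        values = by_class.get(cls, [])
        if len(values) <= room:
            kept.extend(values)         # the whole class fits
        else:
            kept.extend(rng.sample(values, room))  # random draw from this class
    return kept
```

For example, with class sizes 2, 1, 2, 0, 1, 3 and RWV = 7, the six content-word values are all kept and one function-word value is drawn at random, which matches the RWV_S_D > 0 branch.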
The case in which RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0 means that the material is short or carries little information, and is therefore not suitable for comparison by feature values.
When the word free vector dimension WFV is smaller than the word reduced vector dimension RWV, the material itself has few dimensions, and the values in the remaining dimensions are treated as 0. Such cases must be flagged directly in the system and recorded and handled separately. Folk sayings and well-known quotations are examples; they are stored as index entries for lookup. Full-text sliding-window comparison can subsequently be used for them.
According to a specific embodiment of the present invention, the word group reduced vector dimension generation module reduces the word group free vector dimension WGFV of each material and generates the word group reduced vector dimension RWGV. RWGV is specified by the system. Preferably, the system sets RWGV to 500, to 800, or to 1000.
According to a specific embodiment of the present invention, the word group reduced vector dimension generation module reduces the word group free vector dimension WGFV by equal-interval extraction. The reduction proceeds as follows. Determine whether WGFV is greater than the word group reduced vector dimension RWGV; if so, divide WGFV by the system-specified RWGV and round the quotient up to obtain the reduction coefficient REDU. Then, among the feature values corresponding to WGFV, extract one feature value after every REDU-1 feature values. Once all feature values have been traversed, determine whether the number of extracted feature values equals RWGV. If it does, the reduction of WGFV is complete. If it is smaller than RWGV, compute the difference between RWGV and the number of extracted feature values, randomly extract that many feature values from those not yet extracted, and the reduction of WGFV is complete.
According to a specific embodiment of the present invention, the word group reduced vector dimension generation module reduces the word group free vector dimension WGFV by part-of-speech screening. The reduction proceeds as follows. The feature values are classified by the part of speech of the corresponding word group; according to a specific embodiment of the present invention, they are divided into class A1 content-word feature values, class A2 content-word feature values, class B content-word feature values, class C content-word feature values, class D content-word feature values, and class V function-word feature values. Feature values of content words are generally considered to contribute more to similarity comparison, and among them technical-term nouns reflect the effective content of the material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (class A1 content-word feature values), AMOUNT_A2 (class A2 content-word feature values), AMOUNT_B (class B content-word feature values), AMOUNT_C (class C content-word feature values), AMOUNT_D (class D content-word feature values), and AMOUNT_V (class V function-word feature values).
Compute RWGV_S_V = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V). If RWGV_S_V is greater than 0, exit this reduction; if it equals 0, the reduction is complete; if it is less than 0, compute RWGV_S_D = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D).
If RWGV_S_D is greater than 0, randomly extract RWGV_S_D feature values from the feature values counted by AMOUNT_V, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWGV_S_C = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C).
If RWGV_S_C is greater than 0, randomly extract RWGV_S_C feature values from the feature values counted by AMOUNT_D, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWGV_S_B = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B).
If RWGV_S_B is greater than 0, randomly extract RWGV_S_B feature values from the feature values counted by AMOUNT_C, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWGV_S_A2 = RWGV - (AMOUNT_A1 + AMOUNT_A2).
If RWGV_S_A2 is greater than 0, randomly extract RWGV_S_A2 feature values from the feature values counted by AMOUNT_B, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWGV_S_A1 = RWGV - AMOUNT_A1.
If RWGV_S_A1 is greater than 0, randomly extract RWGV_S_A1 feature values from the feature values counted by AMOUNT_A2, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, randomly extract RWGV feature values from the feature values counted by AMOUNT_A1, and the reduction is complete.
The case in which RWGV_S_V = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0 means that the material is short or carries little information, and is therefore not suitable for comparison by feature values.
When the word group free vector dimension WGFV is smaller than the word group reduced vector dimension RWGV, the material itself has few dimensions, and the values in the remaining dimensions are treated as 0. Such cases must be flagged directly in the system and recorded and handled separately. Folk sayings and well-known quotations are examples; they are stored as index entries for lookup. Full-text sliding-window comparison can subsequently be used for them.
According to a specific embodiment of the present invention, the Chinese-foreign word group reduced vector dimension generation module reduces the Chinese-foreign word group free vector dimension WFGFV of each material and generates the Chinese-foreign word group reduced vector dimension RWFGV. RWFGV is specified by the system. Preferably, the system sets RWFGV to 500, to 800, or to 1000.
According to a specific embodiment of the present invention, the Chinese-foreign word group reduced vector dimension generation module reduces the Chinese-foreign word group free vector dimension WFGFV by equal-interval extraction. The reduction proceeds as follows. Determine whether WFGFV is greater than the Chinese-foreign word group reduced vector dimension RWFGV; if so, divide WFGFV by the system-specified RWFGV and round the quotient up to obtain the reduction coefficient REDU. Then, among the feature values corresponding to WFGFV, extract one feature value after every REDU-1 feature values. Once all feature values have been traversed, determine whether the number of extracted feature values equals RWFGV. If it does, the reduction of WFGFV is complete. If it is smaller than RWFGV, compute the difference between RWFGV and the number of extracted feature values, randomly extract that many feature values from those not yet extracted, and the reduction of WFGFV is complete.
According to a specific embodiment of the present invention, the Chinese-foreign word group reduced vector dimension generation module reduces the Chinese-foreign word group free vector dimension WFGFV by part-of-speech screening. The reduction proceeds as follows. The feature values are classified by the part of speech of the corresponding Chinese-foreign word group; according to a specific embodiment of the present invention, they are divided into class A1 content-word feature values, class A2 content-word feature values, class B content-word feature values, class C content-word feature values, class D content-word feature values, and class V function-word feature values. Feature values of content words are generally considered to contribute more to similarity comparison, and among them technical-term nouns reflect the effective content of the material better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (class A1 content-word feature values), AMOUNT_A2 (class A2 content-word feature values), AMOUNT_B (class B content-word feature values), AMOUNT_C (class C content-word feature values), AMOUNT_D (class D content-word feature values), and AMOUNT_V (class V function-word feature values).
Compute RWFGV_S_V = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V). If RWFGV_S_V is greater than 0, exit this reduction; if it equals 0, the reduction is complete; if it is less than 0, compute RWFGV_S_D = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D).
If RWFGV_S_D is greater than 0, randomly extract RWFGV_S_D feature values from the feature values counted by AMOUNT_V, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWFGV_S_C = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C).
If RWFGV_S_C is greater than 0, randomly extract RWFGV_S_C feature values from the feature values counted by AMOUNT_D, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWFGV_S_B = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B).
If RWFGV_S_B is greater than 0, randomly extract RWFGV_S_B feature values from the feature values counted by AMOUNT_C, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWFGV_S_A2 = RWFGV - (AMOUNT_A1 + AMOUNT_A2).
If RWFGV_S_A2 is greater than 0, randomly extract RWFGV_S_A2 feature values from the feature values counted by AMOUNT_B, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, compute RWFGV_S_A1 = RWFGV - AMOUNT_A1.
If RWFGV_S_A1 is greater than 0, randomly extract RWFGV_S_A1 feature values from the feature values counted by AMOUNT_A2, and the reduction is complete; if it equals 0, the reduction is complete; if it is less than 0, randomly extract RWFGV feature values from the feature values counted by AMOUNT_A1, and the reduction is complete.
The case in which RWFGV_S_V = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0 means that the material is short or carries little information, and is therefore not suitable for comparison by feature values.
When the Chinese-foreign word group free vector dimension WFGFV is smaller than the Chinese-foreign word group reduced vector dimension RWFGV, the material itself has few dimensions, and the values in the remaining dimensions are treated as 0. Such cases must be flagged directly in the system and recorded and handled separately. Folk sayings and well-known quotations are examples; they are stored as index entries for lookup. Full-text sliding-window comparison can subsequently be used for them.
According to a specific embodiment of the present invention, the word feature vector generation module extracts, from each material, the feature values corresponding to the word reduced vector dimension RWV and generates the word feature vector WVE_RWV:
WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]
where W_ID_i is the unique number of the i-th segmented word in the segmentation lexicon and W_N_i is the total number of times that segmented word occurs in the material; this count serves as the feature value of the segmented word.
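The layout of WVE_RWV, with identifiers and occurrence counts interleaved, can be sketched in Python as follows. The sketch assumes the RWV segmented words retained by the reduction step are passed in as `selected_words` and that the segmentation lexicon is available as a mapping from word to unique number; both names, and the function name, are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def word_feature_vector(tokens: Sequence[str],
                        lexicon_ids: Dict[str, int],
                        selected_words: Sequence[str]) -> List[int]:
    """Build [W_ID_1, W_N_1, ..., W_ID_RWV, W_N_RWV] for one material."""
    counts = Counter(tokens)  # occurrences of each segmented word in the material
    vector: List[int] = []
    for word in selected_words:
        vector.append(lexicon_ids[word])  # W_ID_i: unique number in the lexicon
        vector.append(counts[word])       # W_N_i: occurrence count, used as the feature value
    return vector
```

The word group and Chinese-foreign word group feature vectors follow the same pattern, with group identifiers and group counts in place of word identifiers and word counts.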
According to a specific embodiment of the present invention, the word group feature vector generation module extracts, from each material, the feature values corresponding to the word group reduced vector dimension RWGV and generates the word group feature vector WVE_RWGV:
WVE_RWGV = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV, WG_N_RWGV]
where WG_ID_i is the unique number of the i-th word group in the segmentation lexicon and WG_N_i is the total number of times that word group occurs in the material; this count serves as the feature value of the word group.
According to a specific embodiment of the present invention, the Chinese-foreign word group feature vector generation module extracts, from each material, the feature values corresponding to the Chinese-foreign word group reduced vector dimension RWFGV and generates the Chinese-foreign word group feature vector WVE_RWFGV:
WVE_RWFGV = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV, WFG_N_RWFGV]
where WFG_ID_i is the unique number of the i-th Chinese-foreign word group in the segmentation lexicon and WFG_N_i is the total number of times that Chinese-foreign word group occurs in the material; this count serves as the feature value of the Chinese-foreign word group.
According to a specific embodiment of the present invention, the system offers users several access modes. When a user accesses the system, the user access mode detection module detects the access mode of the current user.
In a specific embodiment of the present invention, a user can access the system in trial mode; a user who does so is referred to below as a trial user. When the user access mode detection module detects that a user is accessing the system in trial mode, it sends the trial user a prompt stating that the current access mode is trial mode and describing the trial user's access rights. According to one embodiment of the present invention, the system offers a trial user a trial check of only a predetermined number of characters, the number being set in advance by the system. According to another embodiment of the present invention, the system offers a trial user databases of partial or full scope for the trial check. According to a further embodiment of the present invention, the plagiarism detection result given to a trial user states only the plagiarism rate, without the specific plagiarized passages or a side-by-side comparison with the plagiarized documents. According to yet another embodiment of the present invention, the plagiarism detection result given to a trial user identifies the specific plagiarized passages but obscures the comparison with the plagiarized documents, so that the trial user can see where plagiarism occurs in the submitted document but cannot identify the plagiarized documents themselves.
According to a specific embodiment of the present invention, a user can access the system in counting mode; a user who does so is referred to below as a counting user. When the user access mode detection module detects that a user is accessing the system in counting mode, it sends the counting user a prompt stating that the current access mode is counting mode and asking the counting user to upload the document to be checked for plagiarism. According to a specific embodiment of the present invention, the system counts the characters in the document uploaded by the counting user and calculates the fee for this plagiarism check from that character count. According to another embodiment of the present invention, the system offers the counting user a choice of databases of partial or full scope and calculates the fee for this plagiarism check according to the database scope the counting user selects.
According to a specific embodiment of the present invention, a user can access the system in timing mode; a user who does so is referred to below as a timing user. When the user access mode detection module detects that a user is accessing the system in timing mode, it sends the timing user a prompt stating that the current access mode is timing mode and showing the timing user's remaining usage time. According to another embodiment of the present invention, the system shows the timing user a real-time countdown of the remaining usage time in the display interface during use. According to a further embodiment of the present invention, the system offers the timing user a choice of databases of partial or full scope. According to a specific embodiment of the present invention, the system estimates the time needed to check the document from the database scope the timing user selects and the number of characters in the uploaded document, and tells the timing user whether the remaining usage time is sufficient to complete the current plagiarism check.
According to a specific embodiment of the present invention, after a timing user logs in to the system, the user detection mode determining module determines the plagiarism detection mode. According to a specific embodiment of the present invention, the system offers the following modes to choose from: self-audit mode, ordinary plagiarism identification mode, extended plagiarism identification mode, multilingual plagiarism identification mode, and formula plagiarism identification mode.
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current user's detection mode is self-audit mode, the user writing style test module presents the user with one or more test pictures, and the user writes, online and within a stipulated time, a text description of the test pictures of no fewer than a stipulated number of words. Preferably, the user writing style test module also presents the user with one or more test articles, and the user writes, online and within the stipulated time, a text commentary of no fewer than the stipulated number of words. The user writing style test module selects the test pictures and test articles at random from a test picture library and a test article library. Whether test pictures or test articles are used, the user must write the description or commentary online. Because the stipulated time cannot be set too long, it is generally chosen as 30 minutes or 60 minutes, and the corresponding stipulated word count for the description or commentary is generally chosen as 400 words per 30 minutes or 800 words per 60 minutes. Those skilled in the art may set other stipulated times or stipulated word counts as needed. Experimental data indicate that the stipulated time should not be too long, so that the user does not fail to complete the test for lack of time or because of an unstable network connection; in addition, the ratio of the stipulated word count to the stipulated time should not be too low, otherwise the test will not faithfully reflect the user's writing habits. Because the stipulated time is limited, the length of the description or commentary is limited as well, and feature values and feature vectors extracted only from the text written in the online test may not truly reflect the user's writing habits. It is therefore necessary to also extract a test picture description benchmark feature vector and a test article description benchmark feature vector, which are used to correct the feature vector deviation caused by the short length of the description or commentary.
According to a specific embodiment of the present invention, every test picture in the test picture library has a test picture benchmark feature vector. This vector is obtained as follows: a predetermined number of benchmark testers are selected at random from groups with different backgrounds; each of them writes a description of the specific test picture of no fewer than the stipulated number of words; all the descriptions are collected; the test picture text description feature values are computed for that test picture; feature vectors are calculated from those feature values; and the feature vectors are weighted to obtain the test picture benchmark feature vector of the specific test picture. The weights used in the weighting operation are set by the system. Likewise, every test article in the test article library has a test article benchmark feature vector. This vector is obtained as follows: a predetermined number of benchmark testers are selected at random from groups with different backgrounds; each of them writes a description of the specific test article of no fewer than the stipulated number of words; all the descriptions are collected; the test article text description feature values are computed for that test article; feature vectors are calculated from those feature values; and the feature vectors are weighted to obtain the test article benchmark feature vector of the specific test article. The weights used in the weighting operation are set by the system.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers is selected at random from groups with different backgrounds, the testers can be chosen by age bracket, preferably divided into an under-20 group, a 20-29 group, a 30-39 group, a 40-49 group, and a 50-and-over group. This collects descriptions of no fewer than the stipulated number of words, written by people in each age group, for the same test picture or the same test article.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers is selected at random from groups with different backgrounds, the testers can be chosen by educational level, preferably divided into a below-undergraduate group, an undergraduate group, a master's student group, and a doctoral student group. This collects descriptions of no fewer than the stipulated number of words, written by people in each educational group, for the same test picture or the same test article.
According to a specific embodiment of the present invention, when the predetermined number of benchmark testers is selected at random from groups with different backgrounds, the testers can be chosen by professional field (the fields can be divided according to the required test precision, which is not described further here). This collects descriptions of no fewer than the stipulated number of words, written by people in each professional field group, for the same test picture or the same test article.
According to a specific embodiment of the present invention, the test picture text description feature value generation module obtains the test picture description texts written by the benchmark testers and generates the test picture text description feature values. The test picture text description feature values include, but are not limited to: Chinese character count, foreign-language character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation mark usage, and part-of-speech usage. According to a specific embodiment of the present invention, these feature values are defined as follows. The Chinese character count is the number of Chinese characters, excluding punctuation marks, contained in each test picture text description, each Chinese character counting as one character. The foreign-language character count is the number of foreign-language characters, excluding punctuation marks, contained in each test picture text description, each foreign-language character counting as one character. The total word count is the total number of words obtained by segmenting each test picture text description; Chinese text can be segmented with the system's built-in word segmentation lexicon, and foreign-language text can be segmented directly on the spaces between words, following foreign-language writing conventions. The content word count is the number of content words in each test picture text description, obtained after segmentation by comparing the segmentation result with the parts of speech recorded in the segmentation lexicon; it can be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count. The function word count is the number of function words in each test picture text description, obtained after segmentation in the same way; it can be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count. The paragraph count is the number of paragraphs in each test picture text description. The paragraph length distribution is the number of words and the number of sentences in each paragraph of each test picture text description. The sentence count is the number of sentences in each test picture text description. The sentence length distribution is the number of words in each sentence of each test picture text description. Synonym and near-synonym expansion is obtained by comparing the segmentation result of each test picture text description with a synonym and near-synonym lexicon, grouping the segmented words with the same or similar meaning into sets, and counting the words in each set; this reflects the synonym and near-synonym writing habits of the author of the description. The more words a synonym or near-synonym set contains, the more the author's writing style tends toward expansion with synonyms or near-synonyms; the fewer words the set contains, the less the author's writing style tends toward such expansion. Function word usage is the set of statistics on the function words used in each test picture text description, including but not limited to the frequency ranking of the function words used, the number of words separating each pair of different function words, and the number of words separating occurrences of the same function word. For example, the usage of the three structural particles 的, 地 and 得 can additionally be counted, which shows whether the author of the description distinguishes among 的, 地 and 得. Punctuation mark usage is the set of statistics on the punctuation marks used in each test picture text description, including but not limited to the frequency ranking of the punctuation marks used, the number of words separating each pair of different punctuation marks, and the number of words separating occurrences of the same punctuation mark. Part-of-speech usage is the set of statistics on the words of each part of speech in each test picture text description, obtained after segmentation by comparing the segmentation result with the parts of speech recorded in the segmentation lexicon; for example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words, and the ratio of each of these numbers to the total word count of the test picture text description.
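Some of the feature values listed above can be computed without a segmentation lexicon. The following Python sketch is an illustration under stated assumptions, not the patented implementation: it treats code points in the CJK Unified Ideographs range U+4E00 to U+9FFF as Chinese characters, counts only ASCII letters as foreign-language characters, splits paragraphs on line breaks and sentences on a small set of terminal punctuation marks. The function names `basic_feature_values` and `gaps_between_same_token` are hypothetical, and the second function assumes the text has already been segmented into a token list by some other component.

```python
import re
from collections import Counter

# Assumed character classes; a production system would likely use a fuller definition.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]+")
SENTENCE_END = re.compile(r"[。！？!?.]+")
PUNCT = re.compile(r"[，。！？；：、,.!?;:]")


def basic_feature_values(text: str) -> dict:
    """Count characters, paragraphs, sentences and punctuation marks in one description."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return {
        "chinese_chars": len(CJK.findall(text)),
        "foreign_chars": sum(len(w) for w in LATIN_WORD.findall(text)),
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "punctuation": Counter(PUNCT.findall(text)),
    }


def gaps_between_same_token(tokens: list, target: str) -> list:
    """Number of tokens separating consecutive occurrences of the same function word or punctuation mark."""
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]
```

Word-level values such as the total word count, content and function word counts, and part-of-speech counts depend on the segmentation lexicon and are therefore left out of this sketch. Splitting sentences on "." will also split decimal numbers, which a fuller implementation would need to handle.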
According to a specific embodiment of the present invention, the test picture text description feature value generation module generates a test picture text description feature vector from the test picture text description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the test picture text description feature vector, the content of each item in the feature vector, and the order of the items. When the dimension of the test picture text description feature vector is n, the vector can be expressed as TPCVE=[TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the first item value in the test picture text description feature vector, TPC_m is the m-th item value, and TPC_n is the n-th item value.
Preferably, the test picture text description feature vector includes one or more of the following: the ratio of the Chinese character count to the total word count, the ratio of the foreign-language character count to the total word count, the ratio of the content word count to the total word count, the ratio of the function word count to the total word count, the ratio of the total word count to the paragraph count, the word count of the longest paragraph, the ratio of the number of synonym and near-synonym expansions to the total word count, the ratio of the number of punctuation marks used to the total word count, the ratio of the number of nouns to the total word count, the ratio of the number of verbs to the total word count, the ratio of the number of adjectives to the total word count, the ratio of the number of numerals to the total word count, the ratio of the number of measure words to the total word count, the ratio of the number of pronouns to the total word count, the ratio of the number of adverbs to the total word count, the ratio of the number of prepositions to the total word count, the ratio of the number of conjunctions to the total word count, the ratio of the number of auxiliary words to the total word count, the ratio of the number of interjections to the total word count, and the ratio of the number of onomatopoeic words to the total word count.
According to a specific embodiment of the present invention, the test picture benchmark feature vector generation module collects the test picture text description feature vectors obtained for the same test picture and weights them to obtain the benchmark feature vector of that specific test picture; the weights used in the weighting operation are set by the system. Preferably, the test picture benchmark feature vector generation module can collect a predetermined number of test picture text description feature vectors separately for each age group, each educational group and each professional field group, and weight each collection separately, to obtain the benchmark feature vector of the specific test picture for each age group, each educational group and each professional field group.
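The benchmark feature vector of a specific test picture can be expressed as a weighted sum over the benchmark testers:

$$\mathrm{TPCVE\_ID} = \left[\ \sum_{i=1}^{k} W_{1,i}\,\mathrm{TPC\_1}_i,\ \ldots,\ \sum_{i=1}^{k} W_{m,i}\,\mathrm{TPC\_m}_i,\ \ldots,\ \sum_{i=1}^{k} W_{n,i}\,\mathrm{TPC\_n}_i\ \right]$$

where TPCVE_ID denotes the test picture benchmark feature vector of the test picture numbered ID; k is the number of benchmark testers; $\mathrm{TPC\_1}_i$ denotes the first item value of the feature vector of the i-th benchmark tester; $\mathrm{TPC\_m}_i$ denotes the m-th item value of the feature vector of the i-th benchmark tester; $\mathrm{TPC\_n}_i$ denotes the n-th item value of the feature vector of the i-th benchmark tester; $W_{1,i}$ is the weight coefficient of $\mathrm{TPC\_1}_i$; $W_{m,i}$ is the weight coefficient of $\mathrm{TPC\_m}_i$; and $W_{n,i}$ is the weight coefficient of $\mathrm{TPC\_n}_i$.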
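The weighting step for the benchmark vector can be sketched in Python as follows. This is a minimal illustration that assumes the weighting operation is a plain weighted sum over the k benchmark testers, with one weight per tester and per item; whether the original formula also normalizes by k is not recoverable from the text, so any such normalization is assumed here to be folded into the system-set weights. The function name `benchmark_feature_vector` is hypothetical.

```python
from typing import List, Sequence


def benchmark_feature_vector(
    tester_vectors: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
) -> List[float]:
    """Combine k tester feature vectors into one benchmark vector.

    tester_vectors[i][m] is item m of tester i's feature vector;
    weights[i][m] is the system-set weight coefficient for that value.
    """
    k = len(tester_vectors)
    if k == 0:
        raise ValueError("at least one benchmark tester is required")
    n = len(tester_vectors[0])
    if len(weights) != k or any(len(v) != n for v in tester_vectors) or any(len(w) != n for w in weights):
        raise ValueError("tester_vectors and weights must both have shape k x n")
    # Assumed form: item m of the benchmark is the weighted sum of item m over all testers.
    return [sum(weights[i][m] * tester_vectors[i][m] for i in range(k)) for m in range(n)]
```

With equal weights of 1/k for every tester the result is the per-item mean; per-group benchmarks (age, education, professional field) would call the same function on each group's vectors.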
According to a specific embodiment of the present invention, the test article text description feature value generation module obtains the test article description texts written by the benchmark testers and generates the test article text description feature values. The test article text description feature values include, but are not limited to: Chinese character count, foreign-language character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation mark usage, and part-of-speech usage. According to a specific embodiment of the present invention, these feature values are defined as follows. The Chinese character count is the number of Chinese characters, excluding punctuation marks, contained in each test article text description, each Chinese character counting as one character. The foreign-language character count is the number of foreign-language characters, excluding punctuation marks, contained in each test article text description, each foreign-language character counting as one character. The total word count is the total number of words obtained by segmenting each test article text description; Chinese text can be segmented with the system's built-in word segmentation lexicon, and foreign-language text can be segmented directly on the spaces between words, following foreign-language writing conventions. The content word count is the number of content words in each test article text description, obtained after segmentation by comparing the segmentation result with the parts of speech recorded in the segmentation lexicon; it can be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count. The function word count is the number of function words in each test article text description, obtained after segmentation in the same way; it can be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count. The paragraph count is the number of paragraphs in each test article text description. The paragraph length distribution is the number of words and the number of sentences in each paragraph of each test article text description. The sentence count is the number of sentences in each test article text description. The sentence length distribution is the number of words in each sentence of each test article text description. Synonym and near-synonym expansion is obtained by comparing the segmentation result of each test article text description with a synonym and near-synonym lexicon, grouping the segmented words with the same or similar meaning into sets, and counting the words in each set; this reflects the synonym and near-synonym writing habits of the author of the description. The more words a synonym or near-synonym set contains, the more the author's writing style tends toward expansion with synonyms or near-synonyms; the fewer words the set contains, the less the author's writing style tends toward such expansion. Function word usage is the set of statistics on the function words used in each test article text description, including but not limited to the frequency ranking of the function words used, the number of words separating each pair of different function words, and the number of words separating occurrences of the same function word. For example, the usage of the three structural particles 的, 地 and 得 can additionally be counted, which shows whether the author of the description distinguishes among 的, 地 and 得. Punctuation mark usage is the set of statistics on the punctuation marks used in each test article text description, including but not limited to the frequency ranking of the punctuation marks used, the number of words separating each pair of different punctuation marks, and the number of words separating occurrences of the same punctuation mark. Part-of-speech usage is the set of statistics on the words of each part of speech in each test article text description, obtained after segmentation by comparing the segmentation result with the parts of speech recorded in the segmentation lexicon; for example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words, and the ratio of each of these numbers to the total word count of the test article text description.
According to a specific embodiment of the present invention, the test article text description feature value generation module generates a test article text description feature vector from the test article text description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the test article text description feature vector, the content of each item in the feature vector, and the order of the items. When the dimension of the test article text description feature vector is n, the vector can be expressed as TTCVE=[TTC_1, ..., TTC_m, ..., TTC_n], where TTC_1 is the first item value in the test article text description feature vector, TTC_m is the m-th item value, and TTC_n is the n-th item value.
Preferably, the test article text description feature vector includes one or more of the following: the ratio of the Chinese character count to the total word count, the ratio of the foreign-language character count to the total word count, the ratio of the content word count to the total word count, the ratio of the function word count to the total word count, the ratio of the total word count to the paragraph count, the word count of the longest paragraph, the ratio of the number of synonym and near-synonym expansions to the total word count, the ratio of the number of punctuation marks used to the total word count, the ratio of the number of nouns to the total word count, the ratio of the number of verbs to the total word count, the ratio of the number of adjectives to the total word count, the ratio of the number of numerals to the total word count, the ratio of the number of measure words to the total word count, the ratio of the number of pronouns to the total word count, the ratio of the number of adverbs to the total word count, the ratio of the number of prepositions to the total word count, the ratio of the number of conjunctions to the total word count, the ratio of the number of auxiliary words to the total word count, the ratio of the number of interjections to the total word count, and the ratio of the number of onomatopoeic words to the total word count.
According to a specific embodiment of the present invention, the test article benchmark feature vector generation module collects the test article text description feature vectors obtained for the same test article and weights them to obtain the benchmark feature vector of that specific test article; the weights used in the weighting operation are set by the system. Preferably, the test article benchmark feature vector generation module can collect a predetermined number of test article text description feature vectors separately for each age group, each educational group and each professional field group, and weight each collection separately, to obtain the benchmark feature vector of the specific test article for each age group, each educational group and each professional field group.
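The benchmark feature vector of a specific test article can be expressed as a weighted sum over the benchmark testers:

$$\mathrm{TTCVE\_ID} = \left[\ \sum_{i=1}^{k} W_{1,i}\,\mathrm{TTC\_1}_i,\ \ldots,\ \sum_{i=1}^{k} W_{m,i}\,\mathrm{TTC\_m}_i,\ \ldots,\ \sum_{i=1}^{k} W_{n,i}\,\mathrm{TTC\_n}_i\ \right]$$

where TTCVE_ID denotes the test article benchmark feature vector of the test article numbered ID; k is the number of benchmark testers; $\mathrm{TTC\_1}_i$ denotes the first item value of the feature vector of the i-th benchmark tester; $\mathrm{TTC\_m}_i$ denotes the m-th item value of the feature vector of the i-th benchmark tester; $\mathrm{TTC\_n}_i$ denotes the n-th item value of the feature vector of the i-th benchmark tester; $W_{1,i}$ is the weight coefficient of $\mathrm{TTC\_1}_i$; $W_{m,i}$ is the weight coefficient of $\mathrm{TTC\_m}_i$; and $W_{n,i}$ is the weight coefficient of $\mathrm{TTC\_n}_i$.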
According to a specific embodiment of the present invention, the test picture text description feature vector and the test article text description feature vector have the same dimension, and the meaning and order of their feature values are the same. For example, in both vectors the 1st feature value can be set to the ratio of the Chinese character count to the total word count, the 2nd to the ratio of the foreign-language character count to the total word count, the 3rd to the ratio of the content word count to the total word count, the 4th to the ratio of the function word count to the total word count, the 5th to the ratio of the total word count to the paragraph count, the 6th to the word count of the longest paragraph, the 7th to the ratio of the number of synonym and near-synonym expansions to the total word count, the 8th to the ratio of the number of punctuation marks used to the total word count, the 9th to the ratio of the number of nouns to the total word count, the 10th to the ratio of the number of verbs to the total word count, the 11th to the ratio of the number of adjectives to the total word count, the 12th to the ratio of the number of numerals to the total word count, the 13th to the ratio of the number of measure words to the total word count, the 14th to the ratio of the number of pronouns to the total word count, the 15th to the ratio of the number of adverbs to the total word count, the 16th to the ratio of the number of prepositions to the total word count, the 17th to the ratio of the number of conjunctions to the total word count, the 18th to the ratio of the number of auxiliary words to the total word count, the 19th to the ratio of the number of interjections to the total word count, and the 20th to the ratio of the number of onomatopoeic words to the total word count.
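The example ordering above can be written down once and shared by the picture and article vectors, which keeps their dimension and item order identical by construction. The Python sketch below assumes the raw counts are already available in a dictionary; the key names (for example `total_words`, `longest_paragraph_words`, `synonym_expansions`) and the function name `description_feature_vector` are hypothetical, and treating a missing count as zero is a choice made for this illustration.

```python
from typing import Dict, List

# Items 1-4: count / total word count.
LEADING_RATIOS = ("chinese_chars", "foreign_chars", "content_words", "function_words")
# Items 7-20: count / total word count.
TRAILING_RATIOS = (
    "synonym_expansions", "punctuation_marks", "nouns", "verbs", "adjectives",
    "numerals", "measure_words", "pronouns", "adverbs", "prepositions",
    "conjunctions", "auxiliary_words", "interjections", "onomatopoeia",
)


def description_feature_vector(counts: Dict[str, float]) -> List[float]:
    """Build the 20-item vector in the example order given in the text."""
    total = counts["total_words"]
    paragraphs = counts["paragraphs"]
    if total <= 0 or paragraphs <= 0:
        raise ValueError("total_words and paragraphs must be positive")
    vector = [counts.get(key, 0) / total for key in LEADING_RATIOS]
    vector.append(total / paragraphs)                           # item 5: total words / paragraphs
    vector.append(float(counts["longest_paragraph_words"]))     # item 6: longest paragraph word count
    vector.extend(counts.get(key, 0) / total for key in TRAILING_RATIOS)
    return vector
```

Because items 5 and 6 are not ratios to the total word count, they sit on a different scale from the other eighteen items; the text leaves any rescaling to the system-set weights.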
According to a specific embodiment of the present invention, feature values can be added to or removed from the test picture text description feature vector and the test article text description feature vector, but after feature values are added or removed the two vectors must still have the same dimension and the same meaning and order of feature values.
According to a specific embodiment of the present invention, the user test picture text description feature value generation module obtains the user's test picture description text and generates the user test picture text description feature values. The user test picture text description feature values cover the same content as the test picture text description feature values, which is not repeated here. The user test picture text description feature vector generation module calculates the user test picture text description feature vector from the user test picture text description feature values. When the dimension of the test picture text description feature vector is n, the feature vector of the current user USER's text description of the test picture numbered ID can be expressed as TPCVE_ID_USER=[TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER], where TPC_1_USER is the first item value in the current user USER's user test picture text description feature vector, TPC_m_USER is the m-th item value, and TPC_n_USER is the n-th item value.
The user picture writing style feature vector generation module calculates the difference between the user test picture text description feature vector TPCVE_ID_USER and the test picture benchmark feature vector TPCVE_ID of the corresponding test picture, and uses this difference (TPCVE_ID_USER-TPCVE_ID) as the user picture writing style feature vector TPCVE_USER.
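The difference step is an item-by-item subtraction of the benchmark vector from the user's vector. A minimal Python sketch follows; the function name `style_vector` is hypothetical, and the same function would apply unchanged to the article vectors.

```python
from typing import List, Sequence


def style_vector(user_vector: Sequence[float], benchmark_vector: Sequence[float]) -> List[float]:
    """Return user_vector - benchmark_vector, item by item (e.g. TPCVE_ID_USER - TPCVE_ID)."""
    if len(user_vector) != len(benchmark_vector):
        raise ValueError("user and benchmark vectors must have the same dimension")
    return [u - b for u, b in zip(user_vector, benchmark_vector)]
```

A positive item means the user exceeds the benchmark testers on that feature for the same picture, and a negative item means the user falls below them, so the vector describes the user's habits relative to what the test material itself tends to elicit.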
According to a specific embodiment of the present invention, the user test article text description feature value generation module obtains the user's test article description text and generates the user test article text description feature values. The user test article text description feature values cover the same content as the test article text description feature values, which is not repeated here. The user test article text description feature vector generation module calculates the user test article text description feature vector from the user test article text description feature values. When the dimension of the test article text description feature vector is n, the feature vector of the current user USER's text description of the test article numbered ID can be expressed as TTCVE_ID_USER=[TTC_1_USER, ..., TTC_m_USER, ..., TTC_n_USER], where TTC_1_USER is the first item value in the current user USER's user test article text description feature vector, TTC_m_USER is the m-th item value, and TTC_n_USER is the n-th item value.
The user article writing style feature vector generation module calculates the difference between the user test article text description feature vector TTCVE_ID_USER and the test article benchmark feature vector TTCVE_ID of the corresponding test article, and uses this difference (TTCVE_ID_USER-TTCVE_ID) as the user article writing style feature vector TTCVE_USER.
According to a specific embodiment of the present invention, when several test pictures or several test articles are used, or when one or more test pictures and one or more test articles are used together, the user test picture text description feature value generation module and the user test article text description feature value generation module generate the user test picture and/or article text description feature values from each of the user's test picture description texts and test article description texts, respectively. The user test picture text description feature vector generation module and the user test article text description feature vector generation module then generate the user test picture and/or article text description feature vectors from those feature values. The user picture writing style feature vector generation module and the user article writing style feature vector generation module calculate the difference between each user test picture and/or article text description feature vector and the corresponding test picture and/or article benchmark feature vector, and weight the differences to obtain the user's picture writing style feature vector TPCVE_USER and article writing style feature vector TTCVE_USER, respectively. The user writing style feature vector generation module weights the user's picture writing style feature vector TPCVE_USER and article writing style feature vector TTCVE_USER to obtain the user writing style feature vector TVE_USER. The weights of these weighting operations can be chosen according to actual needs.
TVE_USER = TPCVE_USER * W_P + TTCVE_USER * W_T
where W_P is the weight coefficient of the user's picture writing style feature vector TPCVE_USER and W_T is the weight coefficient of the user's article writing style feature vector TTCVE_USER. When the user takes only the picture writing test or only the article writing test, the weight coefficient of the test taken can be set to 1 and the weight coefficient of the test not taken can be set to 0. Preferably, the weights can be chosen to be equal.
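The weighting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `combine_style_vectors`, the sample values, and the reading of "equal weights" as 0.5 each are not taken from the patent.

```python
from typing import List, Sequence


def combine_style_vectors(tpcve_user: Sequence[float],
                          ttcve_user: Sequence[float],
                          w_p: float = 0.5,
                          w_t: float = 0.5) -> List[float]:
    """Return TVE_USER = TPCVE_USER * W_P + TTCVE_USER * W_T, entry by entry."""
    if len(tpcve_user) != len(ttcve_user):
        raise ValueError("both style vectors must have the same dimension n")
    return [p * w_p + t * w_t for p, t in zip(tpcve_user, ttcve_user)]


# Both tests taken, equal weights (0.5 each is an assumed reading of "equal").
both = combine_style_vectors([0.2, -0.1, 0.4], [0.0, 0.3, -0.2])

# Picture test only: W_P = 1 and W_T = 0, so the article vector is ignored.
picture_only = combine_style_vectors([0.2, -0.1, 0.4], [0.0, 0.0, 0.0],
                                     w_p=1.0, w_t=0.0)
```

With W_P = 1 and W_T = 0 the result is simply the picture writing style vector, which matches the single-test case described above.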
The user writing style feature vector can be expressed as TVE_USER = [TVE_1, ..., TVE_m, ..., TVE_n], where TVE_1 is the first entry of the user writing style feature vector, TVE_m is the m-th entry, and TVE_n is the n-th entry.
According to a specific embodiment of the present invention, the user detection mode determination module further prompts the user to upload the pending document, and the pending document feature value generation module generates the pending document feature values of that document. The pending document feature values include but are not limited to: Chinese character count, foreign character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage, and part-of-speech usage.
According to a specific embodiment of the present invention, these values are defined as follows. The Chinese character count is the number of Chinese characters in each pending document excluding punctuation marks, with each Chinese character counted as one character. The foreign character count is the number of foreign-language characters in each pending document excluding punctuation marks, with each foreign-language character counted as one character. The word count is the total number of words obtained after segmenting each pending document; Chinese text can be segmented with the system's built-in segmentation lexicon, while foreign-language text can be segmented directly on the spaces between words, following foreign-language writing conventions. The content word count is the number of content words in each pending document, obtained by comparing the segmentation result with the parts of speech in the segmentation lexicon; it can be further divided into a Chinese content word count and a foreign content word count, whose sum equals the content word count. The function word count is the number of function words in each pending document, obtained in the same way; it can be further divided into a Chinese function word count and a foreign function word count, whose sum equals the function word count. The paragraph count is the number of paragraphs in each pending document. The paragraph length distribution is the number of words and the number of sentences contained in each paragraph. The sentence count is the number of sentences in each pending document. The sentence length distribution is the number of words contained in each sentence.
Synonym and near-synonym expansion is obtained by comparing the segmentation result of each pending document with a synonym and near-synonym lexicon, collecting segmented words with the same or similar meaning into sets, and counting the words in each set. This reflects the synonym and near-synonym writing habits of the document's author: the more words a synonym or near-synonym set contains, the more the author tends to vary wording with synonyms or near-synonyms; the fewer words it contains, the less the author tends to do so. Function word usage refers to statistics on the function words used in each pending document, including but not limited to the usage ranking of function words, the number of words between each pair of different function words, and the number of words between each pair of identical function words. For example, the usage of the three structural auxiliary words "的", "地", and "得" can be further counted to reflect whether the author of the pending document distinguishes among them. Punctuation usage refers to statistics on the punctuation marks used in each pending document, including but not limited to the usage ranking of punctuation marks, the number of words between each pair of different punctuation marks, and the number of words between each pair of identical punctuation marks. Part-of-speech usage refers to statistics on the segmented words of each part of speech in each pending document, obtained by comparing the segmentation result with the parts of speech in the segmentation lexicon: for example, the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words, and the ratio of each of these numbers to the total word count of the pending document.
According to a specific embodiment of the present invention, the pending document feature vector generation module generates the pending document feature vector from the pending document feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the pending document's feature vector as well as the content and order of its entries. These must remain consistent with the dimension of the test picture reference feature vector and the test article reference feature vector, and with the meaning and order of the feature values they contain. When the dimension of the pending document's feature vector is n, the vector can be expressed as TDCVE_USER = [TDC_1, ..., TDC_m, ..., TDC_n], where TDC_1 is the first entry of the pending document's feature vector, TDC_m is the m-th entry, and TDC_n is the n-th entry.
Preferably, the feature vector of the pending document includes: the ratio of the Chinese character count to the total word count, the ratio of the foreign character count to the total word count, the ratio of the content word count to the total word count, the ratio of the function word count to the total word count, the ratio of the total word count to the paragraph count, the word count of the longest paragraph, the ratio of the synonym and near-synonym expansion count to the total word count, the ratio of the punctuation usage count to the total word count, and the ratios to the total word count of the numbers of nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeic words.
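As a rough illustration of the part-of-speech entries in this preferred vector, the sketch below computes the ratio of each part-of-speech count to the total word count from words that have already been segmented and tagged. The tag names, the function name `pos_ratio_features`, and the sample input are hypothetical; the patent does not specify a tag set or a segmenter.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical tag names; a real segmentation lexicon would define its own.
POS_TAGS = ["noun", "verb", "adjective", "numeral", "measure", "pronoun",
            "adverb", "preposition", "conjunction", "auxiliary",
            "interjection", "onomatopoeia"]


def pos_ratio_features(tagged_words: List[Tuple[str, str]]) -> Dict[str, float]:
    """Ratio of each part-of-speech count to the total word count."""
    total = len(tagged_words)
    if total == 0:
        # An empty document gets all-zero ratios (an assumption).
        return {tag: 0.0 for tag in POS_TAGS}
    counts = Counter(tag for _, tag in tagged_words)
    return {tag: counts[tag] / total for tag in POS_TAGS}


# Placeholder (word, part of speech) pairs standing in for a segmentation result.
sample = [("w1", "noun"), ("w2", "verb"), ("w3", "adverb"), ("w4", "noun")]
ratios = pos_ratio_features(sample)
```

The remaining entries of the vector (character counts, paragraph statistics, synonym expansion, punctuation) would be appended in the fixed order the system specifies, so that pending and reference vectors stay comparable entry by entry.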
The user writing style similarity calculation module calculates the current user's writing style similarity, which can be calculated by the following formula:
The user writing style similarity judgment module compares the current user's writing style similarity SimT(USER) with the self-check threshold preset by the system. When SimT(USER) is higher than the self-check threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when SimT(USER) is lower than the self-check threshold, the pending document submitted by the current user is considered consistent with the user's writing style.
The self-check threshold is set in advance by the system. If the self-check threshold is set too high, pending documents that are inconsistent with the user's writing style are easily misjudged; if it is set too low, pending documents that are consistent with the user's writing style are easily misjudged. In general, the self-check threshold is selected and verified by the system in advance through experiments, and the system can adjust it at any time according to operating conditions.
According to a specific embodiment of the present invention, a first self-check threshold and a second self-check threshold can be set, with the first self-check threshold higher than the second. When the user writing style similarity SimT(USER) is higher than the first self-check threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style. When SimT(USER) is lower than the second self-check threshold, the pending document is considered consistent with the user's writing style. When SimT(USER) is greater than or equal to the second self-check threshold and less than or equal to the first self-check threshold, the user's writing style is verified further.
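The two-threshold rule can be written as a small decision function. This is a minimal sketch: the function name `judge_style`, the returned labels, and the threshold values 0.6 and 0.3 are illustrative assumptions, since the patent leaves the actual thresholds to experiment.

```python
def judge_style(sim_t: float, first_threshold: float, second_threshold: float) -> str:
    """Classify SimT(USER) against the first (higher) and second (lower) thresholds."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be higher than the second")
    if sim_t > first_threshold:
        return "inconsistent"
    if sim_t < second_threshold:
        return "consistent"
    # second_threshold <= sim_t <= first_threshold: further verification is needed.
    return "verify_further"


# Hypothetical threshold values, for illustration only.
FIRST, SECOND = 0.6, 0.3
high = judge_style(0.75, FIRST, SECOND)
low = judge_style(0.10, FIRST, SECOND)
middle = judge_style(0.45, FIRST, SECOND)
```

Note that in this scheme a higher SimT(USER) indicates a larger departure from the user's style, so the value behaves like a distance even though the text calls it a similarity.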
The first self-check threshold and the second self-check threshold are set in advance by the system. If the first self-check threshold is set too high, pending documents that are inconsistent with the user's writing style are easily misjudged; if the second self-check threshold is set too low, pending documents that are consistent with the user's writing style are easily misjudged; and if the interval between the first and second self-check thresholds is set too wide, the user's writing style will be re-verified too often. In general, the first and second self-check thresholds are selected and verified by the system in advance through experiments, and the system can adjust them at any time according to operating conditions.
According to a specific embodiment of the present invention, further verification of the user's writing style is performed by the user writing style structural auxiliary word judgment module. This module determines how the three structural auxiliary words "的", "地", and "得" are used in the pending document and in the user test picture description text and/or user test article description text, which reflects the degree to which the author of the pending document and the current user distinguish among the three. For the pending document, the module counts the number of times "的", "地", and "得" are used in the full text, denoted T1, T2, and T3 respectively. It further counts the number of times the segmented word following "的" in the full text is a noun, denoted D1; the number of times the segmented word following "地" is a verb, denoted D2; and the number of times the segmented word following "得" is an adjective, denoted D3. It then calculates the ratio D1/T1 of the number of times "的" is followed by a noun to the total number of uses of "的" in the full text; the ratio D2/T2 of the number of times "地" is followed by a verb to the total number of uses of "地"; and the ratio D3/T3 of the number of times "得" is followed by an adjective to the total number of uses of "得". From these it calculates the distinction coefficient DC_TD for "的", "地", and "得". The value of DC_TD is greater than or equal to 0 and less than or equal to 3.
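The text does not give the formula for the distinction coefficient. One form consistent with the stated range of 0 to 3 is the sum of the three ratios, D1/T1 + D2/T2 + D3/T3, and the sketch below adopts that form purely as an assumption. The tag names, the function name `distinction_coefficient`, and the sample sentence are also hypothetical.

```python
from typing import List, Tuple

# Part of speech expected after each structural auxiliary word, as described above.
EXPECTED_NEXT = {"的": "noun", "地": "verb", "得": "adjective"}


def distinction_coefficient(tagged_words: List[Tuple[str, str]]) -> float:
    """Sum of D_i / T_i over the three particles (assumed form, range 0 to 3)."""
    totals = {p: 0 for p in EXPECTED_NEXT}   # T1, T2, T3
    matches = {p: 0 for p in EXPECTED_NEXT}  # D1, D2, D3
    for i, (word, _) in enumerate(tagged_words):
        if word in EXPECTED_NEXT:
            totals[word] += 1
            follows = tagged_words[i + 1][1] if i + 1 < len(tagged_words) else None
            if follows == EXPECTED_NEXT[word]:
                matches[word] += 1
    # A particle that never occurs contributes 0 (an assumption; the text is silent).
    return sum(matches[p] / totals[p] for p in EXPECTED_NEXT if totals[p] > 0)


# Hypothetical segmented and tagged text.
sample = [("我", "pronoun"), ("的", "auxiliary"), ("书", "noun"),
          ("慢慢", "adverb"), ("地", "auxiliary"), ("走", "verb"),
          ("跑", "verb"), ("得", "auxiliary"), ("快", "adjective"),
          ("他", "pronoun"), ("的", "auxiliary"), ("跑", "verb")]
dc_td = distinction_coefficient(sample)
```

In the sample, "的" is followed by a noun in one of its two uses, while "地" and "得" are each used once in the expected way, so the three ratios are 1/2, 1, and 1.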
For the user test picture description text and/or user test article description text, the usage of the three structural auxiliary words "的", "地", and "得" is determined over the full description text (if the user was tested with several pictures and/or several articles, all description texts are merged into one full text). The module counts the number of times "的", "地", and "得" are used, denoted T1', T2', and T3' respectively. It further counts, in the full description text, the number of times the segmented word following "的" is a noun, denoted D1'; the number of times the segmented word following "地" is a verb, denoted D2'; and the number of times the segmented word following "得" is an adjective, denoted D3'. It then calculates the ratio D1'/T1' of the number of times "的" is followed by a noun to the total number of uses of "的" in the full text; the ratio D2'/T2' of the number of times "地" is followed by a verb to the total number of uses of "地"; and the ratio D3'/T3' of the number of times "得" is followed by an adjective to the total number of uses of "得". From these it calculates the distinction coefficient DC_TPT for "的", "地", and "得". The value of DC_TPT is greater than or equal to 0 and less than or equal to 3.
The user writing style structural auxiliary word judgment module calculates the offset degree DC_SC between the distinction coefficients DC_TD and DC_TPT, that is, it normalizes the absolute value of the difference between DC_TD and DC_TPT.
When the value of DC_SC is less than or equal to the judgment threshold of the offset degree DC_SC, the user writing style structural auxiliary word judgment module judges that the author of the pending document and the user who wrote the test picture description text and/or test article description text are consistent in their use of the three structural auxiliary words "的", "地", and "得". When the value of DC_SC is greater than the judgment threshold, the module judges that the two are inconsistent in their use of these three structural auxiliary words. The judgment threshold of DC_SC is set in advance by the system and can be adjusted at any time according to actual needs. Experimental data from the early operation of the system show that a value of DC_SC less than or equal to 10% reflects well that the author of the pending document and the user who wrote the test picture description text and/or test article description text are consistent in their use of "的", "地", and "得"; when the value of DC_SC is greater than 10%, the two can be considered inconsistent in their use of these words.
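The offset degree and its 10% threshold can be sketched as follows. The patent says only that the absolute difference is normalized; dividing by the maximum coefficient value of 3 is an assumption made here for illustration, as are the function names.

```python
def offset_degree(dc_td: float, dc_tpt: float, max_coefficient: float = 3.0) -> float:
    """DC_SC: |DC_TD - DC_TPT| scaled into [0, 1] (division by 3 is assumed)."""
    return abs(dc_td - dc_tpt) / max_coefficient


def particle_style_consistent(dc_td: float, dc_tpt: float,
                              threshold: float = 0.10) -> bool:
    """True when DC_SC does not exceed the judgment threshold (10% by default)."""
    return offset_degree(dc_td, dc_tpt) <= threshold


# Hypothetical coefficients for a pending document and for the test texts.
close_styles = particle_style_consistent(2.5, 2.35)  # DC_SC is about 0.05
far_styles = particle_style_consistent(2.5, 1.0)     # DC_SC is 0.5
```

A different normalization, for example relative to the larger of the two coefficients, would shift which differences fall under the 10% threshold, so the threshold and the normalization need to be chosen together.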
When the user writing style similarity SimT(USER) is greater than or equal to the second self-check threshold and less than or equal to the first self-check threshold, the user writing style judgment module uses the offset degree DC_SC to further judge whether the pending document submitted by the current user is consistent with the user's writing style. When DC_SC is greater than its judgment threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when DC_SC is less than or equal to its judgment threshold, the pending document is considered consistent with the user's writing style.
According to a specific embodiment of the present invention, the user access mode detection module prompts the user to upload the document to be identified.
When the user detection mode determination module determines that the current user's detection mode is the ordinary plagiarism identification mode, the document to be identified word segmentation module segments the document to be identified and obtains the segmentation result. When the document to be identified is segmented, the same segmentation procedure must be used as for the material in the comparison library.
According to a specific embodiment of the present invention, the document to be identified word part-of-speech classification module further obtains the part of speech corresponding to each segmentation result. The part-of-speech classification scheme is consistent with the one used for the material contained in the comparison library.
According to a specific embodiment of the present invention, the document to be identified word feature value generation module generates the word feature values of the document to be identified. It counts the number of times each segmented word occurs in the document to be identified and obtains, for each segmented word, the word feature value WCV_TBI = [W_ID, W_N], where W_ID is the unique number of the segmented word in the segmentation lexicon and W_N is the total number of times the word occurs in the document to be identified. Preferably, the part of speech of each segmented word is also taken into account, giving the word part-of-speech feature value WCCV_TBI = [W_ID, W_N, W_CHAR], where W_ID is the unique number of the segmented word in the segmentation lexicon, W_N is the total number of times the word occurs in the document to be identified, and W_CHAR is the part of speech of the word.
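A minimal sketch of building the [W_ID, W_N, W_CHAR] triples from an already segmented and tagged document is shown below. The lexicon mapping `lexicon_ids`, the function name, the sample words, and the choice to sort by W_ID are illustrative assumptions; the sketch also assumes every word is present in the lexicon.

```python
from collections import Counter
from typing import Dict, List, Tuple


def word_pos_feature_values(tagged_words: List[Tuple[str, str]],
                            lexicon_ids: Dict[str, int]) -> List[list]:
    """Return one [W_ID, W_N, W_CHAR] entry per distinct word, sorted by W_ID."""
    counts = Counter(word for word, _ in tagged_words)  # W_N per word
    pos_of: Dict[str, str] = {}
    for word, tag in tagged_words:
        pos_of.setdefault(word, tag)  # keep the first tag seen for each word
    features = [[lexicon_ids[word], n, pos_of[word]] for word, n in counts.items()]
    return sorted(features, key=lambda entry: entry[0])


# Hypothetical lexicon numbers and segmentation result.
lexicon_ids = {"paper": 7, "check": 3}
tagged = [("paper", "noun"), ("check", "verb"), ("paper", "noun")]
wccv_tbi = word_pos_feature_values(tagged, lexicon_ids)
```

Dropping the third element of each entry gives the plain word feature value WCV_TBI = [W_ID, W_N].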
According to a specific embodiment of the present invention, the document to be identified word tightness coefficient generation module generates the word tightness coefficients of the document to be identified. According to a specific embodiment of the present invention, the word tightness coefficients corresponding to each segmented word can be expressed as WGC_TBI = [G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of segmented words between the first and second occurrences of the word in the document to be identified, G_W_ID_2 is the number of segmented words between its second and third occurrences, and G_W_ID_(W_N-1) is the number of segmented words between its (W_N-1)-th and W_N-th occurrences. G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the word tightness coefficients corresponding to that word. According to a specific embodiment of the present invention, the word tightness coefficients corresponding to each segmented word can further be expressed in vector form as the word tightness coefficient feature vector WGCVE_TBI = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the segmented word in the segmentation lexicon, W_N is the total number of times the word occurs in the document to be identified, W_CHAR is the part of speech of the word, and G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are defined as above and are the tightness coefficients of the word part-of-speech feature vector for that word. The word feature vector tightness coefficients reveal how a specific segmented word is distributed over the whole document to be identified. When the document to be identified is very long overall, or the viewpoints it describes are scattered, this avoids missing key word feature values when word feature vectors are screened by the total count W_N or by (W_N / word free vector dimension WFV). Preferably, specific parts of a document to be identified can also be extracted for comparison according to the word feature vector tightness coefficients.
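The tightness coefficients of one word can be computed from the positions of its occurrences. The sketch below reads "the number of segmented words between two occurrences" as the count of words strictly between them; that reading, the function name, and the sample sequence are assumptions.

```python
from typing import List


def tightness_coefficients(words: List[str], target: str) -> List[int]:
    """Gaps between consecutive occurrences of `target`: W_N - 1 values in total."""
    positions = [i for i, word in enumerate(words) if word == target]
    # For neighbouring positions a < b, the words strictly between them number b - a - 1.
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


# Hypothetical segmentation result in which "a" occurs four times.
segmented = ["a", "x", "y", "a", "a", "z", "a"]
wgc_tbi = tightness_coefficients(segmented, "a")
```

Small gaps clustered in one region suggest the word is concentrated in a particular part of the document, which is the kind of passage the text proposes extracting for comparison.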
According to a specific embodiment of the present invention, the document to be identified word free vector dimension determination module determines the word free vector dimension WFV_TBI from the segmentation result of the document to be identified. When the document to be identified is short or yields few segmented words, the resulting WFV_TBI is small; when the document to be identified is long or yields many segmented words, the resulting WFV_TBI is large.
When the user detection mode determination module determines that the current user's detection mode is the extended plagiarism identification mode, the document to be identified word grouping module segments the document to be identified and obtains the word group result, in which segmented words with the same or similar meaning form one group and numbering is done by group, so that several segmented words with the same or similar meaning correspond to one word group number. When the document to be identified is segmented, the same segmentation procedure must be used as for the material in the comparison library.
According to a specific embodiment of the present invention, the document to be identified word group part-of-speech classification module further obtains the part of speech corresponding to each word group result. The word group part-of-speech classification scheme is consistent with the word group classification scheme used for the material contained in the comparison library.
According to a specific embodiment of the present invention, the document to be identified word group feature value generation module generates the word group feature values of the document to be identified. It counts the number of times each word group occurs in the document to be identified and obtains, for each word group, the word group feature value WGCV_TBI = [WG_ID, WG_N], where WG_ID is the unique number of the word group in the segmentation lexicon and WG_N is the total number of times the word group occurs in the document to be identified. Preferably, the part of speech of each word group is also taken into account, giving the word group part-of-speech feature value WGCCV_TBI = [WG_ID, WG_N, WG_CHAR], where WG_ID is the unique number of the word group in the segmentation lexicon, WG_N is the total number of times the word group occurs in the document to be identified, and WG_CHAR is the part of speech of the word group.
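Counting by word group rather than by individual word can be sketched as below. The synonym mapping `group_of`, its group numbers, the function name, and the decision to skip words that belong to no group are all illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List


def word_group_feature_values(words: List[str],
                              group_of: Dict[str, int]) -> List[List[int]]:
    """Return one [WG_ID, WG_N] entry per word group, sorted by WG_ID."""
    # Words with the same or similar meaning share one group number.
    counts = Counter(group_of[word] for word in words if word in group_of)
    return [[group_id, n] for group_id, n in sorted(counts.items())]


# Hypothetical synonym groups: group 1 covers three near-synonyms.
group_of = {"fast": 1, "quick": 1, "rapid": 1, "car": 2}
segmented = ["fast", "car", "quick", "car", "rapid"]
wgcv_tbi = word_group_feature_values(segmented, group_of)
```

Because "fast", "quick", and "rapid" all map to group 1, replacing one with another leaves the group count unchanged, which is what lets the extended mode see through synonym substitution.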
According to a specific embodiment of the present invention, the document to be identified word group tightness coefficient generation module generates the word group tightness coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightness coefficients corresponding to each word group can be expressed as WGGC_TBI = [G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 is the number of segmented words between the first and second occurrences of the word group in the document to be identified, G_WG_ID_2 is the number of segmented words between its second and third occurrences, and G_WG_ID_(WG_N-1) is the number of segmented words between its (WG_N-1)-th and WG_N-th occurrences. G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the word group tightness coefficients corresponding to that word group. According to a specific embodiment of the present invention, the word group tightness coefficients corresponding to each word group can further be expressed in vector form as the word group tightness coefficient feature vector WGGCVE_TBI = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID is the unique number of the word group in the segmentation lexicon, WG_N is the total number of times the word group occurs in the document to be identified, WG_CHAR is the part of speech of the word group, and G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are defined as above and are the tightness coefficients of the word group part-of-speech feature vector for that word group. The word group feature vector tightness coefficients reveal how a specific word group is distributed over the whole document to be identified. When the document to be identified is very long overall, or the viewpoints it describes are scattered, this avoids missing key word group feature values when word group feature vectors are screened by the total count WG_N or by (WG_N / word group free vector dimension WGFV). Preferably, specific parts of a document to be identified can also be extracted for comparison according to the word group feature vector tightness coefficients.
According to a specific embodiment of the present invention, the document to be identified word group free vector dimension determination module determines the word group free vector dimension WGFV_TBI from the segmentation result of the document to be identified. When the document to be identified is short or yields few segmented words, the resulting WGFV_TBI is small; when the document to be identified is long or yields many segmented words, the resulting WGFV_TBI is large.
When the user detection mode determination module determines that the current user's detection mode is the multilingual plagiarism identification mode, the document to be identified Chinese-foreign word grouping module segments the document to be identified and obtains the Chinese-foreign word group result, in which Chinese and foreign-language segmented words with the same or similar meaning form one group and numbering is done by group, so that several Chinese and foreign-language segmented words with the same or similar meaning correspond to one Chinese-foreign word group number. When the document to be identified is segmented, the same segmentation procedure must be used as for the material in the comparison library.
According to a specific embodiment of the present invention, the document to be identified word group part-of-speech classification module further obtains the part of speech corresponding to each word group result. The word group part-of-speech classification scheme is consistent with the word group classification scheme used for the material contained in the comparison library.
According to a specific embodiment of the present invention, the document to be identified Chinese-foreign word group feature value generation module generates the Chinese-foreign word group feature values of the document to be identified. It counts the number of times each Chinese-foreign word group occurs in the document to be identified and obtains, for each Chinese-foreign word group, the feature value WFGCV_TBI = [WFG_ID, WFG_N], where WFG_ID is the unique number of the Chinese-foreign word group in the segmentation lexicon and WFG_N is the total number of times the group occurs in the document to be identified. Preferably, the part of speech of each Chinese-foreign word group is also taken into account, giving the Chinese-foreign word group part-of-speech feature value WFGCCV_TBI = [WFG_ID, WFG_N, WFG_CHAR], where WFG_ID is the unique number of the Chinese-foreign word group in the segmentation lexicon, WFG_N is the total number of times the group occurs in the document to be identified, and WFG_CHAR is the part of speech of the group.
According to a specific embodiment of the present invention, the document to be identified Chinese-foreign word group tightness coefficient generation module generates the Chinese-foreign word group tightness coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightness coefficients corresponding to each Chinese-foreign word group can be expressed as WFGGC_TBI = [G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 is the number of segmented words between the first and second occurrences of the Chinese-foreign word group in the document to be identified, G_WFG_ID_2 is the number of segmented words between its second and third occurrences, and G_WFG_ID_(WFG_N-1) is the number of segmented words between its (WFG_N-1)-th and WFG_N-th occurrences. G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the Chinese-foreign word group tightness coefficients corresponding to that group. According to a specific embodiment of the present invention, the tightness coefficients corresponding to each Chinese-foreign word group can further be expressed in vector form as the Chinese-foreign word group tightness coefficient feature vector WFGGCVE_TBI = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID is the unique number of the Chinese-foreign word group in the segmentation lexicon, WFG_N is the total number of times the group occurs in the document to be identified, WFG_CHAR is the part of speech of the group, and G_WFG_ID_1 represents, for this Chinese-foreign word group,
The participle quantity being spaced between occurring for the first time and occur for second in the document to be identified, G_WFG_ID_2 are represented in this
There is the participle quantity being spaced between third time appearance, G_WFG_ second in the document to be identified in foreign language participle group
ID_ (WG_N-1) represents that foreign language participle group the institute between the W_N times appearance occurs the W_N-1 times in the document to be identified in this
The participle quantity at interval.Wherein, G_WFG_ID_1, G_WFG_ID_2 ..., G_WFG_ID_ (WFG_N-1) are foreign language point in this
Participle part of speech feature vector tightening coefficient corresponding to phrase.By middle foreign language participle group characteristic vector tightening coefficient, can know
Overall distribution situation of the specific middle foreign language participle group in corresponding document to be identified.
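As an illustration only, the vector [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, ..., G_WFG_ID_(WFG_N-1)] could be computed along the following lines. This is a minimal sketch, not the patented implementation: the function name, the two lookup tables and the sample data are hypothetical, and it assumes that the spacing between two occurrences is the count of segments lying strictly between them.

```python
from collections import defaultdict


def tightness_vectors(segments, ids, pos_tags):
    """Build [ID, N, POS, gap_1, ..., gap_(N-1)] for every segment group.

    segments: segment groups in document order.
    ids:      hypothetical lookup, segment group -> unique library number.
    pos_tags: hypothetical lookup, segment group -> part-of-speech label.
    """
    positions = defaultdict(list)
    for index, seg in enumerate(segments):
        positions[seg].append(index)

    vectors = {}
    for seg, where in positions.items():
        # Assumption: a gap counts the segments strictly between two
        # consecutive occurrences of the same segment group.
        gaps = [later - earlier - 1 for earlier, later in zip(where, where[1:])]
        vectors[seg] = [ids[seg], len(where), pos_tags[seg], *gaps]
    return vectors


# Hypothetical sample data, for illustration only.
doc = ["GPU", "uses", "CUDA", "GPU", "GPU", "runs", "GPU"]
ids = {"GPU": 7, "uses": 21, "CUDA": 9, "runs": 30}
pos = {"GPU": "A1", "uses": "B", "CUDA": "A1", "runs": "B"}
print(tightness_vectors(doc, ids, pos)["GPU"])  # [7, 4, 'A1', 2, 0, 1]
```

A group that occurs N times yields N-1 gaps, so a group that occurs only once carries no tightness coefficients at all.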
According to a specific embodiment of the present invention, the Chinese-foreign segment group free vector dimension determination module for the document to be identified determines the Chinese-foreign segment group free vector dimension WFGFV_TBI from the segmentation result of the document to be identified. When the document to be identified is short or yields few segmentation results, the resulting Chinese-foreign segment group free vector dimension WFGFV_TBI is small; when the document is long or yields many segmentation results, the resulting WFGFV_TBI is large.
According to a specific embodiment of the present invention, the word-segment reduced vector dimension generation module for the document to be identified reduces the word-segment free vector dimension WFV_TBI of the document to be identified and generates the word-segment reduced vector dimension RWV_TBI of the document to be identified. The word-segment reduced vector dimension RWV_TBI is specified by the system. Preferably, the system specifies RWV_TBI as 500. Preferably, the system specifies RWV_TBI as 800. Preferably, the system specifies RWV_TBI as 1000.
According to a specific embodiment of the present invention, the word-segment reduced vector dimension generation module for the document to be identified uses an equal-interval extraction method to reduce the word-segment free vector dimension WFV_TBI of the document to be identified. The reduction proceeds as follows. Determine whether the word-segment free vector dimension WFV_TBI is greater than the word-segment reduced vector dimension RWV_TBI. If it is, divide WFV_TBI by the system-specified RWV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI of the document to be identified. Then, among the feature values corresponding to WFV_TBI, extract one feature value at every interval of REDU_TBI-1 feature values. After all feature values have been extracted, determine whether the number of extracted feature values equals RWV_TBI. If it equals RWV_TBI, the reduction of WFV_TBI is complete. If it is less than RWV_TBI, calculate the difference between RWV_TBI and the number of extracted feature values, then randomly extract that many feature values from those not yet extracted, which completes the reduction of WFV_TBI.
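The equal-interval procedure above can be sketched as follows. This is an illustrative reading rather than the patented code: the function name and the optional `rng` argument are hypothetical, extraction is assumed to start at the first feature value, and a document whose free dimension does not exceed the target is simply returned unchanged (the text handles that case separately).

```python
import math
import random


def reduce_equal_interval(features, target_dim, rng=None):
    """Keep target_dim feature values using equal-interval extraction."""
    rng = rng or random.Random()
    n = len(features)  # plays the role of WFV_TBI
    if n <= target_dim:
        # Not larger than RWV_TBI: nothing to reduce in this sketch.
        return list(features)

    step = math.ceil(n / target_dim)  # reduction coefficient REDU_TBI
    chosen = set(range(0, n, step))   # one value, then skip step-1 values

    shortfall = target_dim - len(chosen)
    if shortfall > 0:
        # Top up at random from the feature values not yet extracted.
        leftover = [i for i in range(n) if i not in chosen]
        chosen.update(rng.sample(leftover, shortfall))

    # Return the kept values in their original order.
    return [features[i] for i in sorted(chosen)]
```

With 1200 feature values and a target of 500, the coefficient is 3, the regular pass keeps 400 values, and the remaining 100 are drawn at random from the 800 values that were skipped.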
According to a specific embodiment of the present invention, the word-segment reduced vector dimension generation module for the document to be identified uses a part-of-speech screening method to reduce the word-segment free vector dimension WFV_TBI of the document to be identified. The reduction proceeds as follows. The feature values are classified according to the part of speech of the corresponding word segment; according to a specific embodiment of the present invention, they are divided into A1-class content-word feature values, A2-class content-word feature values, B-class content-word feature values, C-class content-word feature values, D-class content-word feature values and V-class function-word feature values. Feature values corresponding to content words are generally considered to play a greater role in similarity comparison, and technical-term nouns reflect the effective content of the document to be identified better than common nouns do. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class content-word feature values), AMOUNT_A2 (the number of A2-class content-word feature values), AMOUNT_B (the number of B-class content-word feature values), AMOUNT_C (the number of C-class content-word feature values), AMOUNT_D (the number of D-class content-word feature values) and AMOUNT_V (the number of V-class function-word feature values).
Calculate RWV_TBI_S_V = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If this value is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWV_TBI_S_D = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWV_TBI_S_D is greater than 0, randomly extract from the feature values counted by AMOUNT_V a number of feature values equal to RWV_TBI_S_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWV_TBI_S_C = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWV_TBI_S_C is greater than 0, randomly extract from the feature values counted by AMOUNT_D a number of feature values equal to RWV_TBI_S_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWV_TBI_S_B = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWV_TBI_S_B is greater than 0, randomly extract from the feature values counted by AMOUNT_C a number of feature values equal to RWV_TBI_S_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWV_TBI_S_A2 = RWV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWV_TBI_S_A2 is greater than 0, randomly extract from the feature values counted by AMOUNT_B a number of feature values equal to RWV_TBI_S_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWV_TBI_S_A1 = RWV_TBI-AMOUNT_A1. If RWV_TBI_S_A1 is greater than 0, randomly extract from the feature values counted by AMOUNT_A2 a number of feature values equal to RWV_TBI_S_A1, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract from the feature values counted by AMOUNT_A1 a number of feature values equal to RWV_TBI, which completes this reduction.
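The cascade of differences above amounts to filling the reduced vector class by class in the priority order A1, A2, B, C, D, V and sampling at random only from the class in which the target is reached. The sketch below rewrites it as a single loop under that reading. It is a hedged illustration, not the patented implementation: the function name, the dictionary layout and the `rng` argument are hypothetical, and it assumes that every class ranked above the sampled one is retained in full.

```python
import random

PRIORITY = ["A1", "A2", "B", "C", "D", "V"]  # class labels as named in the text


def reduce_by_part_of_speech(features_by_class, target_dim, rng=None):
    """Return target_dim feature values chosen by class priority, or None.

    features_by_class: hypothetical dict, class label -> list of feature values.
    """
    rng = rng or random.Random()
    total = sum(len(features_by_class.get(c, [])) for c in PRIORITY)
    if total < target_dim:
        # The first difference (the *_S_V value) is positive: the document
        # is too short, so this reduction is abandoned.
        return None

    kept = []
    for label in PRIORITY:
        group = features_by_class.get(label, [])
        remaining = target_dim - len(kept)
        if len(group) <= remaining:
            kept.extend(group)  # the whole class fits
        else:
            # Only part of this class fits: draw the shortfall at random.
            kept.extend(rng.sample(group, remaining))
            break
    return kept
```

For example, with 3 values in A1, 2 in A2, 4 in B and 10 in V and a target of 7, all of A1 and A2 are kept and 2 values are drawn at random from B, which matches the branch where the difference computed over A1 and A2 is positive.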
When the value RWV_TBI_S_V = RWV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or contains little information and is therefore not suitable for comparison using feature values.
When the word-segment free vector dimension WFV_TBI of the document to be identified is less than the word-segment reduced vector dimension RWV_TBI, the document itself has few dimensions. The values in the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded and processed separately.
According to a specific embodiment of the present invention, the segment group reduced vector dimension generation module for the document to be identified reduces the segment group free vector dimension WGFV_TBI of the document to be identified and generates the segment group reduced vector dimension RWGV_TBI of the document to be identified. The segment group reduced vector dimension RWGV_TBI is specified by the system. Preferably, the system specifies RWGV_TBI as 500. Preferably, the system specifies RWGV_TBI as 800. Preferably, the system specifies RWGV_TBI as 1000.
According to a specific embodiment of the present invention, the segment group reduced vector dimension generation module for the document to be identified uses an equal-interval extraction method to reduce the segment group free vector dimension WGFV_TBI of the document to be identified. The reduction proceeds as follows. Determine whether the segment group free vector dimension WGFV_TBI is greater than the segment group reduced vector dimension RWGV_TBI. If it is, divide WGFV_TBI by the system-specified RWGV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI. Then, among the feature values corresponding to WGFV_TBI, extract one feature value at every interval of REDU_TBI-1 feature values. After all feature values have been extracted, determine whether the number of extracted feature values equals RWGV_TBI. If it equals RWGV_TBI, the reduction of WGFV_TBI is complete. If it is less than RWGV_TBI, calculate the difference between RWGV_TBI and the number of extracted feature values, then randomly extract that many feature values from those not yet extracted, which completes the reduction of WGFV_TBI.
According to a specific embodiment of the present invention, the segment group reduced vector dimension generation module for the document to be identified uses a part-of-speech screening method to reduce the segment group free vector dimension WGFV_TBI of the document to be identified. The reduction proceeds as follows. The feature values are classified according to the part of speech of the corresponding segment group; according to a specific embodiment of the present invention, they are divided into A1-class content-word feature values, A2-class content-word feature values, B-class content-word feature values, C-class content-word feature values, D-class content-word feature values and V-class function-word feature values. Feature values corresponding to content words are generally considered to play a greater role in similarity comparison, and technical-term nouns reflect the effective content of the document to be identified better than common nouns do. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class content-word feature values), AMOUNT_A2 (the number of A2-class content-word feature values), AMOUNT_B (the number of B-class content-word feature values), AMOUNT_C (the number of C-class content-word feature values), AMOUNT_D (the number of D-class content-word feature values) and AMOUNT_V (the number of V-class function-word feature values).
Calculate RWGV_TBI_S_V = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If this value is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWGV_TBI_S_D = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWGV_TBI_S_D is greater than 0, randomly extract from the feature values counted by AMOUNT_V a number of feature values equal to RWGV_TBI_S_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWGV_TBI_S_C = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWGV_TBI_S_C is greater than 0, randomly extract from the feature values counted by AMOUNT_D a number of feature values equal to RWGV_TBI_S_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWGV_TBI_S_B = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWGV_TBI_S_B is greater than 0, randomly extract from the feature values counted by AMOUNT_C a number of feature values equal to RWGV_TBI_S_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWGV_TBI_S_A2 = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWGV_TBI_S_A2 is greater than 0, randomly extract from the feature values counted by AMOUNT_B a number of feature values equal to RWGV_TBI_S_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWGV_TBI_S_A1 = RWGV_TBI-AMOUNT_A1. If RWGV_TBI_S_A1 is greater than 0, randomly extract from the feature values counted by AMOUNT_A2 a number of feature values equal to RWGV_TBI_S_A1, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract from the feature values counted by AMOUNT_A1 a number of feature values equal to RWGV_TBI, which completes this reduction.
When the value RWGV_TBI_S_V = RWGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or contains little information and is therefore not suitable for comparison using feature values.
When the segment group free vector dimension WGFV_TBI of the document to be identified is less than the segment group reduced vector dimension RWGV_TBI, the document itself has few dimensions. The values in the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded and processed separately.
According to a specific embodiment of the present invention, the Chinese-foreign segment group reduced vector dimension generation module for the document to be identified reduces the Chinese-foreign segment group free vector dimension WFGFV_TBI of the document to be identified and generates the Chinese-foreign segment group reduced vector dimension RWFGV_TBI of the document to be identified. The Chinese-foreign segment group reduced vector dimension RWFGV_TBI is specified by the system. Preferably, the system specifies RWFGV_TBI as 500. Preferably, the system specifies RWFGV_TBI as 800. Preferably, the system specifies RWFGV_TBI as 1000.
According to a specific embodiment of the present invention, the Chinese-foreign segment group reduced vector dimension generation module for the document to be identified uses an equal-interval extraction method to reduce the Chinese-foreign segment group free vector dimension WFGFV_TBI of the document to be identified. The reduction proceeds as follows. Determine whether the Chinese-foreign segment group free vector dimension WFGFV_TBI is greater than the Chinese-foreign segment group reduced vector dimension RWFGV_TBI. If it is, divide WFGFV_TBI by the system-specified RWFGV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI. Then, among the feature values corresponding to WFGFV_TBI, extract one feature value at every interval of REDU_TBI-1 feature values. After all feature values have been extracted, determine whether the number of extracted feature values equals RWFGV_TBI. If it equals RWFGV_TBI, the reduction of WFGFV_TBI is complete. If it is less than RWFGV_TBI, calculate the difference between RWFGV_TBI and the number of extracted feature values, then randomly extract that many feature values from those not yet extracted, which completes the reduction of WFGFV_TBI.
According to a specific embodiment of the present invention, the Chinese-foreign segment group reduced vector dimension generation module for the document to be identified uses a part-of-speech screening method to reduce the Chinese-foreign segment group free vector dimension WFGFV_TBI of the document to be identified. The reduction proceeds as follows. The feature values are classified according to the part of speech of the corresponding Chinese-foreign segment group; according to a specific embodiment of the present invention, they are divided into A1-class content-word feature values, A2-class content-word feature values, B-class content-word feature values, C-class content-word feature values, D-class content-word feature values and V-class function-word feature values. Feature values corresponding to content words are generally considered to play a greater role in similarity comparison, and technical-term nouns reflect the effective content of the document to be identified better than common nouns do. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class content-word feature values), AMOUNT_A2 (the number of A2-class content-word feature values), AMOUNT_B (the number of B-class content-word feature values), AMOUNT_C (the number of C-class content-word feature values), AMOUNT_D (the number of D-class content-word feature values) and AMOUNT_V (the number of V-class function-word feature values).
Calculate RWFGV_TBI_S_V = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If this value is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWFGV_TBI_S_D = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWFGV_TBI_S_D is greater than 0, randomly extract from the feature values counted by AMOUNT_V a number of feature values equal to RWFGV_TBI_S_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWFGV_TBI_S_C = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWFGV_TBI_S_C is greater than 0, randomly extract from the feature values counted by AMOUNT_D a number of feature values equal to RWFGV_TBI_S_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWFGV_TBI_S_B = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWFGV_TBI_S_B is greater than 0, randomly extract from the feature values counted by AMOUNT_C a number of feature values equal to RWFGV_TBI_S_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWFGV_TBI_S_A2 = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2). If RWFGV_TBI_S_A2 is greater than 0, randomly extract from the feature values counted by AMOUNT_B a number of feature values equal to RWFGV_TBI_S_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further calculate RWFGV_TBI_S_A1 = RWFGV_TBI-AMOUNT_A1. If RWFGV_TBI_S_A1 is greater than 0, randomly extract from the feature values counted by AMOUNT_A2 a number of feature values equal to RWFGV_TBI_S_A1, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract from the feature values counted by AMOUNT_A1 a number of feature values equal to RWFGV_TBI, which completes this reduction.
When the value RWFGV_TBI_S_V = RWFGV_TBI-(AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or contains little information and is therefore not suitable for comparison using feature values.
When the Chinese-foreign segment group free vector dimension WFGFV_TBI of the document to be identified is less than the Chinese-foreign segment group reduced vector dimension RWFGV_TBI, the document itself has few dimensions. The values in the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded and processed separately.
Preferably, to facilitate similarity comparison, the word-segment reduced vector dimension RWV selected in the system for the material should equal the word-segment reduced vector dimension RWV_TBI of the document to be identified; the segment group reduced vector dimension RWGV of the material should equal the segment group reduced vector dimension RWGV_TBI of the document to be identified; and the Chinese-foreign segment group reduced vector dimension RWFGV of the material should equal the Chinese-foreign segment group reduced vector dimension RWFGV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the word-segment feature vector generation module for the document to be identified extracts, from each document to be identified, the feature values corresponding to the word-segment reduced vector dimension RWV_TBI and generates the word-segment feature vector WVE_RWV_TBI of the document to be identified, where
WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_{RWV_TBI}, W_N_{RWV_TBI}]
Here W_ID_i is the unique number of a word segment in the segmentation library and W_N_i is the total number of times that word segment occurs in the document to be identified; this count serves as the feature value of the word segment.
According to a specific embodiment of the present invention, when the user detection mode determination module determines that the current user's detection mode is the ordinary plagiarism identification mode, similarity comparison proceeds as follows. The word-segment feature vector generation module for the document to be identified generates the word-segment feature vector of the document to be identified, WVE_RWV_TBI=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_{RWV_TBI}, W_N_{RWV_TBI}], whose dimension is RWV_TBI. The word-segment feature vector generation module generates the word-segment feature vector of the material in the comparison database, WVE_RWV=[W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_{RWV}, W_N_{RWV}]. The dimension RWV_TBI of the word-segment feature vector of the document to be identified equals the dimension RWV of the word-segment feature vector of the material.
It should be noted that, although both word-segment feature vectors WVE_RWV_TBI and WVE_RWV use W_ID_i to denote the unique number of a word segment in the segmentation library and W_N_i to denote the total number of times that word segment occurs in the corresponding document, with this count serving as the feature value of the word segment, the W_ID_i values in WVE_RWV_TBI are very likely to differ from the W_ID_i values in WVE_RWV. Therefore, before similarity comparison, the dimensions of the two word-segment feature vectors must be adjusted so that they are consistent.
According to a specific embodiment of the present invention, the feature vector adjustment module for the document to be identified arranges the W_ID_i values corresponding to all feature values in the word-segment feature vector WVE_RWV_TBI in ascending or descending order of their numbers in the segmentation library and inserts the missing W_ID_i values; the feature value corresponding to each inserted word-segment number W_ID_i is 0. Assuming that the total number of word-segment numbers in the segmentation library is W, the number of word-segment numbers to be inserted is W-RWV_TBI. This yields the extended word-segment feature vector of the document to be identified:
WVE_RWV_TBI_EXT=[W_ID_{TBI_EXT_1}, W_N_{TBI_EXT_1}, ..., W_ID_{TBI_EXT_i}, W_N_{TBI_EXT_i}, ..., W_ID_{TBI_EXT_RWV_TBI}, W_N_{TBI_EXT_RWV_TBI}, ..., W_ID_W, W_N_W].
According to a specific embodiment of the present invention, the material feature vector adjustment module arranges the W_ID_i values corresponding to all feature values in the word-segment feature vector WVE_RWV in ascending or descending order of their numbers in the segmentation library and inserts the missing W_ID_i values; the feature value corresponding to each inserted word-segment number W_ID_i is 0. Assuming that the total number of word-segment numbers in the segmentation library is W, the number of word-segment numbers to be inserted is W-RWV. This yields the extended word-segment feature vector:
WVE_RWV_EXT=[W_ID_{EXT_1}, W_N_{EXT_1}, ..., W_ID_{EXT_i}, W_N_{EXT_i}, ..., W_ID_{EXT_RWV}, W_N_{EXT_RWV}, ..., W_ID_W, W_N_W].
In this way, the word-segment feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the segmentation library, so that corresponding feature values of the two word-segment feature vectors occupy the same dimension.
The common plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database according to a similarity formula (the formula is not reproduced in this text).
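The alignment step described above can be illustrated with a minimal Python sketch. Because the similarity formula itself does not survive in this text, the sketch assumes cosine similarity purely as a stand-in; the function names `extend_vector` and `cosine_similarity` are hypothetical. It also stores only the counts W_N in ID order, rather than the interleaved [W_ID, W_N, ...] layout used in the description.

```python
import math

def extend_vector(sparse, total_ids):
    """Expand {word_id: count} to a dense list ordered by ascending word ID.

    IDs 1..total_ids that are absent from `sparse` get a feature value of 0,
    mirroring the insertion of missing W_ID values described in the text.
    """
    return [sparse.get(word_id, 0) for word_id in range(1, total_ids + 1)]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (an assumed stand-in
    for the similarity formula, which is not available in the source)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

W = 6                                  # total word-segment IDs in the library
doc = {2: 3, 5: 1}                     # WVE_RWV_TBI as {W_ID: W_N}
material = {2: 1, 3: 2, 5: 1}          # WVE_RWV as {W_ID: W_N}
doc_ext = extend_vector(doc, W)        # [0, 3, 0, 0, 1, 0]
mat_ext = extend_vector(material, W)   # [0, 1, 2, 0, 1, 0]
similarity = cosine_similarity(doc_ext, mat_ext)
```

After extension both vectors have length W, so the same index always refers to the same word-segment ID, which is the property the comparison relies on.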
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current user detection mode is the extended plagiarism identification mode, then for the similarity comparison the to-be-identified document word-segment group feature vector generation module generates the word-segment group feature vector of the document to be identified, WVE_RWGV_TBI = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV_TBI, WG_N_RWGV_TBI], whose dimension is RWGV_TBI. The word-segment group feature vector generation module generates the word-segment group feature vector of the material in the comparison database, WVE_RWGV = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV, WG_N_RWGV]. Here WG_ID_i represents the unique number of the word-segment group in the segmentation library, and WG_N_i represents the total number of times the word-segment group occurs in the document to be identified; this count is used as the feature value of the word-segment group. The dimension RWGV_TBI of the word-segment group feature vector of the document to be identified is equal to the dimension RWGV of the word-segment group feature vector of the material.
Similarly to the processing in the common plagiarism identification mode, according to a specific embodiment of the present invention, in the extended plagiarism identification mode the to-be-identified document feature vector adjusting module adjusts the vector to obtain the extended word-segment group feature vector of the document to be identified, WVE_RWGV_TBI_EXT = [WG_ID_TBI_EXT_1, WG_N_TBI_EXT_1, ..., WG_ID_TBI_EXT_i, WG_N_TBI_EXT_i, ..., WG_ID_TBI_EXT_RWGV_TBI, WG_N_TBI_EXT_RWGV_TBI, ..., WG_ID_W, WG_N_W]; the material feature vector adjusting module adjusts the vector to obtain the extended word-segment group feature vector of the material, WVE_RWGV_EXT = [WG_ID_EXT_1, WG_N_EXT_1, ..., WG_ID_EXT_i, WG_N_EXT_i, ..., WG_ID_EXT_RWGV, WG_N_EXT_RWGV, ..., WG_ID_W, WG_N_W].
In this way, the word-segment group feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the segmentation library, so that corresponding feature values of the two feature vectors occupy the same dimension.
The extended plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database according to a similarity formula (the formula is not reproduced in this text).
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current user detection mode is the multilingual plagiarism identification mode, then for the similarity comparison the to-be-identified document Chinese-foreign word-segment group feature vector generation module generates the Chinese-foreign word-segment group feature vector of the document to be identified, WVE_RWFGV_TBI = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV_TBI, WFG_N_RWFGV_TBI], whose dimension is RWFGV_TBI. The word-segment group feature vector generation module generates the Chinese-foreign word-segment group feature vector of the material in the comparison database, WVE_RWFGV = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV, WFG_N_RWFGV]. Here WFG_ID_i represents the unique number of the Chinese-foreign word-segment group in the segmentation library, and WFG_N_i represents the total number of times the Chinese-foreign word-segment group occurs in the document to be identified; this count is used as the feature value of the Chinese-foreign word-segment group. The dimension RWFGV_TBI of the Chinese-foreign word-segment group feature vector of the document to be identified is equal to the dimension RWFGV of the Chinese-foreign word-segment group feature vector of the material.
Similarly to the processing in the common plagiarism identification mode, according to a specific embodiment of the present invention, in the multilingual plagiarism identification mode the to-be-identified document feature vector adjusting module adjusts the vector to obtain the extended Chinese-foreign word-segment group feature vector of the document to be identified, WVE_RWFGV_TBI_EXT = [WFG_ID_TBI_EXT_1, WFG_N_TBI_EXT_1, ..., WFG_ID_TBI_EXT_i, WFG_N_TBI_EXT_i, ..., WFG_ID_TBI_EXT_RWFGV_TBI, WFG_N_TBI_EXT_RWFGV_TBI, ..., WFG_ID_W, WFG_N_W]; the material feature vector adjusting module adjusts the vector to obtain the extended Chinese-foreign word-segment group feature vector of the material, WVE_RWFGV_EXT = [WFG_ID_EXT_1, WFG_N_EXT_1, ..., WFG_ID_EXT_i, WFG_N_EXT_i, ..., WFG_ID_EXT_RWFGV, WFG_N_EXT_RWFGV, ..., WFG_ID_W, WFG_N_W].
In this way, the Chinese-foreign word-segment group feature vectors of the document to be identified and of the material in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the segmentation library, so that corresponding feature values of the two feature vectors occupy the same dimension.
The multilingual plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database according to a similarity formula (the formula is not reproduced in this text).
According to a specific embodiment of the present invention, to keep the extended dimension from becoming too large, all word-segment IDs in the word-segment feature vector WVE_RWV_TBI may instead be taken as one set and the word-segment IDs in WVE_RWV as another set; or all word-segment group IDs in the word-segment group feature vector WVE_RWGV_TBI as one set and those in WVE_RWGV as another set; or all Chinese-foreign word-segment group IDs in the Chinese-foreign word-segment group feature vector WVE_RWFGV_TBI as one set and those in WVE_RWFGV as another set. The union of the two sets gives the total ID set. The feature vectors of the document to be identified and of the material in the comparison database are then extended according to the total ID set: the IDs corresponding to all feature values are arranged in ascending or descending order of their numbers in the segmentation library, and each vector has inserted into it the W_ID_i values that are contained in the total word-segment ID set but not in its own original set, each inserted word-segment number W_ID_i having a feature value of 0; or the WG_ID_i values that are contained in the total word-segment group ID set but not in its own original set, each inserted word-segment group number WG_ID_i having a feature value of 0; or the WFG_ID_i values that are contained in the total Chinese-foreign word-segment group ID set but not in its own original set, each inserted Chinese-foreign word-segment group number WFG_ID_i having a feature value of 0.
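The union-based variant can be sketched in a few lines of Python. This is only an illustration under assumed data structures: each feature vector is held as a dictionary from ID to count, and `align_on_union` is a hypothetical helper name.

```python
def align_on_union(doc, material):
    """Align two sparse {id: count} vectors on the union of their IDs.

    Returns the sorted total ID set and the two count vectors, with 0 filled
    in for every ID that a vector's own original set did not contain.
    """
    total_ids = sorted(set(doc) | set(material))   # ascending order of IDs
    doc_ext = [doc.get(i, 0) for i in total_ids]
    mat_ext = [material.get(i, 0) for i in total_ids]
    return total_ids, doc_ext, mat_ext

# The library may hold many thousands of IDs, but only three appear here,
# so the aligned vectors have dimension 3 instead of W.
ids, doc_ext, mat_ext = align_on_union({7: 2, 120: 1}, {7: 1, 9000: 4})
```

The aligned dimension is bounded by the number of distinct IDs in the two texts being compared, which is usually far smaller than the full library size W.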
Depending on the user's access mode, materials from different sub-libraries of the comparison database are provided for similarity comparison. The comparison is performed by traversal: the feature vectors of all materials within the selected scope are extracted and compared for similarity with the document to be identified. Each calculated similarity value is compared with a predetermined threshold, and when the similarity value is higher than the predetermined threshold, the corresponding material is recorded as a suspected material for later use.
After the document to be identified has been compared with all materials, all suspected materials are extracted and the document to be identified is further compared with them.
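The traversal and threshold screening just described might look like the following Python sketch. The similarity function is passed in as a parameter; `overlap_similarity` is a hypothetical stand-in used only so the example runs, and it is not the formula of this document.

```python
def overlap_similarity(doc, material):
    """Illustrative stand-in: share of the document's word occurrences
    that are also present in the material (assumed, not from the source)."""
    total = sum(doc.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, material.get(word_id, 0)) for word_id, count in doc.items())
    return shared / total

def screen_suspects(doc, materials, similarity, threshold):
    """Traverse every material in the selected scope and keep those whose
    similarity to the document is higher than the predetermined threshold."""
    suspects = []
    for material_id, vector in materials.items():
        score = similarity(doc, vector)
        if score > threshold:
            suspects.append((material_id, score))
    return suspects

doc = {1: 2, 2: 2}
materials = {"m1": {1: 2, 2: 1}, "m2": {3: 5}}
suspects = screen_suspects(doc, materials, overlap_similarity, threshold=0.5)
```

Only the materials returned by `screen_suspects` would go on to the more expensive full-text comparison.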
According to a preferred embodiment of the invention, all materials in the proverb and common-saying library, the famous-quotation library and the poetry library may be selected as suspected materials.
According to a preferred embodiment of the invention, materials whose word-segment free vector dimension WFV is smaller than the simplified word-segment vector dimension RWV may be selected as suspected materials.
According to a preferred embodiment of the invention, materials whose word-segment group free vector dimension WGFV is smaller than the simplified word-segment group vector dimension RWGV may be selected as suspected materials.
According to a preferred embodiment of the invention, materials whose Chinese-foreign word-segment group free vector dimension WFGFV is smaller than the simplified Chinese-foreign word-segment group vector dimension RWFGV may be selected as suspected materials.
According to a preferred embodiment of the invention, suspected materials may be further selected by means of the word-segment closeness coefficient.
According to a specific embodiment of the present invention, in the common plagiarism identification mode, suspected materials may be screened according to the word-segment closeness coefficient of the document to be identified and that of the material. The to-be-identified document closeness coefficient statistics module extracts high-density word segments and their positions according to the word-segment closeness coefficient feature vector corresponding to each word segment in the document to be identified, WGCVE_TBI = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)]. Based on the part of speech W_CHAR in the closeness coefficient feature vector, the module selects the word segments that are notional words and computes the total number of intervening word segments across a predetermined number n of adjacent occurrences. When this total is less than the predetermined closeness threshold TH_G, the ID of the word segment and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the extended plagiarism identification mode, suspected materials may be screened according to the word-segment group closeness coefficient of the document to be identified and that of the material. The to-be-identified document closeness coefficient statistics module extracts high-density word-segment groups and their positions according to the closeness coefficient feature vector corresponding to each word-segment group in the document to be identified, WGGCVE_TBI = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)]. Based on the part of speech WG_CHAR in the word-segment group closeness coefficient feature vector, the module selects the word-segment groups that are notional words and computes the total number of intervening word segments across a predetermined number n of adjacent word-segment groups. When this total is less than the predetermined closeness threshold TH_G, the ID of the word-segment group and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the multilingual plagiarism identification mode, suspected materials may be screened according to the Chinese-foreign word-segment group closeness coefficient of the document to be identified and that of the material. The to-be-identified document closeness coefficient statistics module extracts high-density word-segment groups and their positions according to the closeness coefficient feature vector corresponding to each Chinese-foreign word-segment group in the document to be identified, WFGGCVE_TBI = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)]. Based on the part of speech WFG_CHAR in the Chinese-foreign word-segment group closeness coefficient feature vector, the module selects the word-segment groups that are notional words and computes the total number of intervening word segments across a predetermined number n of adjacent word-segment groups. When this total is less than the predetermined closeness threshold TH_G, the ID of the Chinese-foreign word-segment group and the corresponding position are recorded.
The predetermined adjacent quantity n and the closeness threshold TH_G are preset by the system and can be adjusted as needed. When the total number of intervening word segments across the predetermined number of adjacent occurrences is less than the predetermined closeness threshold TH_G, the notional word can be considered to occur densely at that position; the passage may be concentrating on a particular viewpoint and deserves close attention.
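As a rough illustration of the closeness test, the Python sketch below works from the positions at which one notional word occurs. The summation formula is not available in this text, so two things are assumptions: each gap G_i is taken to be the number of word segments lying between consecutive occurrences, and a run of n adjacent occurrences is taken to span n - 1 gaps. The helper names are hypothetical.

```python
def gaps_from_positions(positions):
    """Number of word segments between each pair of consecutive occurrences
    (assumed meaning of G_W_ID_1 ... G_W_ID_(W_N-1))."""
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def dense_runs(positions, n, th_g):
    """Return (start_position, gap_total) for every run of n adjacent
    occurrences whose total gap is below the closeness threshold th_g."""
    gaps = gaps_from_positions(positions)
    hits = []
    for start in range(len(positions) - n + 1):
        total = sum(gaps[start:start + n - 1])   # n occurrences span n - 1 gaps
        if total < th_g:
            hits.append((positions[start], total))
    return hits

# One word occurring at these token offsets; it clusters near 3 and near 40.
occurrences = [3, 5, 6, 40, 41, 43]
clusters = dense_runs(occurrences, n=3, th_g=5)
```

With these inputs the gaps are [1, 0, 33, 0, 1], so the runs starting at positions 3 and 40 fall under the threshold while the two runs that straddle the long gap do not.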
In the common plagiarism identification mode, the closeness coefficient suspected material extraction module takes the word-segment IDs that were recorded when the total number of intervening word segments across the predetermined number of adjacent occurrences was less than the predetermined closeness threshold TH_G, and extracts all materials in the comparison database that contain those word-segment IDs. For each such material it computes the word-segment closeness coefficient feature vector corresponding to the word-segment ID, WGCVE = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)], and the total number of intervening word segments across the predetermined number n of adjacent occurrences. When this total is less than the predetermined closeness threshold TH_G, the material is selected as a suspected material. There may be one or more recorded word-segment IDs, and correspondingly one or more materials containing them may be extracted.
In the extended plagiarism identification mode, the closeness coefficient suspected material extraction module takes the word-segment group IDs that were recorded when the total number of intervening word segments across the predetermined number of adjacent word-segment groups was less than the predetermined closeness threshold TH_G, and extracts all materials in the comparison database that contain those word-segment group IDs. For each such material it computes the word-segment group closeness coefficient feature vector corresponding to the word-segment group ID, WGGCVE = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)], and the total number of intervening word segments across the predetermined number n of adjacent word-segment groups. When this total is less than the predetermined closeness threshold TH_G, the material is selected as a suspected material. There may be one or more recorded word-segment group IDs, and correspondingly one or more materials containing them may be extracted.
In the multilingual plagiarism identification mode, the closeness coefficient suspected material extraction module takes the Chinese-foreign word-segment group IDs that were recorded when the total number of intervening word segments across the predetermined number of adjacent Chinese-foreign word-segment groups was less than the predetermined closeness threshold TH_G, and extracts all materials in the comparison database that contain those Chinese-foreign word-segment group IDs. For each such material it computes the Chinese-foreign word-segment group closeness coefficient feature vector corresponding to the Chinese-foreign word-segment group ID, WFGGCVE = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)], and the total number of intervening word segments across the predetermined number n of adjacent Chinese-foreign word-segment groups. When this total is less than the predetermined closeness threshold TH_G, the material is selected as a suspected material. There may be one or more recorded Chinese-foreign word-segment group IDs, and correspondingly one or more materials containing them may be extracted.
With this extraction method, notional words that do not occur often in the document to be identified overall, but that may be concentrated at certain positions, can be extracted together with their positions for further comparison.
According to a specific embodiment of the present invention, in the formula plagiarism identification mode, the formula extraction module extracts the formulas in the document to be identified. The formula decomposition module extracts, for each formula, its independent variable parameters and dependent variable parameters, its operators, and the specific meaning, dimension and value range of each parameter. The formula comparison module compares these items, as extracted from each formula in the document to be identified, one by one with the corresponding items of the formulas stored in the formula library. When the coincidence degree between the independent variable parameters, dependent variable parameters, operators, dimensions and value ranges of a formula in the document to be identified and those of a formula stored in the formula library exceeds the formula comparison threshold TH_MATH, the material associated with the formula currently being compared in the formula library is taken as a suspected material. The coincidence degree is the ratio of the total number of independent variable parameters, dependent variable parameters, operators and dimensions that are identical between the formula in the document to be identified and the formula in the formula library to the total number of independent variable parameters, dependent variable parameters, operators and dimensions of the current formula in the document to be identified.
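The coincidence degree defined above can be sketched in Python as follows. The representation is an assumption made for illustration: each decomposed formula is a dictionary of sets keyed by hypothetical field names, so repeated operators are counted once, which a real decomposition might handle differently.

```python
FIELDS = ("independent", "dependent", "operators", "dimensions")

def coincidence_degree(doc_formula, library_formula):
    """Ratio of items shared with the library formula to the number of
    items in the document's formula, summed over the four fields."""
    same = sum(len(doc_formula[f] & library_formula[f]) for f in FIELDS)
    total = sum(len(doc_formula[f]) for f in FIELDS)
    return same / total if total else 0.0

doc_formula = {
    "independent": {"m", "a"},
    "dependent": {"F"},
    "operators": {"=", "*"},
    "dimensions": {"kg", "m/s^2", "N"},
}
library_formula = {
    "independent": {"m", "a"},
    "dependent": {"F"},
    "operators": {"=", "*"},
    "dimensions": {"kg", "m/s^2"},
}
degree = coincidence_degree(doc_formula, library_formula)   # 7 shared of 8
```

A material would then be flagged when `degree` exceeds the system-defined threshold TH_MATH.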
According to a specific embodiment of the present invention, the document to be identified and a suspected material can be compared in full text using sliding windows. The size of the sliding window can be configured by the system and directly affects the comparison: a window that is too small easily causes false positives, and a window that is too large easily causes misses. The sliding step of the sliding window is also preset by the system. As shown in Fig. 2:
Step S0: Start.
Step S1: The sliding window setting module initializes the similar-window counter CT1 = 0 and the sliding-step counter CT2 = 0.
Step S2: The sliding window setting module places the sliding windows of the document to be identified and of the suspected material at their respective document start positions.
Step S3: The sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspected material and counts the identical notional words in them.
Step S4: The sliding window comparison module determines whether the number of identical notional words is greater than or equal to the threshold TH_W. If so, the counter is incremented, i.e. CT1 = CT1 + 1, and the current positions of the sliding window of the document to be identified and of the sliding window of the suspected material, together with the contents of the windows, are recorded.
Step S5: The sliding window setting module slides the sliding window of the suspected material by one sliding step.
Step S6: The sliding window setting module determines whether the sliding window of the suspected material is at the document end position. If not, return to step S3; if so, go to step S11.
Step S11: The sliding window setting module determines whether the sliding window of the document to be identified is at the document end position. If not, go to step S12; if so, go to step S13.
Step S12: The sliding window setting module returns the sliding window of the suspected material to the document start position and slides the sliding window of the document to be identified by one sliding step; CT2 = CT2 + 1; go to step S3.
Step S13: The sliding window comparison module calculates the ratio M of the value of the similar-window counter CT1 to the value of the sliding-step counter CT2.
Step S14: The sliding window comparison module determines whether the ratio M is greater than or equal to the predetermined threshold TH_M. When M >= TH_M, the document to be identified is considered similar to the suspected material; when M < TH_M, the document to be identified is considered dissimilar to the suspected material.
Step S15: The sliding window comparison module determines whether any suspected materials remain to be compared. If so, return to step S1; if not, go to step S16.
Step S16: The comparison report generation module generates and outputs a comparison report. For the identified document and every similar suspected material, the report contains the value of the similar-window counter CT1, the value of the sliding-step counter CT2, the ratio of the two, and the specific location and specific content of the similar portions of the identified document and the similar suspected material.
Step S17: The comparison ends.
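The flow of steps S1 to S14 can be approximated by the Python sketch below. It is a simplified reading, not the module structure of this document: both texts are assumed to be lists of notional-word IDs or strings, every pair of window positions is compared (including the final positions), identical words are counted without regard to order, and when CT2 is 0 the ratio falls back to CT1 to avoid division by zero. All function names are hypothetical.

```python
from collections import Counter

def windows(tokens, size, step):
    """Yield (start, window) for starts 0, step, 2*step, ... up to the last
    start at which a full window still fits inside `tokens`."""
    last_start = max(len(tokens) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, tokens[start:start + size]

def shared_count(a, b):
    """Number of identical items in two windows, ignoring their order."""
    return sum((Counter(a) & Counter(b)).values())

def compare(doc, suspect, size=4, step=1, th_w=3):
    ct1 = 0        # similar-window counter (CT1)
    ct2 = 0        # sliding-step counter of the document window (CT2)
    matches = []   # recorded (document_start, suspect_start) positions
    for index, (d_start, d_win) in enumerate(windows(doc, size, step)):
        if index > 0:
            ct2 += 1                       # step S12: document window slid once
        for s_start, s_win in windows(suspect, size, step):
            if shared_count(d_win, s_win) >= th_w:   # step S4
                ct1 += 1
                matches.append((d_start, s_start))
    m = ct1 / ct2 if ct2 else float(ct1)   # step S13, guarded for ct2 == 0
    return ct1, ct2, m, matches

doc = "a b c d e f".split()
suspect = "x a b c d y".split()
ct1, ct2, m, matches = compare(doc, suspect)
```

Because CT1 counts every matching window pair while CT2 counts only the steps of the document window, the ratio M in this literal reading is not bounded by 1, so the threshold TH_M would need to be chosen on the same scale.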
According to a specific embodiment of the present invention, in step S3 the sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspected material and counts the identical notional words in them. In the common plagiarism identification mode, identical notional words are notional word segments that have the same ID in the segmentation library. In the extended plagiarism identification mode, identical notional words are notional word-segment groups that have the same ID in the segmentation library. In the multilingual plagiarism identification mode, identical notional words are notional Chinese-foreign word-segment groups that have the same ID in the segmentation library.
According to a specific embodiment of the present invention, in step S16 the comparison report generation module outputs the comparison report, and the content of the report differs according to the identification mode. In the common plagiarism identification mode, the comparison report contains the specific location and specific content of the similar portions of the document to be identified and the similar suspected material. In this case the document to be identified uses the same form of expression as the similar portion of the suspected material, and the wording is also fully consistent, possibly with only the order of individual words adjusted. If the identified document has rewritten the document it plagiarized, and the degree of rewriting is large, the common plagiarism identification mode may fail to find the plagiarized document. In the extended plagiarism identification mode, the comparison report contains the specific location and specific content of the similar portions of the document to be identified and the similar suspected material. If the identified document has rewritten the plagiarized document using synonyms or near-synonyms, and the document structure has not been changed much, the extended plagiarism identification mode may still find the plagiarized document. In the multilingual plagiarism identification mode, the comparison report contains the specific location and specific content of the similar portions of the document to be identified and the similar suspected material. If the identified document has rewritten the plagiarized document by translation, and the document structure has not been changed much, the multilingual plagiarism identification mode may still find the plagiarized document.
According to a specific embodiment of the present invention, the sliding window being at the document start position means that the leftmost edge of the sliding window coincides with the document start position; the sliding window being at the document end position means that the rightmost edge of the sliding window coincides with the document end position.
According to preliminary running tests of the system, a sliding window size of four notional word segments is suitable, although other sizes can be selected as needed. During comparison the sliding window slides by a step of one notional word segment each time. When three or more notional word segments in the sliding windows are identical during the comparison (regardless of the order of the notional word segments), the current positions and contents of the sliding windows in the document to be identified and in the suspected material are recorded.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of a preferred embodiment, that embodiment is not intended to limit the invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change or modification made to the above embodiment in accordance with the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.
Claims (10)
- 1. A paper self-checking system, characterized by comprising a user detection mode determining module and a user writing style test module, wherein: the user detection mode determining module is used to determine that the current user detection mode is the self-audit mode; the user writing style test module provides the user with one or more test pictures, and the user writes online, within a specified time, a text description of each test picture of no fewer than a prescribed number of words, each test picture having a test picture reference feature vector; the test picture reference feature vector is obtained by randomly selecting a predetermined number of benchmark testers from groups with different backgrounds, having each of them write a description of no fewer than the prescribed number of words for a specific test picture, collecting all the text descriptions, computing the test picture text description feature values for the same test picture, calculating feature vectors from the test picture text description feature values, and weighting the feature vectors to obtain the test picture reference feature vector of the specific test picture, the weights in the weighting operation being set by the system; a test picture text description feature value generation module obtains the test picture description texts of the benchmark testers and generates test picture text description feature values; the test picture text description feature value generation module generates a test picture text description feature vector from the test picture text description feature values; when the dimension of the test picture text description feature vector is n, it is expressed as TPCVE = [TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the first item value in the test picture text description feature vector, TPC_m is the m-th item value, and TPC_n is the n-th item value; a test picture reference feature vector generation module collects the test picture text description feature vectors for the same test picture and weights them to obtain the reference feature vector of the specific test picture, the weights used in the weighting operation being set by the system; in the expression of the reference feature vector of the specific test picture, TPCVE_ID represents the test picture reference feature vector numbered ID, k is the number of benchmark testers, TPC_1_i represents the first item value of the feature vector of the i-th benchmark tester, TPC_m_i represents the m-th item value of the feature vector of the i-th benchmark tester, TPC_n_i represents the n-th item value of the feature vector of the i-th benchmark tester, W_1,i is the weighting coefficient of TPC_1_i, W_m,i is the weighting coefficient of TPC_m_i, and W_n,i is the weighting coefficient of TPC_n_i; a user test picture text description feature value generation module obtains the user's test picture description text and generates user test picture text description feature values; a user test picture text description feature vector generation module calculates the user test picture text description feature vector from the user test picture text description feature values; when the dimension of the test picture text description feature vector is n, the feature vector of the text description written by the current user USER for the picture numbered ID is expressed as TPCVE_ID_USER = [TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER]; and a user picture writing style feature vector generation module calculates the difference between the user test picture text description feature vector TPCVE_ID_USER and the test picture reference feature vector TPCVE_ID corresponding to the test picture, and takes the difference TPCVE_ID_USER - TPCVE_ID as the user picture writing style feature vector TPCVE_USER.
- 2. The paper self-checking system according to claim 1, wherein TPC_1_USER is the first entry of the current user USER's user test picture text description feature vector, TPC_m_USER is the m-th entry of that vector, and TPC_n_USER is the n-th entry of that vector.
- 3. The paper self-checking system according to claim 1 or 2, wherein the test picture text description feature vector includes one or more of the following: the ratio of the Chinese character count to the total word count; the ratio of the foreign-language word count to the total word count; the ratio of the content word count to the total word count; the ratio of the function word count to the total word count; the ratio of the total word count to the paragraph count; the word count of the longest paragraph; the ratio of the number of synonym and near-synonym expansions to the total word count; the ratio of the punctuation mark count to the total word count; the ratio of the noun count to the total word count; the ratio of the verb count to the total word count; the ratio of the adjective count to the total word count; the ratio of the numeral count to the total word count; the ratio of the measure word count to the total word count; the ratio of the pronoun count to the total word count; the ratio of the adverb count to the total word count; the ratio of the preposition count to the total word count; the ratio of the conjunction count to the total word count; the ratio of the auxiliary word count to the total word count; the ratio of the interjection count to the total word count; and the ratio of the onomatopoeia count to the total word count.
- 4. The paper self-checking system according to claim 3, wherein the user detection mode determining module is further configured to prompt the user to upload a pending document; a pending document feature value generation module is configured to generate the pending document feature values of the pending document; a pending document feature vector generation module generates a pending document feature vector from the pending document feature values; and the dimension of the pending document feature vector, together with the specific content and order of its entries, must remain consistent with the dimension of the test picture reference feature vector and the test article reference feature vector and with the meaning and order of the feature values therein.
- 5. The paper self-checking system according to claim 4, wherein a user writing style similarity calculation module is configured to calculate the current user's writing style similarity by the following formula: [formula not reproduced in this text]
  a user writing style similarity judging module compares the current user's writing style similarity SimT(USER) with the self-audit threshold preset by the system; when the user writing style similarity SimT(USER) is higher than the self-audit threshold, the pending document submitted by the current user is deemed inconsistent with the user's writing style; when the user writing style similarity SimT(USER) is lower than the self-audit threshold, the pending document submitted by the current user is deemed consistent with the user's writing style.
- 6. A paper self-checking method, characterized by comprising:
  a user detection mode determining module determines that the current user's detection mode is the self-audit mode;
  a user writing style test module presents the user with one or more test pictures, and the user writes online, within a specified time, a text description of each test picture of no fewer than a prescribed number of words; each test picture has a test picture reference feature vector;
  the test picture reference feature vector is obtained as follows: a predetermined number of benchmark testers are randomly selected from people of different backgrounds, and each describes a specific test picture in no fewer than the prescribed number of words; all the text descriptions are collected, the test picture text description feature values for the same test picture are computed, feature vectors are calculated from those feature values, and the feature vectors are weighted to obtain the test picture reference feature vector of the specific test picture; the weights in the weighting operation are set by the system;
  a test picture text description feature value generation module obtains the benchmark testers' test picture description texts and generates test picture text description feature values;
  the test picture text description feature value generation module generates a test picture text description feature vector from the test picture text description feature values; when the dimension of the test picture text description feature vector is n, it is expressed as TPCVE = [TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the first entry of the test picture text description feature vector, TPC_m is its m-th entry, and TPC_n is its n-th entry;
  a test picture reference feature vector generation module collects the test picture text description feature vectors for the same test picture and weights them to obtain the reference feature vector of the specific test picture, the weights used in the weighting operation being set by the system; the reference feature vector of the specific test picture is expressed as: [formula not reproduced in this text]
  where TPCVE_ID denotes the test picture reference feature vector of the picture numbered ID; k is the number of benchmark testers; TPC_1_i denotes the first entry of the feature vector of the i-th benchmark tester; TPC_m_i denotes the m-th entry of the feature vector of the i-th benchmark tester; TPC_n_i denotes the n-th entry of the feature vector of the i-th benchmark tester; W_1,i is the weighting coefficient of TPC_1_i; W_m,i is the weighting coefficient of TPC_m_i; W_n,i is the weighting coefficient of TPC_n_i;
  a user test picture text description feature value generation module obtains the user's test picture description text and generates user test picture text description feature values;
  a user test picture text description feature vector generation module calculates the user test picture text description feature vector from the user test picture text description feature values; when the dimension of the test picture text description feature vector is n, the feature vector of the current user USER's text description of the picture numbered ID is expressed as TPCVE_ID_USER = [TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER];
  a user picture writing style feature vector generation module calculates the difference between the user test picture text description feature vector TPCVE_ID_USER and the test picture reference feature vector TPCVE_ID corresponding to that test picture, and takes the difference TPCVE_ID_USER - TPCVE_ID as the user picture writing style feature vector TPCVE_USER.
- 7. The paper self-checking method according to claim 6, wherein TPC_1_USER is the first entry of the current user USER's user test picture text description feature vector, TPC_m_USER is the m-th entry of that vector, and TPC_n_USER is the n-th entry of that vector.
- 8. The paper self-checking method according to claim 6 or 7, wherein the test picture text description feature vector includes one or more of the following: the ratio of the Chinese character count to the total word count; the ratio of the foreign-language word count to the total word count; the ratio of the content word count to the total word count; the ratio of the function word count to the total word count; the ratio of the total word count to the paragraph count; the word count of the longest paragraph; the ratio of the number of synonym and near-synonym expansions to the total word count; the ratio of the punctuation mark count to the total word count; the ratio of the noun count to the total word count; the ratio of the verb count to the total word count; the ratio of the adjective count to the total word count; the ratio of the numeral count to the total word count; the ratio of the measure word count to the total word count; the ratio of the pronoun count to the total word count; the ratio of the adverb count to the total word count; the ratio of the preposition count to the total word count; the ratio of the conjunction count to the total word count; the ratio of the auxiliary word count to the total word count; the ratio of the interjection count to the total word count; and the ratio of the onomatopoeia count to the total word count.
- 9. The paper self-checking method according to claim 8, wherein the user detection mode determining module is further used to prompt the user to upload a pending document; a pending document feature value generation module is used to generate the pending document feature values of the pending document; a pending document feature vector generation module generates a pending document feature vector from the pending document feature values; and the dimension of the pending document feature vector, together with the specific content and order of its entries, must remain consistent with the dimension of the test picture reference feature vector and the test article reference feature vector and with the meaning and order of the feature values therein.
- 10. The paper self-checking method according to claim 9, wherein a user writing style similarity calculation module is used to calculate the current user's writing style similarity by the following formula: [formula not reproduced in this text]
  a user writing style similarity judging module compares the current user's writing style similarity SimT(USER) with the self-audit threshold preset by the system; when the user writing style similarity SimT(USER) is higher than the self-audit threshold, the pending document submitted by the current user is deemed inconsistent with the user's writing style; when the user writing style similarity SimT(USER) is lower than the self-audit threshold, the pending document submitted by the current user is deemed consistent with the user's writing style.
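
The vector steps recited in the claims above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the reference vector is assumed to be a plain weighted sum over the k benchmark testers (the claimed formula is not reproduced in this text, so any normalisation is taken to be folded into the system-set weights), and the SimT(USER) score is taken as an input because its formula is likewise not reproduced. Only the difference TPCVE_ID_USER - TPCVE_ID and the "higher than the threshold means inconsistent" rule come directly from the claims.

```python
from typing import List, Sequence


def weighted_reference_vector(
    tester_vectors: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
) -> List[float]:
    """Combine k benchmark testers' n-dimensional vectors into one reference vector.

    tester_vectors[i][m] plays the role of entry m of tester i's feature vector,
    and weights[i][m] the role of the weighting coefficient W_m,i.
    ASSUMPTION: entry m of the result is the weighted sum over all testers.
    """
    if not tester_vectors:
        raise ValueError("at least one benchmark tester is required")
    if len(weights) != len(tester_vectors):
        raise ValueError("one weight row is required per benchmark tester")
    n = len(tester_vectors[0])
    for vec, w in zip(tester_vectors, weights):
        if len(vec) != n or len(w) != n:
            raise ValueError("all vectors and weight rows must have dimension n")
    return [
        sum(w[m] * vec[m] for vec, w in zip(tester_vectors, weights))
        for m in range(n)
    ]


def style_vector(
    user_vector: Sequence[float], reference_vector: Sequence[float]
) -> List[float]:
    """Return TPCVE_USER = TPCVE_ID_USER - TPCVE_ID, entry by entry."""
    if len(user_vector) != len(reference_vector):
        raise ValueError("user and reference vectors must have the same dimension")
    return [u - r for u, r in zip(user_vector, reference_vector)]


def judge(sim_t: float, threshold: float) -> str:
    """Apply the claimed rule: a score above the threshold means 'inconsistent'.

    ASSUMPTION: a score exactly equal to the threshold is treated as
    'consistent'; the claims do not address that case.
    """
    return "inconsistent" if sim_t > threshold else "consistent"
```

For instance, with two benchmark testers and equal weights of 0.5, `weighted_reference_vector([[0.5, 0.25], [0.25, 0.75]], [[0.5, 0.5], [0.5, 0.5]])` gives `[0.375, 0.5]`, and a user vector of `[0.5, 0.25]` then yields the style vector `[0.125, -0.25]`.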
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610021493.6A CN105677641B (en) | 2016-01-13 | 2016-01-13 | A kind of paper self checking method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610021493.6A CN105677641B (en) | 2016-01-13 | 2016-01-13 | A kind of paper self checking method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105677641A CN105677641A (en) | 2016-06-15 |
CN105677641B true CN105677641B (en) | 2018-03-16 |
Family
ID=56300443
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610021493.6A Active CN105677641B (en) | 2016-01-13 | 2016-01-13 | A kind of paper self checking method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105677641B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250491A (en) * | 2016-08-01 | 2016-12-21 | 北京金和网络股份有限公司 | The method of article automatization examination & verification and system thereof |
CN110008333A (en) * | 2019-04-16 | 2019-07-12 | 中国农业科学院农田灌溉研究所 | A kind of paper preliminary inquiry evaluation method |
CN110472228B (en) * | 2019-07-10 | 2023-04-07 | 哈尔滨工程大学 | Crack detection method based on author writing style |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103257957A (en) * | 2012-02-15 | 2013-08-21 | 深圳市腾讯计算机系统有限公司 | Chinese word segmentation based text similarity identifying method and device |
CN104239285A (en) * | 2013-06-06 | 2014-12-24 | 腾讯科技(深圳)有限公司 | New article chapter detecting method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040054520A1 (en) * | 2002-07-05 | 2004-03-18 | Dehlinger Peter J. | Text-searching code, system and method |
Non-Patent Citations (1)
Title |
---|
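Research on the application of semantic analysis in the detection of similar Chinese documents; Tan Wenrong et al.; Journal of Sichuan Normal University (Natural Science Edition); 20100731; Vol. 33 (No. 4); pp. 554-558 * |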
Also Published As
Publication number | Publication date |
---|---|
CN105677641A (en) | 2016-06-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105701076A (en) | Thesis plagiarism detection method and system | |
Jayakodi et al. | An automatic classifier for exam questions in Engineering: A process for Bloom's taxonomy | |
Solovyev et al. | Prediction of reading difficulty in Russian academic texts | |
BRPI0913815B1 (en) | computer equipment and method for extracting terms from document data including text segments | |
Sheehan et al. | A two-stage approach for generating unbiased estimates of text complexity | |
CN105701085A (en) | Network duplicate checking method and system | |
CN105677641B (en) | A kind of paper self checking method and system | |
Chamberlain et al. | Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference. | |
Ronan et al. | Determining light verb constructions in contemporary British and Irish English | |
CN110472203A (en) | A kind of duplicate checking detection method, device, equipment and the storage medium of article | |
Argamon | Computational forensic authorship analysis: Promises and pitfalls | |
CN105701086A (en) | Method and system for detecting literature through sliding window | |
Wadud et al. | Text coherence analysis based on misspelling oblivious word embeddings and deep neural network | |
Rahman et al. | NLP-based automatic answer script evaluation | |
Curtotti et al. | Machine learning for readability of legislative sentences | |
Yan et al. | On the robustness of reading comprehension models to entity renaming | |
CN105550172B (en) | A kind of distributed text detection method and system | |
CN105701213B (en) | A kind of document control methods and system | |
Taerungruang et al. | Constructing an Academic Thai Plagiarism Corpus for Benchmarking Plagiarism Detection Systems. | |
Bian et al. | Detecting spam game reviews on steam with a semi-supervised approach | |
Wieling et al. | Hierarchical spectral partitioning of bipartite graphs to cluster dialects and identify distinguishing features | |
CN105701206B (en) | A kind of document detection method and system based on sampling | |
Chaturvedi et al. | Detecting fake news using machine learning algorithms | |
CN105701077A (en) | Multi-language literature detection method and system | |
Shrestha | Detecting Fake News with Sentiment Analysis and Network Metadata |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||