CN110019674A - Text plagiarism detection method and system
- Publication number: CN110019674A
- Application number: CN201711167027.XA
- Authority: CN (China)
- Prior art keywords: text, sentence, fingerprint, plagiarism, module
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text plagiarism detection method and system. By deleting short sentences and using truncated character fingerprints, the method reduces both the number and the length of the sentence fingerprints that are extracted. Sentence fingerprints are extracted after personal names, place names, organization names, times and other redundant information have been deleted from each sentence, so plagiarized content with minor modifications (for example, changed personal names, place names or organization names) can still be detected accurately, which improves robustness. Compared with traditional text plagiarism detection methods, the technical solution provided by the invention greatly reduces the amount of computation and improves detection speed. It is better suited to quickly retrieving, from a massive collection (on the order of hundreds of millions) of original texts, the passages in which a text under detection is identical or similar to copyrighted original texts, and to outputting all of the texts it plagiarizes together with the corresponding degree of plagiarism.
Description
Technical field
The invention discloses a text plagiarism detection method and system, and relates to plagiarism detection for a particular text in a massive-text environment. Because plagiarism detection in a massive-text environment requires processing a large amount of text data and performing a large number of matching operations, the corresponding method or system must be fast, accurate, and reasonably robust against attempts to evade detection.
Background art
In the granted patent "Electronic homework anti-plagiarism system and method based on paragraph plagiarism detection" (application number 201310631663.9), the text is divided into paragraphs, statistics such as keyword frequencies are collected for each paragraph to generate a vector, and the similarity between paragraphs is then computed with the cosine function. This approach can detect plagiarism between paragraphs, and if one article plagiarizes several articles, all of the plagiarized articles can be detected. Its shortcoming is that every paragraph vector of the article under detection must be compared by cosine calculation with every paragraph vector of all copyrighted chapters in the library. The amount of computation is enormous, and once the number of copyrighted chapters entered grows even moderately, detection becomes very slow.
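To make the cost of this prior-art approach concrete, the following is a minimal Python sketch of paragraph-level cosine comparison. It is an illustration under stated assumptions, not the cited patent's implementation: whitespace tokenisation stands in for keyword extraction, and the names `paragraph_vector`, `cosine` and `all_pairs` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def paragraph_vector(paragraph: str) -> Counter:
    # Assumption: whitespace tokenisation stands in for keyword extraction.
    return Counter(paragraph.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def all_pairs(query_paragraphs: List[str], library_paragraphs: List[str]) -> List[List[float]]:
    """Compare every query paragraph with every library paragraph.

    The nested comparison is what makes the cost grow in proportion to
    (query paragraphs) x (library paragraphs).
    """
    queries = [paragraph_vector(p) for p in query_paragraphs]
    library = [paragraph_vector(p) for p in library_paragraphs]
    return [[cosine(q, doc) for doc in library] for q in queries]
```

Every query paragraph is compared with every library paragraph, so doubling the library doubles the work per query, which is the scaling problem described above.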
As shown in Fig. 1 of the accompanying drawings, the patent document under examination with application number CN201510112689.1, entitled "A paper similarity detection method", discloses a text similarity detection method. It comprises:
Step (a): perform Chinese word segmentation on the text under detection;
Step (b): apply stop-word processing to the segmented text: words that are stop words are deleted from the text, and the words remaining in the text are treated as keywords;
Step (c): screen the sentences, deleting those whose number of keywords is less than a preset value K;
Step (d): encode each word in the screened text using GB2312 encoding;
Step (e): apply a fingerprint selection function to the codes to delete unnecessary codes, obtaining the fingerprint sequence of the text under detection;
Step (f): compare the fingerprint sequence with the fingerprint sequences in the paper library; where there is continuous overlap, the overlapping part is defined as a suspected plagiarized paragraph;
Step (g): locate the suspected plagiarized part in the corresponding paragraph of the corresponding document in the paper library and perform exact matching by string matching; a paragraph confirmed by exact matching is defined as a plagiarized paragraph.
This method extracts fingerprints from the GB2312 encoding of sentences by deleting stop words and applying a fingerprint selection function, then compares them with the fingerprint sequences of all copyrighted chapters in the library. Parts with continuous overlap are defined as suspected plagiarism, and exact string matching is finally performed on the suspected parts. The amount of computation is relatively small, and the method can also handle a text that plagiarizes several texts. However, because it ultimately requires exact string matching, it is powerless against plagiarized content that has been slightly altered: a text in which information such as the time or place has been changed cannot be detected. The method also has to compare against every fingerprint in the library, so although it can cope with a medium-sized collection of entered texts, it becomes very slow when the collection grows too large.
As can be seen, existing text plagiarism detection schemes require an enormous amount of computation, run slowly and have poor robustness; they cannot accurately detect slightly altered plagiarized content and are not suited to plagiarism detection in a massive-text environment.
Summary of the invention
To overcome the defects of existing text plagiarism detection systems, which cannot adapt to a massive-text environment and cannot accurately detect slightly modified plagiarized content, the present invention provides a text plagiarism detection system comprising:
A text sentence-splitting module, used to split a text into several sentences;
A sentence screening and refinement module, used to screen and refine the sentences produced by the text sentence-splitting module, so that fingerprints need not be extracted for every sentence in the text or for every word in a sentence;
A sentence fingerprint extraction module, used to extract fingerprints from the screened and refined sentences;
A local search engine module, used on the one hand to build a matching fingerprint library by entering into the index of the local search engine, in the form <text identifier, fingerprint set> with each text identifier guaranteed to be unique, the fingerprint sets obtained by processing the collection of original texts through the text sentence-splitting module, the sentence screening and refinement module and the sentence fingerprint extraction module; and on the other hand to match the set of fingerprints obtained by processing the text under detection through the same three modules against the matching fingerprint library, to determine that sentences of the text under detection that share a fingerprint with an original text are plagiarized sentences, and to output the detection result;
A plagiarized-sentence marking module, used to mark, according to the detection result output by the local search engine module, the plagiarized source texts and the corresponding plagiarized sentences in them.
Further, the sentence-splitting module treats every symbol appearing in the text that is not Chinese, English or a digit as a separator and splits the text into individual sentences; detecting and matching sentence by sentence works well when one text plagiarizes several texts.
The sentence screening and refinement module, on the one hand, deletes short sentences: doing so lowers the probability that a match is pure coincidence and greatly reduces the total number of fingerprints, relieving pressure on the search engine. On the other hand, it deletes named-entity information such as personal names, place names, organization names and times from the sentences. This information passes through two layers of filtering. In the first layer, a named entity recognition model trained with a conditional random field identifies named entities, which are judged to be redundant information and deleted. In the second layer, a common-word dictionary is built (all texts in a large text corpus are word-segmented and the most frequent words are selected as common words); each sentence is then word-segmented, and any word in the segmentation result that is absent from the common-word dictionary is judged to be a redundant word and deleted.
The sentence fingerprint extraction module converts the screened sentences into fingerprints. Specifically, it extracts the original fingerprint of each sentence with the MD5 algorithm; then, provided that the repetition rate across all fingerprints remains sufficiently small, it truncates the original fingerprint to a certain length and maps the numeric fingerprint to a character fingerprint (each character representing several digits), thereby shortening the fingerprint, reducing the index size of the local search engine and improving search efficiency.
The local search engine module is implemented on the Lucene technical architecture. It is used on the one hand to enter new original texts and on the other hand to search for the texts plagiarized by the text under detection; entry and search can proceed simultaneously, and the index can be hot-backed up.
The plagiarized-sentence marking module determines the texts suspected of having been plagiarized from the output of the local search engine module, determines the position of each plagiarized sentence in the text under detection and in the suspected source text, and marks it.
Correspondingly, the present invention also provides a text plagiarism detection method comprising the following steps:
A. Split the text into sentences, dividing one text into several sentences;
B. Screen and refine the split sentences, so that fingerprints need not be extracted for every sentence in the text or for every word in a sentence;
C. Extract fingerprints from the screened and refined sentences;
D. Enter the fingerprint sets extracted from the collection of copyrighted original texts according to steps A to C into the search engine system in the form <text identifier, fingerprint set>, guaranteeing that each text identifier is unique;
E. Input the set of fingerprints extracted from the text under detection according to steps A to C into the search engine for matching, determine that sentences of the text under detection that share a fingerprint with a copyrighted original text are plagiarized sentences, and output the detection result;
F. Mark the plagiarized sentences according to the detection result output by step E.
The text plagiarism detection system and method provided by the present invention screen and refine the sentences obtained by splitting the text. This overcomes the drawback of existing text plagiarism detection techniques, which must extract fingerprints for every sentence in the text and every word in a sentence: it reduces the number of sentences that need fingerprinting and also shortens each sentence's fingerprint, while slightly modified plagiarized content can still be detected, which strengthens the robustness of the system. Mapping the numeric fingerprints extracted from the screened sentences to character fingerprints further shortens the fingerprints, reduces the index size of the local search engine and improves search efficiency. Compared with the prior art, the text plagiarism detection system and method provided by the invention are better suited to plagiarism detection in a massive-text environment.
Brief description of the drawings
Fig. 1 is a flowchart of an existing text similarity detection method;
Fig. 2 is a block diagram of the text plagiarism detection system provided by the present invention.
Detailed description of the embodiments
To make the technical problem solved by the present invention, its technical solution and its beneficial effects clearer, the invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. Referring to Fig. 2, the present invention provides a text plagiarism detection system comprising: a text sentence-splitting module (1), a sentence screening and refinement module (2), a sentence fingerprint extraction module (3), a local search engine module (4) and a plagiarized-sentence marking module (5).
The text sentence-splitting module (1) is used to split a text into several sentences. Module (1) treats every symbol appearing in the text that is not Chinese, English or a digit as a separator and splits the text into sentences accordingly. Working sentence by sentence gives good detection results when one text plagiarizes several texts.
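The separator rule can be sketched in a few lines of Python. This is a minimal illustration under assumptions: "Chinese" is approximated by the CJK Unified Ideographs block U+4E00 to U+9FFF, "English" and "digit" by ASCII letters and digits, and the name `split_sentences` is hypothetical.

```python
import re
from typing import List

# Any run of characters that is not a CJK ideograph, an ASCII letter or an
# ASCII digit is treated as a separator (assumed character classes).
_SEPARATOR = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def split_sentences(text: str) -> List[str]:
    """Split a text into sentences at every separator run, dropping empties."""
    return [piece for piece in _SEPARATOR.split(text) if piece]
```

Note that under this rule whitespace is also a separator, so space-delimited English text would be split word by word; the rule is a natural fit for Chinese text, where words are not separated by spaces.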
The sentence screening and refinement module (2) is used to screen and refine the sentences produced by the text sentence-splitting module, so that fingerprints need not be extracted for every sentence in the text or for every word in a sentence. On the one hand, module (2) deletes short sentences whose character count falls below a certain limit, for example sentences of fewer than 10 characters. On the other hand, it identifies and deletes named-entity information in the sentence, such as personal names, place names, organization names and times. This information passes through two layers of filtering. In the first layer, a named entity recognition model trained with a conditional random field identifies the named entities, which are judged to be redundant information and deleted. In the second layer, a common-word dictionary is built (all texts in a large text corpus are word-segmented and the most frequent words are selected as common words); the sentence is then word-segmented, and any word in the segmentation result that is absent from the common-word dictionary is judged to be a redundant word and deleted. Deleting short sentences lowers the probability that a match is pure coincidence and greatly reduces the total number of fingerprints, relieving pressure on the search engine. Deleting personal names, place names, organization names, times and other redundant information handles the case in which a plagiarist alters part of the copied content: plagiarized content with minor modifications, such as changed personal names, place names or organization names, can still be detected accurately, which gives the system a degree of robustness.
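The two-layer filter can be sketched as follows. This is an illustration under several assumptions: sentences arrive already word-segmented, the conditional-random-field model is replaced by a caller-supplied predicate `is_entity`, the short-sentence check is applied before entity removal, and all function names and the `top_n` parameter are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Set


def build_common_dictionary(corpus: Iterable[List[str]], top_n: int) -> Set[str]:
    """Layer-2 dictionary: the top_n most frequent words of a segmented corpus."""
    counts = Counter(word for document in corpus for word in document)
    return {word for word, _ in counts.most_common(top_n)}


def refine_sentence(
    words: List[str],
    is_entity: Callable[[str], bool],
    common_words: Set[str],
    min_chars: int = 10,
) -> Optional[str]:
    """Return the refined sentence, or None when the sentence is dropped."""
    # Short-sentence filter (assumption: applied before entity removal).
    if sum(len(word) for word in words) < min_chars:
        return None
    # Layer 1: drop words flagged as named entities (stand-in for the CRF model).
    kept = [word for word in words if not is_entity(word)]
    # Layer 2: drop words that are absent from the common-word dictionary.
    kept = [word for word in kept if word in common_words]
    return "".join(kept) or None
```

Because a substituted name is removed by either layer, two sentences that differ only in a personal name reduce to the same refined string and would therefore receive the same fingerprint.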
The sentence fingerprint extraction module (3) is used to extract fingerprints from the screened and refined sentences. Module (3) extracts the original numeric fingerprint of each sentence with the MD5 algorithm. Then, provided that the repetition rate across all fingerprints remains sufficiently small, it truncates the original fingerprint to a numeric fingerprint of a certain length and maps it to a character fingerprint (each character representing several digits), thereby shortening the fingerprint, reducing the index size of the local search engine and improving search efficiency.
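One possible reading of this truncate-and-map step is sketched below. The source does not specify the truncation length or the character mapping, so both are assumptions here: the leading 48 bits of the MD5 digest are kept, and each output character encodes 6 bits using a hypothetical 64-symbol alphabet.

```python
import hashlib

# Hypothetical 64-symbol alphabet; each character carries 6 bits of the hash.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_"


def sentence_fingerprint(sentence: str, bits: int = 48) -> str:
    """Truncate the MD5 digest to its leading `bits` bits and encode 6 bits per character."""
    if bits <= 0 or bits > 128 or bits % 6 != 0:
        raise ValueError("bits must be a positive multiple of 6 no larger than 128")
    digest = hashlib.md5(sentence.encode("utf-8")).digest()  # 128-bit digest
    value = int.from_bytes(digest, "big") >> (128 - bits)    # keep the leading bits
    chars = []
    for _ in range(bits // 6):
        chars.append(ALPHABET[value & 0x3F])  # lowest 6 bits become one character
        value >>= 6
    return "".join(reversed(chars))
```

With the default of 48 bits the fingerprint is 8 characters long, compared with 12 hexadecimal digits for the same bits or 32 for the full digest. A shorter truncation raises the collision rate, which is why the text conditions the step on the repetition rate staying sufficiently small.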
The local search engine module (4) is used on the one hand to build the matching fingerprint library: for each text in the collection of original texts, the fingerprint set obtained through the text sentence-splitting module, the sentence screening and refinement module and the sentence fingerprint extraction module is entered in the form <text identifier, fingerprint set>, with each text identifier guaranteed to be unique. On the other hand, it matches the set of fingerprints obtained by processing the text under detection through the same three modules against the matching fingerprint library, determines that sentences of the text under detection that share a fingerprint with an original text are plagiarized sentences, and outputs the detection result. The local search engine module (4) allows entry and search to proceed simultaneously and supports hot backup of the index; it can be implemented on the Lucene technical architecture. The detection result output by module (4) consists of m plagiarized-text identifier entries sorted in descending order of the number of plagiarized sentences, where each entry has the form <text identifier, number of plagiarized sentences>.
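The entry and matching behaviour can be illustrated with a small in-memory inverted index. This sketch is a stand-in for the Lucene-based engine, not a use of the Lucene API; the class name `FingerprintIndex` and its methods are hypothetical, and concurrency and hot backup are not modelled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class FingerprintIndex:
    """In-memory inverted index: fingerprint -> identifiers of texts containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)
        self._text_ids: Set[str] = set()

    def add_text(self, text_id: str, fingerprints: Iterable[str]) -> None:
        """Enter one <text identifier, fingerprint set> record; identifiers must be unique."""
        if text_id in self._text_ids:
            raise ValueError(f"duplicate text identifier: {text_id}")
        self._text_ids.add(text_id)
        for fingerprint in fingerprints:
            self._postings[fingerprint].add(text_id)

    def search(self, fingerprints: Iterable[str]) -> List[Tuple[str, int]]:
        """Return <text identifier, matched-fingerprint count> entries, largest count first."""
        hits: Dict[str, int] = defaultdict(int)
        for fingerprint in set(fingerprints):  # count each distinct fingerprint once
            for text_id in self._postings.get(fingerprint, ()):
                hits[text_id] += 1
        # Ties are broken by identifier so the ordering is deterministic.
        return sorted(hits.items(), key=lambda item: (-item[1], item[0]))
```

A lookup touches only the posting lists of the query's own fingerprints, so its cost depends on the number of query sentences and matches rather than on the total number of indexed texts.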
The plagiarized-sentence marking module (5) is used to mark, according to the detection result output by the local search engine module, the plagiarized source texts and the corresponding plagiarized sentences in them. Specifically, the texts corresponding to the text identifiers in the top M plagiarized-text identifier entries of the output of the local search engine module are determined to be the texts suspected of having been plagiarized, where M is an integer freely chosen by the user and M ≤ m. From the content of each plagiarized source text and the content of the text under detection, the position of each plagiarized sentence in the text under detection and in the suspected source text is determined and marked.
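A minimal sketch of the marking step follows. It assumes the ranked result is a list of (text identifier, count) pairs, that source texts are available in a dictionary keyed by identifier, and that a caller-supplied `fingerprint` function stands in for the combined screening and fingerprinting steps; all names are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

# A sentence is a maximal run of CJK ideographs, ASCII letters or ASCII digits
# (assumed character classes, mirroring the separator rule).
_SENTENCE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]+")


def sentence_spans(text: str) -> List[Tuple[str, Span]]:
    """Return every sentence together with its character offsets in the text."""
    return [(match.group(), match.span()) for match in _SENTENCE.finditer(text)]


def mark_plagiarism(
    query: str,
    sources: Dict[str, str],
    ranked: List[Tuple[str, int]],
    top_m: int,
    fingerprint: Callable[[str], str],
) -> List[Tuple[str, Span, Span]]:
    """For the top M ranked texts, pair each matching query span with its source span."""
    query_items = [(fingerprint(s), span) for s, span in sentence_spans(query)]
    marks: List[Tuple[str, Span, Span]] = []
    for text_id, _count in ranked[:top_m]:
        source_by_fp: Dict[str, Span] = {}
        for sentence, span in sentence_spans(sources[text_id]):
            source_by_fp.setdefault(fingerprint(sentence), span)  # keep first occurrence
        for fp, query_span in query_items:
            if fp in source_by_fp:
                marks.append((text_id, query_span, source_by_fp[fp]))
    return marks
```

Each returned triple gives the source text identifier, the span of the plagiarized sentence in the text under detection, and the span of the matching sentence in the source text, which is enough to highlight both sides.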
Further, the present invention also provides a text plagiarism detection method comprising the following steps:
A. Split the text into sentences, dividing one text into several sentences;
B. Screen and refine the split sentences, so that fingerprints need not be extracted for every sentence in the text or for every word in a sentence;
C. Extract fingerprints from the screened and refined sentences;
D. Enter the fingerprint sets extracted from the collection of copyrighted original texts according to steps A to C into the search engine system in the form <text identifier, fingerprint set>, guaranteeing that each text identifier is unique;
E. Input the set of fingerprints extracted from the text under detection according to steps A to C into the search engine for matching, determine that sentences of the text under detection that share a fingerprint with a copyrighted original text are plagiarized sentences, and output the detection result;
F. Mark the plagiarized sentences according to the detection result output by step E.
Step A, splitting the text into sentences, is specifically: every symbol appearing in the text that is not Chinese, English or a digit is treated as a separator, and the text is split into several sentences.
Step B, screening and refining the split sentences, specifically includes: on the one hand, deleting short sentences whose character count falls below a certain limit, for example sentences of fewer than 10 characters; on the other hand, using a conditional random field algorithm to identify and delete named-entity information in the sentence such as personal names, place names, organization names and times, and building a common-word dictionary (all texts in a large text corpus are word-segmented and the most frequent words are selected as common words), then word-segmenting the sentence and deleting as redundant any word in the segmentation result that is absent from the common-word dictionary.
Step C, extracting fingerprints from the screened and refined sentences, is specifically: the original numeric fingerprint of each sentence is extracted with the MD5 algorithm; then, provided that the repetition rate across all fingerprints remains sufficiently small, a numeric fingerprint of a certain length is truncated from the original fingerprint and mapped to a character fingerprint (each character representing several digits), thereby shortening the fingerprint, reducing the index size of the local search engine and improving search efficiency.
The detection result generated by step E consists of m plagiarized-text identifier entries sorted in descending order of the number of plagiarized sentences, where each entry has the form <text identifier, number of plagiarized sentences>.
Step F, marking the plagiarized sentences according to the detection result of step E, is specifically: the texts corresponding to the text identifiers in the top M plagiarized-text identifier entries of the detection result of the local search engine module are determined to be the texts suspected of having been plagiarized, where M is an integer freely chosen by the user and M ≤ m; from the content of each suspected source text and the content of the text under detection, the position of each plagiarized sentence in the text under detection and in the suspected source text is determined and marked.
Compared with the prior art, the present invention has the following advantages:
1. Deleting short sentences and using truncated character fingerprints greatly reduces the number and length of sentence fingerprints, and the inverted index of a search engine enables fast retrieval over massive text collections. For example, over a collection on the order of hundreds of millions of texts (average text length about 2000 characters), retrieval speed reaches tens to hundreds of texts per second, which meets the requirements of industrial applications;
2. Plagiarized content in which personal names, place names, organization names, times and other redundant information have been changed can still be detected, because this information is deleted from each sentence before the fingerprint is extracted;
3. Plagiarized sentences can be accurately identified and marked; because the sentence is the unit of plagiarism determination, the case in which one text plagiarizes several texts can be detected;
4. The local search engine implemented on the Lucene technical architecture can perform entry of new texts and plagiarism retrieval simultaneously.
Claims (13)
1. A text plagiarism detection method, comprising the following steps:
A. splitting a text into several sentences;
B. screening and refining the split sentences, so that fingerprints need not be extracted for every sentence in the text or for every word in a sentence;
C. extracting fingerprints from the screened and refined sentences, each sentence corresponding to one fingerprint;
D. entering the fingerprint sets extracted from a collection of copyrighted original texts according to steps A to C into a local search engine in the form <text identifier, fingerprint set>, and guaranteeing that each text identifier is unique;
E. inputting the set of fingerprints extracted from a text under detection according to steps A to C into the local search engine for matching, determining that sentences of the text under detection that share a fingerprint with a copyrighted original text are plagiarized sentences, and outputting a detection result;
F. marking the plagiarized sentences according to the detection result output by step E.
2. The text plagiarism detection method of claim 1, wherein step A, splitting the text into sentences, is specifically: every symbol appearing in the text that is not Chinese, English or a digit is treated as a separator, and the text is split into several sentences.
3. The text plagiarism detection method of claim 1, wherein step B, screening and refining the split sentences, is specifically: short sentences whose character count falls below a certain limit are deleted, and named entities in the sentence such as personal names, place names, organization names and times, together with other redundant information, are deleted, the redundant information passing through two layers of filtering: in the first layer, a named entity recognition model trained with a conditional random field identifies named entities, which are judged to be redundant information and deleted; in the second layer, according to an established common-word dictionary, words in the sentence that are not contained in the common-word dictionary are judged to be redundant words and deleted, wherein the common-word dictionary is established by word-segmenting all texts in a large text corpus and selecting the most frequent words as common words.
4. The text plagiarism detection method of any one of claims 1-3, wherein step C, extracting fingerprints from the screened and refined sentences, is specifically: the original numeric fingerprint of each sentence is extracted with the MD5 algorithm; then, provided that the repetition rate across all fingerprints remains sufficiently small, a numeric fingerprint of a certain length is truncated from the original fingerprint and mapped to a character fingerprint, each character corresponding to several digits, which further shortens the fingerprint.
5. The text plagiarism detection method of claim 4, wherein the local search engine is implemented on the Lucene technical architecture and can perform entry of new texts and plagiarism retrieval simultaneously.
6. The text plagiarism detection method of claim 5, wherein the detection result output by step E consists of a plurality of plagiarized-text identifier entries sorted in descending order of the number of plagiarized sentences, each entry having the form <text identifier, number of plagiarized sentences>.
7. The text plagiarism detection method of claim 6, wherein step F, marking the plagiarized sentences, is specifically: the top M plagiarized-text identifier entries are selected from the detection result output by step E; from the content of the plagiarized source text corresponding to the text identifier in each entry and the content of the text under detection, the position of each plagiarized sentence in the text under detection and in the plagiarized source text is determined and marked.
8. A text plagiarism detection system, comprising:
a text sentence-splitting module, used to split a text into several sentences;
a sentence screening and refinement module, used to screen and refine the sentences produced by the text sentence-splitting module, so that fingerprints need not be extracted for every sentence in the text or for every word in a sentence;
a sentence fingerprint extraction module, used to extract fingerprints from the screened and refined sentences;
a local search engine module, used on the one hand to build a matching fingerprint library by entering, in the form <text identifier, fingerprint set> with each text identifier guaranteed to be unique, the fingerprint sets obtained by processing a collection of original texts through the text sentence-splitting module, the sentence screening and refinement module and the sentence fingerprint extraction module, and on the other hand to match the set of fingerprints obtained by processing a text under detection through the same three modules against the matching fingerprint library, to determine that sentences of the text under detection that share a fingerprint with an original text are plagiarized sentences, and to output a detection result;
a plagiarized-sentence marking module, used to mark, according to the detection result output by the local search engine module, the plagiarized source texts and the corresponding plagiarized sentences in them.
9. The text plagiarism detection system of claim 8, wherein the text sentence-splitting module splits the text into sentences specifically by treating every symbol appearing in the text that is not Chinese, English or a digit as a separator and splitting the text into several sentences.
10. The text plagiarism detection system of claim 8, wherein the sentence screening and refinement module screens and refines the split sentences specifically as follows: short sentences whose character count falls below a certain limit are deleted, and named entities in the sentence such as personal names, place names, organization names and times, together with other redundant information, are deleted, the redundant information passing through two layers of filtering: in the first layer, a named entity recognition model trained with a conditional random field identifies named entities, which are judged to be redundant information and deleted; in the second layer, according to an established common-word dictionary, words in the sentence that are not contained in the common-word dictionary are judged to be redundant words and deleted, wherein the common-word dictionary is established by word-segmenting all texts in a large text corpus and selecting the most frequent words as common words.
11. The text plagiarism detection system of any one of claims 8-10, wherein the sentence fingerprint extraction module extracts fingerprints from the screened and refined sentences specifically as follows: the original numeric fingerprint of each sentence is extracted with the MD5 algorithm; then, provided that the repetition rate across all fingerprints remains sufficiently small, a numeric fingerprint of a certain length is truncated from the original fingerprint and mapped to a character fingerprint, each character corresponding to several digits, which further shortens the fingerprint.
12. The text plagiarism detection system of claim 11, wherein the local search engine is implemented on the Lucene technical architecture and can perform entry of new texts and plagiarism retrieval simultaneously; the detection result it outputs consists of a plurality of plagiarized-text identifier entries sorted in descending order of the number of plagiarized sentences, each entry having the form <text identifier, number of plagiarized sentences>.
13. The text plagiarism detection method of claim 12, wherein the plagiarized-sentence marking module marks the plagiarized sentences specifically as follows: the texts corresponding to the text identifiers in the top M plagiarized-text identifier entries of the detection result output by the local search engine module are determined to be the texts suspected of having been plagiarized; from the content of the plagiarized source text corresponding to the text identifier in each entry and the content of the text under detection, the position of each plagiarized sentence in the text under detection and in the suspected source text is determined and marked.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711167027.XA CN110019674A (en) | 2017-11-21 | 2017-11-21 | A kind of text plagiarizes detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110019674A (en) | 2019-07-16 |
Family
ID=67186481
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711167027.XA Pending CN110019674A (en) | 2017-11-21 | 2017-11-21 | A kind of text plagiarizes detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110019674A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110543622A (en) * | 2019-08-02 | 2019-12-06 | 北京三快在线科技有限公司 | Text similarity detection method and device, electronic equipment and readable storage medium |
CN112163579A (en) * | 2020-09-30 | 2021-01-01 | 江苏安全技术职业学院 | Student thought report analysis system and method based on text semantic feature analysis |
CN112989793A (en) * | 2021-05-17 | 2021-06-18 | 北京创新乐知网络技术有限公司 | Article detection method and device |
CN115563515A (en) * | 2022-12-07 | 2023-01-03 | 粤港澳大湾区数字经济研究院(福田) | Text similarity detection method, device and equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102880631A (en) * | 2012-07-05 | 2013-01-16 | 湖南大学 | Chinese author identification method based on double-layer classification model, and device for realizing Chinese author identification method |
CN104050299A (en) * | 2014-07-07 | 2014-09-17 | 江苏金智教育信息技术有限公司 | Method for paper duplicate checking |
CN104679728A (en) * | 2015-02-06 | 2015-06-03 | 中国农业大学 | Text similarity detection device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190716 |