CN102508879A - Wavelet transform-based method for detecting copying of semi-structured text structure - Google Patents

Wavelet transform-based method for detecting copying of semi-structured text structure

Info

Publication number
CN102508879A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103160545A
Other languages
Chinese (zh)
Other versions
CN102508879B (en)
Inventor
鲍军鹏
苏杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN 201110316054 priority Critical patent/CN102508879B/en
Publication of CN102508879A publication Critical patent/CN102508879A/en
Application granted granted Critical
Publication of CN102508879B publication Critical patent/CN102508879B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention provides a wavelet transform-based method for detecting copying of semi-structured text structure. The invention aims to detect copying and plagiarism of text structure, to help inspectors complete the whole detection process quickly and correctly, to reduce the risk of false and missed detections, and to shorten detection time. The method comprises at least the following steps: cleaning the semi-structured text; encoding the semi-structured text; obtaining the structural features of the semi-structured text through a wavelet transform; and computing the structural similarity and determining whether the structures are copied. The method converts the semi-structured text into a time series and obtains the structural features with a wavelet transform. This captures local structural features and details better, which improves the accuracy of detecting locally similar structures.

Description

A wavelet transform-based method for detecting copying of semi-structured text structure
Technical field:
The invention belongs to the fields of intelligent information processing and computer technology, and specifically relates to an accurate and effective method for detecting the copying and plagiarism of semi-structured text structure.
Background art:
The Internet is developing rapidly, and a massive number of web pages appear every day. The overwhelming majority of these pages, and of the information behind them, are stored as HyperText Markup Language (HTML) text or eXtensible Markup Language (XML) text. Both HTML text and XML text are semi-structured text. Protecting the intellectual property of semi-structured electronic text, and in particular combating illegal copying and plagiarism in the network environment, has become a consensus at home and abroad and is a pressing problem. Copying and plagiarism take many forms, and attending only to global structural similarity tends to reduce the accuracy of copy detection, because many copied texts are not verbatim copies of an entire original but partial copies, splices and the like. Detection therefore has to compare local information, so that locally copied structures are not missed and detection accuracy improves.
Existing methods for detecting copying of semi-structured text fall into three main classes: kernel-matrix methods, tree edit distance methods, and time-series methods based on the Fourier transform. Kernel-matrix methods measure text similarity using a matrix M, which describes the correlation between text structure units and their contribution to the similarity measure, together with the matrix representation of the text in the SLVM model space. Tree edit distance methods transform one text into another and measure similarity by the minimum cost of the transformation. Their obvious drawback is computational cost: the time complexity is O(N^2), where N is the number of elements, that is, the number of tags, in the text, so they are unsuitable for large texts. Fourier-transform time-series methods convert the semi-structured text into a time series, obtain time-series features through the Fourier transform, and detect copying from the similarity of the feature vectors. The Fourier transform, however, reflects the overall frequency characteristics of a signal over all time and cannot provide frequency characteristics within a local time window. A Fourier-based method therefore cannot observe the local features and details of a semi-structured text, and its examination of copying is not fine-grained enough.
Summary of the invention:
To address the above problems, the invention provides a wavelet transform-based method for detecting copying of semi-structured text structure. Like the Fourier-based methods, it converts the semi-structured text into a time series, but it obtains the structural features with a wavelet transform. This captures local structural features and details better and thus improves the accuracy of detecting locally copied structures.
The purpose of the invention is to detect copying and plagiarism of text structure, to help inspectors complete the whole detection process quickly and correctly, to reduce the risk of false detections and missed detections, and to shorten the overall detection time.
To achieve this, the method comprises at least the following steps: cleaning the semi-structured text; encoding the semi-structured text; obtaining the structural features of the semi-structured text through a wavelet transform; computing the structural similarity; and deciding whether the structures are copied. Cleaning removes non-standard characters from the semi-structured text and corrects unmatched tags and invalid string formats, turning the original semi-structured text into a well-formed one. Encoding serializes the structure of the semi-structured text into a structure code sequence. Feature extraction applies a wavelet transform to the structure code sequence to obtain the structural feature vector of the semi-structured text. Similarity computation calculates the structure distance between the feature vectors of two semi-structured texts, which gives the similarity between their structures. The decision step judges from this structure distance whether the two texts are structurally copied: if the distance is less than a given threshold the structures are judged copied, otherwise not.
Cleaning the semi-structured text removes non-standard characters from the original semi-structured text (such as & ,), corrects unmatched tags (for example, the tag <img> has no end tag, and the tag should be changed into) and corrects invalid string formats (for example, in a=0 the value 0 is an invalid string and should be changed to a="0"), so that the original semi-structured text finally becomes a well-formed semi-structured text.
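The cleaning step can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `clean_markup`, the list `VOID_TAGS`, the regular expressions, and the choice to repair a tag such as <img> by self-closing it are all assumptions made for illustration.

```python
import re

# Assumption: a small, illustrative set of HTML tags that never carry an end tag.
VOID_TAGS = ("img", "br", "hr", "input", "meta", "link")


def clean_markup(text):
    """Return a tidier version of `text` (illustrative, not a full HTML/XML repair tool)."""
    # 1. Remove character entities such as &nbsp; or &#160;.
    text = re.sub(r"&#?\w+;", "", text)
    # 2. Quote bare attribute values, e.g. a=0 becomes a="0".
    text = re.sub(r'(\s[\w:-]+)=([^\s"\'<>]+?)(?=\s|/?>)', r'\1="\2"', text)
    # 3. Self-close void tags so that every start tag has a matching end (assumed repair).
    void = "|".join(VOID_TAGS)
    text = re.sub(rf"<({void})\b([^<>]*?)\s*/?>", r"<\1\2/>", text)
    return text
```

A production cleaner would more likely rely on a tolerant HTML parser; the regular expressions here only show the three kinds of repair the text describes.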
In encoding the semi-structured text, every tag in the text is paired, consisting of a start tag and an end tag. All start tags are encoded as 1 and all end tags as -1; taking the codes in the order in which the tags appear in the text yields a text structure code sequence, which represents the structural features of the text. The content of the semi-structured text is discarded during encoding and only the tags are kept.
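The encoding rule can be sketched in a few lines of Python. The names `TAG_RE` and `encode_structure` are hypothetical, and treating a self-closing tag as a start tag immediately followed by an end tag is an assumption that the text does not state.

```python
import re

# Matches start tags (<a ...>), end tags (</a>) and self-closing tags (<a/>).
# Comments and processing instructions do not match and are skipped.
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z_][\w:.-]*)[^<>]*?(/?)\s*>")


def encode_structure(text):
    """Return the +1/-1 structure code sequence of a cleaned semi-structured text."""
    codes = []
    for match in TAG_RE.finditer(text):
        is_end, _name, is_self_closing = match.groups()
        if is_end:
            codes.append(-1)
        elif is_self_closing:
            # Assumption: a self-closing tag counts as a start tag plus an end tag.
            codes.extend([1, -1])
        else:
            codes.append(1)
    return codes
```

Because only the tags are matched, the text content between them is dropped automatically, which mirrors the rule that only the tags are kept.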
The structural features of the semi-structured text are obtained through the wavelet transform as follows:
(1) Taking the position at which each tag appears as the time coordinate, represent the text structure code sequence as a square-wave signal;
(2) Apply a wavelet transform with the Haar wavelet to the square-wave signal of the text structure code sequence to obtain the corresponding wavelet coefficient vector;
(3) Compress the wavelet coefficient vector: compare the absolute value of each wavelet coefficient with a specified threshold; points less than or equal to the threshold are set to 0, and points greater than the threshold are set to the difference between the point's value and the threshold, which gives a sparse coefficient sequence filled with zeros;
(4) Square the non-zero coefficients, sort them in descending order, take the m largest values, and record the position of each of these coefficients. This gives a two-dimensional structural feature vector, which is the structural feature of the semi-structured text:
$$F_a = \left[(e_1^a, n_1^a), (e_2^a, n_2^a), \ldots, (e_m^a, n_m^a)\right]$$
where $F_a$ is the structural feature vector of semi-structured text (a), $e_m^a$ is the square of the m-th coefficient of semi-structured text (a) after the wavelet transform, and $n_m^a$ is the position of that coefficient.
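Steps (1) to (4) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the input is zero-padded to a power-of-two length, the Haar transform is the orthonormal full decomposition, positions are counted from 1, and the defaults `threshold=0.1` and `m=10` are illustrative. Because boundary handling and coefficient layout are assumed, the coefficient values need not match those produced by any particular wavelet toolbox.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar decomposition: [approximation, coarsest details, ..., finest details]."""
    n = 1
    while n < len(signal):
        n *= 2
    approx = list(signal) + [0.0] * (n - len(signal))  # assumption: zero-pad to a power of two
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(x - y) / math.sqrt(2) for x, y in pairs] + details
        approx = [(x + y) / math.sqrt(2) for x, y in pairs]
    return approx + details


def soft_threshold(coeffs, threshold):
    """Step (3): zero small coefficients and shrink the others toward zero by the threshold."""
    out = []
    for c in coeffs:
        magnitude = abs(c) - threshold
        out.append(math.copysign(magnitude, c) if magnitude > 0 else 0.0)
    return out


def structure_features(codes, threshold=0.1, m=10):
    """Step (4): the m largest squared coefficients, each paired with its 1-based position."""
    shrunk = soft_threshold(haar_dwt(codes), threshold)
    squared = [(c * c, pos) for pos, c in enumerate(shrunk, start=1) if c != 0.0]
    squared.sort(key=lambda item: (-item[0], item[1]))  # descending value, ties by position
    return squared[:m]
```

Ties between equal squared coefficients are broken by position here so that the ordering is deterministic; the text does not say how ties are handled.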
To compute the structural similarity, the structural feature vectors are first normalized, and the structure distance between two semi-structured texts is then computed as:
$$\mathrm{Dist}(F_a, F_b) = \frac{1}{2}\sum_{i=1}^{m}\left(\tilde{e}_i^a - \tilde{e}_i^b\right)^2 + \frac{1}{2}\sum_{i=1}^{m}\left(\tilde{n}_i^a - \tilde{n}_i^b\right)^2$$
where $\mathrm{Dist}(F_a, F_b)$ is the structure distance between semi-structured text (a) and semi-structured text (b); $\tilde{e}_i^a$ is the normalized square of the i-th wavelet coefficient of semi-structured text (a) and $\tilde{n}_i^a$ is the normalized position of that coefficient; $\tilde{e}_i^b$ is the normalized square of the i-th wavelet coefficient of semi-structured text (b) and $\tilde{n}_i^b$ is the normalized position of that coefficient; and m is the number of wavelet coefficients.
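The distance computation can be sketched as follows. The text does not specify the normalization, so dividing each component by its maximum within the feature vector is an assumption, as are the function names and the padding of a shorter feature vector with (0, 0) pairs. With a different normalization the numeric distances will differ.

```python
def normalize(features):
    """Scale squared coefficients and positions to [0, 1] by their maxima (assumed scheme)."""
    max_e = max((e for e, _ in features), default=0.0) or 1.0
    max_n = max((n for _, n in features), default=0) or 1
    return [(e / max_e, n / max_n) for e, n in features]


def structure_distance(fa, fb):
    """Half the squared difference of the values plus half the squared difference of the positions."""
    fa, fb = normalize(fa), normalize(fb)
    m = max(len(fa), len(fb))
    fa = fa + [(0.0, 0.0)] * (m - len(fa))  # assumption: pad the shorter vector
    fb = fb + [(0.0, 0.0)] * (m - len(fb))
    e_term = sum((ea - eb) ** 2 for (ea, _), (eb, _) in zip(fa, fb))
    n_term = sum((na - nb) ** 2 for (_, na), (_, nb) in zip(fa, fb))
    return 0.5 * e_term + 0.5 * n_term
```

Each feature vector is a list of (squared coefficient, position) pairs ordered by decreasing squared coefficient, so the i-th pairs of the two texts are compared with each other.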
To decide whether the structures are copied, the structure distance between the two semi-structured texts is compared with a given threshold: if the structure distance is less than the threshold the structures are judged copied, otherwise they are judged not copied.
The distance value characterizes the similarity between text structures: the larger the distance, the less similar the structures, and the smaller the distance, the more similar they are. If the distance is less than the given threshold, the structural similarity of the two semi-structured texts exceeds the acceptable level and the structures can be judged copied. This helps inspectors detect copying of text structure.
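The decision rule is a strict comparison against the threshold, which the following small sketch makes explicit. The function name `is_structure_copied` is hypothetical, and the threshold is supplied by the inspector since the text gives no rule for choosing it.

```python
def is_structure_copied(distance, threshold):
    """Judge two texts structurally copied when their distance is strictly below the threshold."""
    return distance < threshold
```

A distance exactly equal to the threshold is therefore treated as not copied.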
The semi-structured text referred to in the method includes eXtensible Markup Language (XML) text and HyperText Markup Language (HTML) text.
Description of the drawings:
Fig. 1 is the flowchart of the XML text structure copy detection method.
Fig. 2 is a schematic diagram of XML structure (a).
Fig. 3 is a schematic diagram of XML structure (b).
Detailed description of the embodiments:
The invention is further described below with reference to the drawings.
The invention provides a wavelet transform-based method for detecting copying of semi-structured text structure. It helps inspectors detect structurally copied semi-structured texts quickly and accurately, reduces missed and false detections, and shortens detection time, thereby combating illegal copying and plagiarism. The basic idea is as follows. First, all content in the well-formed semi-structured text is discarded and only the structural framework formed by the tags is kept. The tags are then encoded, and an ordered code sequence is obtained according to the nesting and order of the tags. This sequence is treated as a time series of equally spaced points on the time axis, so its features can be extracted with a wavelet method; the features of the time-series signal are the structural features of the semi-structured text. Finally, the similarity between the structures of semi-structured texts is measured by the similarity between their feature vectors, which determines whether the text structure is copied.
Following this idea, and with reference to the detection flow shown in Fig. 1, the method comprises at least the following steps: cleaning the semi-structured text (01), encoding the semi-structured text (02), obtaining the structural features of the semi-structured text through a wavelet transform (03), computing the structural similarity (04), and deciding whether the structures are copied (05).
The original semi-structured text is cleaned in step 01 to obtain a normalized semi-structured text. Step 02 encodes its structure, giving a code sequence that fully represents the structure of the semi-structured text. Step 03 applies the wavelet transform to extract the structural features and compresses them into a structural feature vector. Step 04 normalizes the structural feature vectors and computes the structure distance between them, which characterizes the structural similarity. Finally, step 05 compares the result of step 04 with a given threshold to judge whether the structures of the two semi-structured texts are similar, that is, whether one is copied. If the structures are similar they are judged copied (step 07); otherwise they are judged not copied (step 06).
The following is a preferred embodiment provided by the inventors.
Fig. 2 shows the pure structure of an original XML text (a) that remains after the text content is discarded and the tag value information is ignored. The original tag sequence of this text is:
<xml>,<book>,<title>,</title>,<author>,</author>,</book>,<book>,<title>,</title>,<author>,</author>,</book>,</xml>
Fig. 3 shows the pure structure of another original XML text (b) that remains after the text content is discarded and the tag value information is ignored. The original tag sequence of this text is:
<xml>,<book>,<title>,</title>,<author>,</author>,</book>,<book>,<title>,</title>,<author>,</author>,</book>,<book>,<title>,</title>,<author>,</author>,</book>,</xml>
Following step 02 of Fig. 1, the XML text structure is encoded: all start tags (of the form "<>") are encoded as 1 and all end tags (of the form "</>") are encoded as -1. The code sequence corresponding to the XML text structure shown in Fig. 2 is:
Enc(a)={1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,-1}
The code sequence corresponding to the XML text structure shown in Fig. 3 is:
Enc(b)={1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,-1}
Following step 03 of Fig. 1, before the wavelet transform is applied to extract the structural features, the two sequences to be compared must first be brought to the same length. The lengths of Enc(a) and Enc(b) are equalized by appending zeros to the end of the shorter sequence, which gives:
Enc(a)={1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,-1,0,0,0,0,0,0}
Enc(b)={1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,-1}
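The length alignment described above amounts to appending zeros to the shorter sequence. A minimal sketch, with the hypothetical helper name `pad_to_same_length`:

```python
def pad_to_same_length(seq_a, seq_b):
    """Append zeros to the shorter code sequence so that both have the same length."""
    n = max(len(seq_a), len(seq_b))
    padded_a = list(seq_a) + [0] * (n - len(seq_a))
    padded_b = list(seq_b) + [0] * (n - len(seq_b))
    return padded_a, padded_b
```

Applied to the two code sequences of the embodiment, it pads Enc(a) from 14 to 20 entries and leaves Enc(b) unchanged.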
The Haar wavelet transform is then applied to the above signals at the maximum scale, giving their coefficient sequences:
Coef(a)={0,0,1,0,0.7071,0.7071,0,1,0,0,-1,0,0,1.4142,1.4142,-1.4142,1.4142,1.4142,0,0,0,0}
Coef(b)={0.5,-2,0.5,0,0.7071,0,0,1,0,0,0,1,0,1.4142,1.4142,-1.4142,1.4142,1.4142,-1.4142,1.4142,1.4142,0}
Next, compressing the wavelet coefficients gives the structural feature vectors of the two XML texts:
F_a=[(1.7272,14),(1.7272,15),(1.7272,16),(1.7272,17),(1.7272,18),(0.81,3),(0.81,8),(0.81,11),(0.3686,5),(0.3686,6)]
F_b=[(1.6716,2),(0.5,14),(0.5,15),(0.5,16),(0.5,17),(0.5,18),(0.5,19),(0.5,20),(0.5,21),(0.0858,8)]
Finally, their structure distance is computed as:
Dist(F_a, F_b) = 3.9103
Suppose the given threshold is 4. Since Dist(F_a, F_b) = 3.9103 < 4, the structures of text (a) and text (b) are judged to be copied.

Claims (5)

1. A wavelet transform-based method for detecting copying of semi-structured text structure, characterized by comprising the following steps: cleaning the semi-structured text, encoding the semi-structured text, obtaining the structural features of the semi-structured text through a wavelet transform, computing the structural similarity, and deciding whether the structures are copied;
cleaning the semi-structured text removes non-standard characters from the semi-structured text, corrects unmatched tags and invalid string formats, and turns the original semi-structured text into a well-formed semi-structured text;
encoding the semi-structured text serializes the structure of the semi-structured text to obtain a structure code sequence;
obtaining the structural features of the semi-structured text through a wavelet transform means applying a wavelet transform to the structure code sequence to obtain the structural feature vector of the semi-structured text; computing the structural similarity calculates the structure distance between the feature vectors of the semi-structured texts, thereby obtaining the similarity between the structures of the semi-structured texts;
deciding whether the structures are copied judges, from the structure distance between the structures of the semi-structured texts, whether two semi-structured texts are structurally copied: if the structure distance is less than a given threshold the structures are judged copied, otherwise not.
2. The copy detection method according to claim 1, characterized in that, in encoding the semi-structured text, every tag in the semi-structured text is paired, consisting of a start tag and an end tag; all start tags are encoded as 1 and all end tags as -1, and taking the codes in the order in which the tags appear in the text yields a text structure code sequence, which represents the structural features of the text; the content of the semi-structured text is discarded during encoding and only the tags are kept.
3. The copy detection method according to claim 1, characterized in that, in obtaining the structural features of the semi-structured text through a wavelet transform:
(1) taking the position at which each tag appears as the time coordinate, the text structure code sequence is represented as a square-wave signal;
(2) a wavelet transform with the Haar wavelet is applied to the square-wave signal of the text structure code sequence to obtain the corresponding wavelet coefficient vector;
(3) the wavelet coefficient vector is compressed: the absolute value of each wavelet coefficient is compared with a specified threshold; points less than or equal to the threshold are set to 0, and points greater than the threshold are set to the difference between the point's value and the threshold, which gives a sparse coefficient sequence filled with zeros;
(4) the non-zero coefficients are squared and sorted in descending order, the m largest values are taken, and the position of each of these coefficients is recorded, which gives a two-dimensional structural feature vector, namely the structural feature of the semi-structured text.
4. The copy detection method according to claim 1, characterized in that, in computing the structural similarity, the structural feature vectors are first normalized, and the structure distance between two semi-structured texts is then computed as:
$$\mathrm{Dist}(F_a, F_b) = \frac{1}{2}\sum_{i=1}^{m}\left(\tilde{e}_i^a - \tilde{e}_i^b\right)^2 + \frac{1}{2}\sum_{i=1}^{m}\left(\tilde{n}_i^a - \tilde{n}_i^b\right)^2$$
where $\mathrm{Dist}(F_a, F_b)$ is the structure distance between semi-structured text a and semi-structured text b; $\tilde{e}_i^a$ is the normalized square of the i-th wavelet coefficient of semi-structured text a and $\tilde{n}_i^a$ is the normalized position of that coefficient; $\tilde{e}_i^b$ is the normalized square of the i-th wavelet coefficient of semi-structured text b and $\tilde{n}_i^b$ is the normalized position of that coefficient; and m is the number of wavelet coefficients.
5. The copy detection method according to claim 1, characterized in that the semi-structured text includes eXtensible Markup Language (XML) text and HyperText Markup Language (HTML) text.
CN 201110316054 2011-10-18 2011-10-18 Wavelet transform-based method for detecting copying of semi-structured text structure Expired - Fee Related CN102508879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110316054 CN102508879B (en) 2011-10-18 2011-10-18 Wavelet transform-based method for detecting copying of semi-structured text structure

Publications (2)

Publication Number Publication Date
CN102508879A true CN102508879A (en) 2012-06-20
CN102508879B CN102508879B (en) 2013-07-31

Family

ID=46220965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110316054 Expired - Fee Related CN102508879B (en) 2011-10-18 2011-10-18 Wavelet transform-based method for detecting copying of semi-structured text structure

Country Status (1)

Country Link
CN (1) CN102508879B (en)

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JULINDA GLLAVATA等: "《Text Detection in Images Based on Unsupervised Classification of High-Frequency Wavelet Coefficients》", 《PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION》 *
JUN PENG BAO等: "《Copy detection in Chinese documents using Ferret》", 《LANG RESOURCES & EVALUATION》 *
JUNPENG BAO等: "《Comparing Different Text Similarity Methods》", 《UH COMPUTER SCIENCE TECHNICAL REPORT》 *
周治平等: "《基于小波和不变矩的图像copy-move篡改盲检测》", 《信息网络安全》 *
鲍军鹏等: "《自然语言文档复制检测研究综述》", 《软件学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644104A (en) * 2017-10-17 2018-01-30 北京锐安科技有限公司 A kind of text feature and system
CN107644104B (en) * 2017-10-17 2021-06-25 北京锐安科技有限公司 Text feature extraction method and system

Also Published As

Publication number Publication date
CN102508879B (en) 2013-07-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130731

Termination date: 20171018