CN102508879A - Wavelet transform-based method for detecting copying of semi-structured text structure - Google Patents
- Publication number
- CN102508879A (application CN201110316054A)
- Authority
- CN
- China
- Prior art keywords
- semi
- structured text
- text
- structured
- coefficient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention provides a wavelet transform-based method for detecting copying of semi-structured text structure. It aims to detect copying and plagiarism of text structure, to help inspectors complete the whole detection process quickly and correctly, to reduce the risk of false and missed detections, and to shorten detection time. The method comprises at least the following steps: cleaning the semi-structured text; encoding the semi-structured text; acquiring the structural features of the semi-structured text through a wavelet transform; computing the structural similarity; and determining whether the structures are similar. The method converts the semi-structured text into a time series and acquires its structural features by wavelet transform. This captures local structural features and details better, and so improves the accuracy of detecting locally similar structures.
Description
Technical field:
The invention belongs to the fields of intelligent information processing and computer technology, and specifically relates to an accurate and effective method for detecting copying and plagiarism of semi-structured text structure.
Background technology:
The network is developing rapidly, and a massive number of web pages appear every day. The overwhelming majority of these pages, and of the information behind them, are formatted as HyperText Markup Language (HTML) text or Extensible Markup Language (XML) text. HTML text and XML text are both semi-structured text. Protecting the intellectual property of semi-structured electronic text, and in particular combating illegal copying and plagiarism in a network environment, is widely agreed to be necessary at home and abroad, and is a pressing current problem. Copying and plagiarism take many complex forms, and attending only to global structural similarity tends to reduce the accuracy of copy detection, because many copied texts are not taken over wholesale but are partially copied, spliced together, and so on. Local information must therefore be compared, so that fewer locally copied structures are missed and copy detection becomes more accurate.
Current methods for detecting copying of semi-structured text fall into three main classes: methods based on a kernel matrix, methods based on tree edit distance, and time-series methods based on the Fourier transform. Kernel-matrix methods measure text similarity using a matrix M, which describes the correlation between structural units of the text and their degree of contribution to the similarity measure, together with the matrix of the text in the SLVM model space. Tree edit distance methods transform one text into another and measure similarity by the minimum cost of the transformation. Their obvious drawback is computational cost: the time complexity is $O(N^2)$, where N is the number of elements (tags) in the text, so tree edit distance methods are not suitable for large texts. Time-series methods based on the Fourier transform convert the semi-structured text into a time series, obtain time-series features through the Fourier transform, and then perform copy detection according to the similarity of the feature vectors. However, the Fourier transform reflects the overall frequency characteristics of a signal over all time and cannot provide frequency characteristics within a local time window. The Fourier transform therefore cannot reveal the local features and details of a semi-structured text, and its examination of structural similarity is not fine-grained enough.
Summary of the invention:
To address the above problems, the invention provides a wavelet transform-based method for detecting copying of semi-structured text structure. This method also converts the semi-structured text into a time series, but obtains the structural features with a wavelet transform. Local structural features and details are captured better this way, which improves the accuracy of detecting locally copied structures.
The purpose of the method is to detect copying and plagiarism of text structure, to help inspectors complete the whole detection process quickly and correctly, to reduce the risk of false and missed detections, and to shorten the overall detection time.
To achieve this, the method comprises at least the following steps: cleaning the semi-structured text, encoding the semi-structured text, obtaining the structural features of the semi-structured text through a wavelet transform, computing the structural similarity, and deciding whether the structure is copied. Cleaning removes non-standard characters from the semi-structured text and corrects unmatched tags and invalid string formats, turning the original text into a standards-compliant semi-structured text. Encoding serializes the structure of the semi-structured text into a structure code sequence. Obtaining the structural features through a wavelet transform means applying a wavelet transform to the structure code sequence to obtain the structural feature vector of the text. Computing the structural similarity means computing the structure distance between the feature vectors of two semi-structured texts, which gives the similarity between their structures. Deciding whether the structure is copied means judging from the structure distance whether two semi-structured texts have copied structure: if the structure distance is less than a given threshold, the structure is judged copied; otherwise it is not.
Cleaning the semi-structured text means removing non-standard characters (such as &) from the original semi-structured text and correcting unmatched tags (for example the tag <img >, which has no end tag and must be given one) and invalid string formats (for example a=0, where 0 is an invalid string and should be written a="0"), so that the original semi-structured text becomes a standards-compliant semi-structured text.
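One part of the cleaning step, quoting attribute values such as a=0, can be sketched in Python. This is a minimal illustration under stated assumptions, not the cleaning procedure of the invention: the function name `quote_attribute_values` and both regular expressions are hypothetical, and the sketch does not repair unmatched tags or handle quoted values that themselves contain `name=value` text.

```python
import re

# One tag: "<", any run of characters that are not angle brackets, ">".
TAG = re.compile(r"<[^<>]+>")
# An attribute whose value is not quoted, e.g. ` a=0` (hypothetical pattern).
UNQUOTED_ATTR = re.compile(r"(\s[\w:-]+)=([^\s\"'>]+)")


def quote_attribute_values(text: str) -> str:
    """Wrap unquoted attribute values in double quotes, inside tags only."""
    def fix_tag(match: re.Match) -> str:
        # Rewrite ` name=value` as ` name="value"` within this one tag.
        return UNQUOTED_ATTR.sub(r'\1="\2"', match.group(0))

    # Text between tags is left untouched because only tags are rewritten.
    return TAG.sub(fix_tag, text)
```

Restricting the substitution to the inside of tags keeps ordinary text content such as `x=1` unchanged.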
Encoding the semi-structured text: every tag in a semi-structured text is paired, consisting of a start tag and an end tag. Every start tag is encoded as 1 and every end tag as -1; taking the codes in the order in which the tags appear in the text gives a text structure code sequence, which represents the structural features of the text. During encoding the content of the semi-structured text is discarded and only the tags are kept.
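The encoding rule can be sketched in a few lines of Python. This is an illustration only: the function name `encode_structure` and the regular expression are hypothetical, and the sketch assumes the input has already been cleaned so that every tag is paired (declarations and self-closing tags are not treated specially).

```python
import re

# A tag: optional "/" marking an end tag, a tag name, then anything up to ">".
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*>")


def encode_structure(text: str) -> list[int]:
    """Return the +1/-1 structure code sequence of a cleaned text."""
    codes = []
    for match in TAG.finditer(text):
        is_end_tag = match.group(1) == "/"
        codes.append(-1 if is_end_tag else 1)  # start tag -> 1, end tag -> -1
    return codes
```

Text content between tags never matches the pattern, so it is dropped automatically and only the tag order contributes to the sequence.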
The structural features of the semi-structured text are obtained through the wavelet transform as follows:
(1) Take the position at which each tag appears as the time coordinate, which gives a square-wave signal representation of the text structure code sequence;
(2) Apply a wavelet transform with the Haar wavelet to the square-wave signal of the text structure code sequence to obtain the corresponding wavelet coefficient vector;
(3) Compress the wavelet coefficient vector: compare the absolute value of each wavelet coefficient with a specified threshold; points less than or equal to the threshold become 0, and points greater than the threshold become the difference between the point's value and the threshold. This gives a sparse coefficient sequence filled with 0 values;
(4) Square the non-zero coefficients, sort them in descending order, take the m largest values, and record the position of each of these coefficients. This gives a two-dimensional structural feature vector, i.e. the structural feature of the semi-structured text:
$$F_a = \left[(c_{a,1}^{2}, p_{a,1}), (c_{a,2}^{2}, p_{a,2}), \ldots, (c_{a,m}^{2}, p_{a,m})\right]$$
where $F_a$ is the structural feature vector of semi-structured text (a), $c_{a,m}^{2}$ is the square of the m-th coefficient of semi-structured text (a) after the wavelet transform, and $p_{a,m}$ is the position of that coefficient.
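Steps (3) and (4) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `structural_features` and its parameters are hypothetical, positions are counted from 1 to match the worked example in the embodiment, and leaving equal squared values in their original order is an assumption.

```python
def structural_features(coefficients, threshold, m):
    """Soft-threshold, square, and keep the m largest coefficients.

    Returns a list of (squared_value, position) pairs, positions 1-based.
    """
    pairs = []
    for position, value in enumerate(coefficients, start=1):
        shrunk = abs(value) - threshold      # step (3): subtract the threshold
        if shrunk > 0:                       # values <= threshold become 0
            pairs.append((shrunk ** 2, position))  # step (4): square
    # Largest squared value first; Python's sort is stable, so ties keep
    # their original (position) order even with reverse=True.
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:m]
```

Applied to the Coef(a) sequence of the worked example with m = 10, a threshold of 0.1 yields the positions listed in F_a; that threshold value is inferred from the example's numbers and is not stated in the text.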
To compute the structural similarity, the structural feature vectors are first normalized, and the structure distance between the two semi-structured texts is then computed by a formula whose terms are as follows: $\mathrm{Dist}(F_a, F_b)$ is the structure distance between semi-structured text (a) and semi-structured text (b); for text (a), the formula uses the normalized square of the i-th wavelet coefficient and the normalized position of that coefficient; for text (b), it uses the normalized square of the i-th wavelet coefficient and the normalized position of that coefficient; and m is the number of wavelet coefficients.
Whether the structure is copied is decided from the structure distance between the two semi-structured texts: if the structure distance is less than a given threshold, the structure is judged copied; otherwise it is not.
The distance value characterizes the similarity between text structures: the larger the distance, the less similar the structures, and the smaller the distance, the more similar. If the distance is less than the given threshold, the structural similarity of the two semi-structured texts exceeds the acceptable level and the structure can be judged copied. This helps inspectors detect copied text structure.
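The normalization, distance, and decision steps can be sketched as below, with strong caveats. The text here does not preserve the distance formula or the normalization rule, so this sketch swaps in a plain Euclidean distance over paired (value, position) entries and scales each component by its maximum; both choices are assumptions, all function names are hypothetical, and the sketch is not expected to reproduce the distance value of the worked example. Only the final comparison against a threshold follows the text directly.

```python
import math


def normalize(features):
    """Scale squared values and positions by their maxima (assumed rule)."""
    max_value = max(value for value, _ in features)
    max_position = max(position for _, position in features)
    return [(value / max_value, position / max_position)
            for value, position in features]


def structure_distance(features_a, features_b):
    """Euclidean distance over paired entries (assumed form of Dist)."""
    total = 0.0
    pairs = zip(normalize(features_a), normalize(features_b))
    for (value_a, pos_a), (value_b, pos_b) in pairs:
        total += (value_a - value_b) ** 2 + (pos_a - pos_b) ** 2
    return math.sqrt(total)


def is_copied(distance, threshold):
    """Decision rule from the text: copied if distance is below the threshold."""
    return distance < threshold
```

Any other distance that decreases as the feature vectors become more alike could be substituted without changing the decision rule.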
The semi-structured text to which the method applies includes Extensible Markup Language (XML) text and HyperText Markup Language (HTML) text.
Description of drawings:
Fig. 1 is a flowchart of the XML text structure copy detection method.
Fig. 2 is a schematic diagram of XML structure (a).
Fig. 3 is a schematic diagram of XML structure (b).
Embodiment:
The invention is further described below with reference to the accompanying drawings.
The invention provides a wavelet transform-based method for detecting copying of semi-structured text structure. It helps inspectors detect semi-structured texts with copied structure quickly and accurately, reduces missed and false detections, and shortens detection time, in order to combat illegal copying and plagiarism. The basic idea is as follows. First, all content is removed from the standards-compliant semi-structured text, leaving only the structural framework formed by the tags. The tags are then encoded by the tag encoding method, giving an ordered code sequence that follows the nesting and order of the tags. This sequence is treated as a time series of equally spaced points on the time axis, so that the features of this time-series signal can be extracted by wavelet analysis; the features of the time-series signal are the structural features of the semi-structured text. Finally, the similarity between the structures of semi-structured texts is measured by the similarity between their feature vectors, which determines whether the text structure is copied.
Following this idea, and with reference to the detection flow shown in Fig. 1, the method comprises at least the following steps: cleaning the semi-structured text (01), encoding the semi-structured text (02), obtaining the structural features of the semi-structured text through a wavelet transform (03), computing the structural similarity (04), and deciding whether the structure is copied (05).
The original semi-structured text is cleaned in step 01, giving a normalized semi-structured text. Step 02 encodes the structure of this text, giving a code sequence that fully represents the structure of the semi-structured text. Step 03 applies the wavelet transform to extract the structural features and compresses them into a structural feature vector. Step 04 normalizes the structural feature vectors and computes the structure distance between them, which characterizes the structural similarity. Finally, step 05 compares the result of step 04 with the given threshold to decide whether the structures of the two semi-structured texts are similar (that is, copied). If the structures are similar, the texts are judged copied (step 07); otherwise they are judged not copied (step 06).
The following is a preferred embodiment provided by the inventors.
Fig. 2 shows the pure structure of an original XML text (a) that remains after the text content is removed and tag value information is ignored. The original tag sequence of this text is:
<xml>,<book>,<title>,</title>,<author>,</author>,</book>,<book>,<title>,</title>,<author>,</author>,</book>,</xml>
Fig. 3 shows the pure structure of another original XML text (b) that remains after the text content is removed and tag value information is ignored. The original tag sequence of this text is:
<xml>,<book>,<title>,</title>,<author>,</author>,</book>,<book>,<title>,</title>,<author>,</author>,</book>,<book>,<title>,</title>,<author>,</author>,</book>,</xml>
In step 02 of Fig. 1 the XML text structure is encoded: every start tag (of the form "<>") is encoded as 1 and every end tag (of the form "</>") is encoded as -1. The code sequence of the XML text structure shown in Fig. 2 is:
Enc(a)={1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,-1}
The corresponding coded sequence of XML text structure shown in Figure 3 is:
Enc(b)={1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,-1}
In step 03 of Fig. 1, before the wavelet transform is applied to extract the structural features, the two sequences to be compared must first be made the same length. The lengths of Enc(a) and Enc(b) are aligned by appending 0s to the end of the shorter sequence, giving:
Enc(a)={1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,-1,0,0,0,0,0,0}
Enc(b)={1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,1,1,-1,1,-1,-1,-1}
A Haar wavelet transform at the maximum scale is then applied to these signals, giving their coefficient sequences:
Coef(a)={0,0,1,0,0.7071,0.7071,0,1,0,0,-1,0,0,1.4142,1.4142,-1.4142,1.4142,1.4142,0,0,0,0}
Coef(b)={0.5,-2,0.5,0,0.7071,0,0,1,0,0,0,1,0,1.4142,1.4142,-1.4142,1.4142,1.4142,-1.4142,1.4142,1.4142,0}
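The padding and the multi-level Haar transform can be sketched in pure Python. The sketch is illustrative and rests on assumptions that the text does not state: odd-length signals are extended by repeating their last sample, the number of levels is the integer part of log2 of the signal length, and the output is ordered as the final approximation followed by details from coarsest to finest. These choices were made because they give 22 coefficients for a 20-sample input, as in Coef(a) and Coef(b); the function names are hypothetical.

```python
import math


def pad_to_same_length(seq_a, seq_b):
    """Append zeros to the shorter sequence so both have equal length."""
    n = max(len(seq_a), len(seq_b))
    return (list(seq_a) + [0] * (n - len(seq_a)),
            list(seq_b) + [0] * (n - len(seq_b)))


def haar_step(signal):
    """One Haar level: returns (approximation, detail) coefficients."""
    if len(signal) % 2:                    # odd length: repeat the last sample
        signal = signal + [signal[-1]]
    root2 = math.sqrt(2)
    idx = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in idx]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in idx]
    return approx, detail


def haar_decompose(signal):
    """Multi-level Haar transform: approximation, then coarse-to-fine details."""
    levels = int(math.log2(len(signal)))   # assumed maximum level
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)          # coarser levels go first
    return approx + [c for level in details for c in level]
```

Calling `haar_decompose` on each padded sequence returns a flat list whose 1-based indices correspond to the coefficient positions used in the feature vectors.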
Compressing the wavelet coefficients then gives the structural feature vectors of the two XML texts:
F_a = [(1.7272,14),(1.7272,15),(1.7272,16),(1.7272,17),(1.7272,18),(0.81,3),(0.81,8),(0.81,11),(0.3686,5),(0.3686,6)]
F_b = [(1.6716,2),(0.5,14),(0.5,15),(0.5,16),(0.5,17),(0.5,18),(0.5,19),(0.5,20),(0.5,21),(0.0858,8)]
Finally, their structure distance is computed:
Dist(F_a, F_b) = 3.9103
Suppose the given threshold is 4. Since Dist(F_a, F_b) = 3.9103 < 4, text (a) and text (b) are judged to have copied structure.
Claims (5)
1. A wavelet transform-based method for detecting copying of semi-structured text structure, characterized by comprising the following steps: cleaning the semi-structured text, encoding the semi-structured text, obtaining the structural features of the semi-structured text through a wavelet transform, computing the structural similarity, and deciding whether the structure is copied;
Cleaning the semi-structured text removes non-standard characters from the semi-structured text and corrects unmatched tags and invalid string formats, turning the original semi-structured text into a standards-compliant semi-structured text;
Encoding the semi-structured text serializes the structure of the semi-structured text into a structure code sequence;
Obtaining the structural features of the semi-structured text through a wavelet transform means applying a wavelet transform to the structure code sequence to obtain the structural feature vector of the semi-structured text; computing the structural similarity means computing the structure distance between the feature vectors of semi-structured texts, which gives the similarity between their structures;
Deciding whether the structure is copied means judging from the structure distance between the semi-structured text structures whether two semi-structured texts have copied structure: if the structure distance is less than a given threshold, the structure is copied; otherwise it is not.
2. The copy detection method according to claim 1, characterized in that: when the semi-structured text is encoded, every tag in the semi-structured text is paired, consisting of a start tag and an end tag; every start tag is encoded as 1 and every end tag as -1; taking the codes in the order in which the tags appear in the text gives a text structure code sequence, which represents the structural features of the text; during encoding the content of the semi-structured text is discarded and only the tags are kept.
3. The copy detection method according to claim 1, characterized in that, when the structural features of the semi-structured text are obtained through the wavelet transform:
(1) the position at which each tag appears is taken as the time coordinate, which gives a square-wave signal representation of the text structure code sequence;
(2) a wavelet transform with the Haar wavelet is applied to the square-wave signal of the text structure code sequence to obtain the corresponding wavelet coefficient vector;
(3) the wavelet coefficient vector is compressed: the absolute value of each wavelet coefficient is compared with a specified threshold; points less than or equal to the threshold become 0, and points greater than the threshold become the difference between the point's value and the threshold, which gives a sparse coefficient sequence filled with 0 values;
(4) the non-zero coefficients are squared and sorted in descending order, the m largest values are taken, and the position of each of these coefficients is recorded, which gives a two-dimensional structural feature vector, i.e. the structural feature of the semi-structured text.
4. The copy detection method according to claim 1, characterized in that: when the structural similarity is computed, the structural feature vectors are first normalized, and the structure distance between the two semi-structured texts is then computed by a formula whose terms are as follows: $\mathrm{Dist}(F_a, F_b)$ is the structure distance between semi-structured text a and semi-structured text b; for text a, the formula uses the normalized square of the i-th wavelet coefficient and the normalized position of that coefficient; for text b, it uses the normalized square of the i-th wavelet coefficient and the normalized position of that coefficient; and m is the number of wavelet coefficients.
5. The copy detection method according to claim 1, characterized in that: the semi-structured text includes Extensible Markup Language (XML) text and HyperText Markup Language (HTML) text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110316054 CN102508879B (en) | 2011-10-18 | 2011-10-18 | Wavelet transform-based method for detecting copying of semi-structured text structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110316054 CN102508879B (en) | 2011-10-18 | 2011-10-18 | Wavelet transform-based method for detecting copying of semi-structured text structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102508879A true CN102508879A (en) | 2012-06-20 |
CN102508879B CN102508879B (en) | 2013-07-31 |
Family
ID=46220965
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201110316054 Expired - Fee Related CN102508879B (en) | 2011-10-18 | 2011-10-18 | Wavelet transform-based method for detecting copying of semi-structured text structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102508879B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644104A (en) * | 2017-10-17 | 2018-01-30 | 北京锐安科技有限公司 | A kind of text feature and system |
Non-Patent Citations (5)
Title |
---|
JULINDA GLLAVATA等: "《Text Detection in Images Based on Unsupervised Classification of High-Frequency Wavelet Coefficients》", 《PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION》 * |
JUN PENG BAO等: "《Copy detection in Chinese documents using Ferret》", 《LANG RESOURCES & EVALUATION》 * |
JUNPENG BAO等: "《Comparing Different Text Similarity Methods》", 《UH COMPUTER SCIENCE TECHNICAL REPORT》 * |
周治平等: "《基于小波和不变矩的图像copy-move篡改盲检测》", 《信息网络安全》 * |
鲍军鹏等: "《自然语言文档复制检测研究综述》", 《软件学报》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644104A (en) * | 2017-10-17 | 2018-01-30 | 北京锐安科技有限公司 | A kind of text feature and system |
CN107644104B (en) * | 2017-10-17 | 2021-06-25 | 北京锐安科技有限公司 | Text feature extraction method and system |
Also Published As
Publication number | Publication date |
---|---|
CN102508879B (en) | 2013-07-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130731; termination date: 20171018 |