CN102508879A

CN102508879A - Wavelet transform-based method for detecting copying of semi-structured text structure

Info

Publication number: CN102508879A
Application number: CN2011103160545A
Authority: CN
Inventors: 鲍军鹏; 苏杰
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2011-10-18
Filing date: 2011-10-18
Publication date: 2012-06-20
Anticipated expiration: 2031-10-18
Also published as: CN102508879B

Abstract

The invention provides a wavelet-transform-based method for detecting the copying of semi-structured text structure. Its purpose is to detect copying and plagiarism of text structure, to help inspectors complete the whole inspection quickly and correctly, to reduce the risk of false and missed detections, and to shorten detection time. The method comprises at least the steps of cleaning the semi-structured text, encoding the semi-structured text, obtaining the structural features of the semi-structured text through wavelet transform, calculating structural similarity, and judging whether the structures are similar. Like the Fourier-based time-series approach, this method converts semi-structured text into a time series, but it uses the wavelet transform to obtain the structural features. This yields better local structural features and details, and thus improves the accuracy with which locally similar structures are detected.

Description

A method of detecting duplication of semi-structured text structure based on wavelet transform

Technical field:

The invention belongs to the field of intelligent information processing and computer technology, and specifically relates to an accurate and effective method for detecting the copying and plagiarism of semi-structured text structure.

Background:

The Internet is developing rapidly, and massive numbers of web pages appear every day. Most of these pages, and the information behind them, are stored as HyperText Markup Language (HTML) text or eXtensible Markup Language (XML) text. Both HTML text and XML text are semi-structured text. Protecting the intellectual property of semi-structured electronic texts in a networked environment, and in particular combating illegal copying, plagiarism and other misconduct, has become a consensus at home and abroad and is a pressing problem. Copying and plagiarism take complex and varied forms, so attending only to global structural similarity often lowers the accuracy of copy detection: many copied texts do not reproduce the original wholesale, but copy and splice parts of it. Local information must therefore be compared, so as to reduce the miss rate for locally similar structures and improve detection accuracy.

Current methods for detecting the copying of semi-structured text fall into three main classes: kernel-matrix methods, tree-edit-distance methods, and time-series methods based on the Fourier transform. The kernel-matrix method measures text similarity using a matrix M, which describes the correlation between structural units of the text and their contribution to the similarity measure, together with the matrix of the text in the SLVM model space. The tree-edit-distance method measures similarity as the minimum cost of transforming one text into another. Its obvious drawback is computational expense: its time complexity is O(N²), where N is the number of elements (that is, tags) in the text, so it is unsuitable for large texts. The Fourier-based time-series method converts the semi-structured text into a time series, obtains time-series features through the Fourier transform, and then performs copy detection according to the similarity of the feature vectors. The Fourier transform, however, reflects the overall frequency characteristics of a signal over all time and cannot provide frequency characteristics localized in time. A Fourier-based method therefore cannot observe the local features and details of semi-structured text, and its examination of similarity is not fine-grained enough.

Summary of the invention:

In view of the above problems, the invention provides a wavelet-transform-based method for detecting the copying of semi-structured text structure. This method also converts semi-structured text into a time series, but it uses the wavelet transform to obtain the structural features. This yields better local structural features and details, and thus improves the accuracy with which locally similar structures are detected.

The purpose of the method is to detect copying and plagiarism of text structure, to help inspectors complete the whole inspection quickly and correctly, to reduce the risk of false and missed detections, and to shorten the overall detection time.

To achieve this purpose, the method comprises at least the following steps: cleaning the semi-structured text, encoding the semi-structured text, obtaining the structural features of the semi-structured text through wavelet transform, calculating structural similarity, and judging whether the structures are similar. Cleaning removes non-standard characters from the semi-structured text and corrects unmatched tags and invalid string formats, turning the original semi-structured text into well-formed semi-structured text. Encoding serializes the structure of the semi-structured text into a structure code sequence. Obtaining the structural features through wavelet transform means applying a wavelet transform to the structure code sequence to obtain the structural feature vector of the semi-structured text. Calculating structural similarity means computing the structural distance between the feature vectors of the semi-structured texts, which gives the similarity between their structures. Judging whether the structures are similar means deciding, from this structural distance, whether two semi-structured texts are structurally similar: if the structural distance is less than a given threshold they are similar, otherwise they are not.

Cleaning the semi-structured text removes non-standard characters (such as &, <, >) from the original semi-structured text, corrects unmatched tags (for example, the tag <img> has no end tag and should be changed to <img/>), and corrects invalid string formats (for example, in a=0 the value 0 is an invalid string and should be changed to a="0"), so that the original semi-structured text becomes well-formed semi-structured text.

In encoding the semi-structured text, every tag in the text is paired, consisting of a start tag and an end tag. Every start tag is encoded as 1 and every end tag as -1, and a text structure code sequence is obtained by following the order in which the tags appear in the text; this sequence represents the structural features of the text. The content of the semi-structured text is removed during encoding, and only the tags are kept.
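The encoding step can be sketched in Python as follows. This is a minimal illustration rather than the inventors' implementation: it assumes the input has already been cleaned into well-formed XML, it treats a self-closing tag such as `<b/>` as a start tag immediately followed by an end tag, and the function name `encode_structure` and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def encode_structure(xml_text):
    """Return the +1/-1 structure code sequence of a well-formed XML string.

    Start tags map to 1 and end tags to -1, in document order.
    Text content and attribute values are ignored.
    """
    codes = []
    stream = io.BytesIO(xml_text.encode("utf-8"))
    # iterparse reports "start" and "end" events in document order;
    # a self-closing element yields a start event followed by an end event.
    for event, _elem in ET.iterparse(stream, events=("start", "end")):
        codes.append(1 if event == "start" else -1)
    return codes


# Hypothetical document: a root with two identical two-child subtrees.
doc = "<r><s><a>text</a><b/></s><s><a/><b/></s></r>"
print(encode_structure(doc))
# [1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1]
```

For this sample document the result coincides with the sequence Enc(a) given in the embodiment, although the tag names in the patent's figures are not known.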

The structural features of the semi-structured text are obtained through wavelet transform as follows:

(1) Taking the position at which each tag occurs as the time coordinate, represent the text structure code sequence as a square-wave signal;

(2) Apply the Haar wavelet transform to the square-wave signal of the text structure code sequence to obtain the corresponding wavelet coefficient vector;

(3) Compress the wavelet coefficient vector: compare the absolute value of each wavelet coefficient with a specified threshold; a point whose absolute value is less than or equal to the threshold becomes 0, and a point whose absolute value is greater than the threshold becomes the difference between the point's value and the threshold, giving a sparse coefficient sequence filled with zeros;

(4) Square the non-zero coefficients, sort them in descending order, take the m largest values, and record the position of each of these coefficients, giving a two-dimensional structural feature vector, that is, the structural features of the semi-structured text, as follows:

$F^{a} = [(e_{1}^{a}, n_{1}^{a}), (e_{2}^{a}, n_{2}^{a}), \ldots, (e_{m}^{a}, n_{m}^{a})]$

where $F^{a}$ denotes the structural feature vector of semi-structured text (a), $e_{m}^{a}$ denotes the square of the m-th coefficient of semi-structured text (a) after wavelet transform, and $n_{m}^{a}$ denotes the position of that coefficient.
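Steps (2) to (4) can be sketched in Python as below. Several choices here are assumptions, not taken from the patent: the signal is zero-padded to the next power of two so that a full-depth orthonormal Haar decomposition is defined, coefficients are ordered as approximation first and finest details last, positions are 1-based, and ties in the sort are broken by position. The function names are hypothetical. Because the patent does not state its boundary handling, this sketch is not expected to reproduce the coefficient values of the worked embodiment.

```python
import math


def haar_dwt(signal):
    """Full-depth orthonormal Haar decomposition.

    Returns [approximation, coarsest details, ..., finest details].
    The input is zero-padded to the next power of two (an assumption).
    """
    n = 1
    while n < len(signal):
        n *= 2
    approx = [float(v) for v in signal] + [0.0] * (n - len(signal))
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(x - y) / math.sqrt(2) for x, y in pairs]
        approx = [(x + y) / math.sqrt(2) for x, y in pairs]
        details = detail + details  # coarser levels go in front
    return approx + details


def soft_threshold(coeffs, t):
    """Step (3): zero small coefficients, shrink the rest toward zero by t."""
    out = []
    for c in coeffs:
        mag = abs(c) - t
        out.append(math.copysign(mag, c) if mag > 0 else 0.0)
    return out


def structure_features(codes, t, m):
    """Step (4): the m largest squared coefficients with 1-based positions."""
    shrunk = soft_threshold(haar_dwt(codes), t)
    pairs = [(c * c, pos) for pos, c in enumerate(shrunk, start=1) if c != 0.0]
    pairs.sort(key=lambda p: (-p[0], p[1]))
    return pairs[:m]


features = structure_features([1, 1, 1, -1, 1, -1, -1, 1], t=0.1, m=3)
print(features)
```

Each returned pair has the form (squared coefficient, position), matching the layout of the feature vector described above.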

Structural similarity is calculated by first normalizing the structural feature vectors and then computing the structural distance between two semi-structured texts according to the following formula:

$Dist(F^{a}, F^{b}) = \frac{1}{2} \sum_{i=1}^{m} \sqrt{(\tilde{e}_{i}^{a} - \tilde{e}_{i}^{b})^{2}} + \frac{1}{2} \sum_{i=1}^{m} \sqrt{(\tilde{n}_{i}^{a} - \tilde{n}_{i}^{b})^{2}}$

where $Dist(F^{a}, F^{b})$ denotes the structural distance between semi-structured text (a) and semi-structured text (b); $\tilde{e}_{i}^{a}$ denotes the normalized value of the i-th squared wavelet coefficient of semi-structured text (a), and $\tilde{n}_{i}^{a}$ denotes the normalized value of the position of that coefficient; $\tilde{e}_{i}^{b}$ denotes the normalized value of the i-th squared wavelet coefficient of semi-structured text (b), and $\tilde{n}_{i}^{b}$ denotes the normalized value of the position of that coefficient; and m denotes the number of wavelet coefficients.
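The distance formula can be sketched in Python as follows. Since the square root of a square is an absolute value, each term reduces to a sum of absolute differences. The patent does not say how the feature vectors are normalized, so the min-max scaling used here is an assumption, as are the function names; with a different normalization the numeric results will differ.

```python
def normalize(values):
    """Min-max scale a list to [0, 1] (an assumed normalization)."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def structural_distance(fa, fb):
    """Distance between two feature vectors of (squared coeff, position) pairs."""
    if len(fa) != len(fb):
        raise ValueError("feature vectors must have the same length m")
    ea = normalize([e for e, _ in fa])
    na = normalize([n for _, n in fa])
    eb = normalize([e for e, _ in fb])
    nb = normalize([n for _, n in fb])
    # sqrt((x - y)^2) is |x - y|, so each sum is an L1 distance.
    energy_term = sum(abs(x - y) for x, y in zip(ea, eb))
    position_term = sum(abs(x - y) for x, y in zip(na, nb))
    return 0.5 * energy_term + 0.5 * position_term


# Same coefficient energies, but at swapped positions.
print(structural_distance([(4.0, 1), (0.0, 3)], [(4.0, 3), (0.0, 1)]))  # 1.0
```

The two halves of the formula weight coefficient magnitude and coefficient position equally, so two texts with the same energies at different positions still receive a non-zero distance.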

Whether the structures are similar is judged from the structural distance between the two semi-structured texts: if the structural distance is less than a given threshold the texts are structurally similar, otherwise they are not.

The distance value characterizes the similarity between text structures: the larger the distance, the less similar the structures, and the smaller the distance, the more similar they are. If the distance is less than the given threshold, the structural similarity of the two semi-structured texts exceeds the permitted level and they can be judged structurally similar. This helps inspectors detect similar text structures.

The semi-structured text referred to in the method includes eXtensible Markup Language (XML) text and HyperText Markup Language (HTML) text.

Description of drawings:

FIG. 1 is a flow chart of the XML text structure copy detection method.

FIG. 2 is schematic diagram (a) of an XML structure.

FIG. 3 is schematic diagram (b) of an XML structure.

Detailed description:

The invention is further described below with reference to the accompanying drawings.

The invention provides a wavelet-transform-based method for detecting the copying of semi-structured text structure. It helps inspectors detect structurally similar semi-structured texts quickly and accurately, reduces missed and false detections, and shortens detection time, thereby serving to combat illegal copying, plagiarism and similar acts. The basic idea is as follows. First, all content is removed from the well-formed semi-structured text, leaving only the structural skeleton formed by the tags. The tags are then encoded by the tag encoding method, and an ordered code sequence is obtained according to the nesting and order of the tags. This sequence is treated as a time series of equally spaced points on the time axis, so that the wavelet transform can be used to extract the features of this time-series signal; the features of the time-series signal are the structural features of the semi-structured text. Finally, the similarity between semi-structured text structures is measured by computing the similarity between feature vectors, and this determines whether the structures are similar.

Following this idea, and with reference to the detection flow shown in FIG. 1, the method comprises at least the steps of cleaning the semi-structured text (01), encoding the semi-structured text (02), obtaining the structural features of the semi-structured text through wavelet transform (03), calculating structural similarity (04), and judging whether the structures are similar (05).

The original semi-structured text is cleaned in step 01 to give well-formed semi-structured text. In step 02 the text structure is encoded, giving a code sequence that fully represents the structure of the semi-structured text. In step 03 the wavelet transform is applied to extract structural features, which are compressed into a structural feature vector. In step 04 the structural feature vectors are normalized and the structural distance between them is calculated to characterize structural similarity. Finally, in step 05, the result of step 04 is compared with a given threshold to judge whether the structures of the two semi-structured texts are similar. If they are, the texts are judged similar (step 07); otherwise they are judged not similar (step 06).

The following is a preferred embodiment given by the inventors.

FIG. 2 shows the pure structure of an original XML text (a) that remains after the text content is removed and tag value information is ignored. The original tag sequence of this text is:

FIG. 3 shows the pure structure of another original XML text (b) that remains after the text content is removed and tag value information is ignored. The original tag sequence of this text is:

According to step 02 of FIG. 1, the XML text structure is encoded: every start tag (of the form "<>") is encoded as 1 and every end tag (of the form "</>") as -1. The code sequence corresponding to the XML text structure shown in FIG. 2 is:

Enc(a) = {1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1}

The code sequence corresponding to the XML text structure shown in FIG. 3 is:

Enc(b) = {1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1}

According to step 03 of FIG. 1, before the wavelet transform is applied to extract structural features, the two sequences to be compared are first brought to the same length. Enc(a) has 14 elements and Enc(b) has 20, so the shorter sequence is padded with zeros at its end, giving:

Enc(a) = {1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, 0, 0, 0, 0, 0, 0}
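The padding step can be sketched in Python as follows, using the two code sequences from this embodiment. The function name `pad_to_common_length` is hypothetical.

```python
def pad_to_common_length(seq_a, seq_b):
    """Append zeros to the shorter sequence so both have the same length."""
    n = max(len(seq_a), len(seq_b))
    padded_a = list(seq_a) + [0] * (n - len(seq_a))
    padded_b = list(seq_b) + [0] * (n - len(seq_b))
    return padded_a, padded_b


enc_a = [1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1]
enc_b = [1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, -1, -1,
         1, 1, -1, 1, -1, -1, -1]

padded_a, padded_b = pad_to_common_length(enc_a, enc_b)
print(len(padded_a), padded_a[-6:])  # 20 [0, 0, 0, 0, 0, 0]
```

Zero is a natural padding value here because it is neither a start code (1) nor an end code (-1), so the padded tail adds no tag events to the signal.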

The Haar wavelet is then used to transform these signals at the maximum scale, giving their coefficient sequences:

Coef(a) = {0, 0, 1, 0, 0.7071, 0.7071, 0, 1, 0, 0, -1, 0, 0, 1.4142, 1.4142, -1.4142, 1.4142, 1.4142, 0, 0, 0, 0}

Coef(b) = {0.5, -2, 0.5, 0, 0.7071, 0, 0, 1, 0, 0, 0, 1, 0, 1.4142, 1.4142, -1.4142, 1.4142, 1.4142, -1.4142, 1.4142, 1.4142, 0}

Compressing the wavelet coefficients then gives the structural feature vectors of the two XML texts:

F^a = [(1.7272, 14), (1.7272, 15), (1.7272, 16), (1.7272, 17), (1.7272, 18), (0.81, 3), (0.81, 8), (0.81, 11), (0.3686, 5), (0.3686, 6)]

F^b = [(1.6716, 2), (0.5, 14), (0.5, 15), (0.5, 16), (0.5, 17), (0.5, 18), (0.5, 19), (0.5, 20), (0.5, 21), (0.0858, 8)]

Finally, the structural distance is calculated as:

Dist(F^a, F^b) = 3.9103

Suppose the given threshold is 4. Since Dist(F^a, F^b) = 3.9103 < 4, text (a) and text (b) are judged to be structurally similar.
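The final decision is a single strict comparison, sketched below in Python with the distance and threshold from this embodiment. The function name `is_structurally_similar` is hypothetical.

```python
def is_structurally_similar(distance, threshold):
    """Texts are judged similar only if the distance is strictly below the threshold."""
    return distance < threshold


print(is_structurally_similar(3.9103, 4))  # True
```

Because the rule uses "less than", a distance exactly equal to the threshold is judged not similar.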

Claims

1. A wavelet-transform-based method for detecting the copying of semi-structured text structure, characterized in that it comprises the following steps: cleaning the semi-structured text, encoding the semi-structured text, obtaining the structural features of the semi-structured text through wavelet transform, calculating structural similarity, and judging whether the structures are similar;

Cleaning the semi-structured text removes non-standard characters from the semi-structured text and corrects unmatched tags and invalid string formats, turning the original semi-structured text into well-formed semi-structured text;

Encoding the semi-structured text serializes the structure of the semi-structured text to obtain a structure code sequence;

Obtaining the structural features of the semi-structured text through wavelet transform means applying a wavelet transform to the structure code sequence to obtain the structural feature vector of the semi-structured text; calculating structural similarity means calculating the structural distance between the feature vectors of the semi-structured texts, thereby obtaining the similarity between the semi-structured text structures;

Judging whether the structures are similar means determining, from the structural distance between the semi-structured text structures, whether two semi-structured texts are structurally similar: if the structural distance is less than a given threshold they are similar, otherwise they are not.

2. The copy detection method according to claim 1, characterized in that: when the semi-structured text is encoded, every tag in the semi-structured text is paired, consisting of a start tag and an end tag; every start tag is encoded as 1 and every end tag as -1, and a text structure code sequence is obtained according to the order in which the tags appear in the text, this sequence representing the structural features of the text; the content of the semi-structured text is removed during encoding, and only the corresponding tags are kept.

3. The copy detection method according to claim 1, characterized in that, when the structural features of the semi-structured text are obtained through wavelet transform:

(1) The position at which each tag occurs is taken as the time coordinate, giving a square-wave signal representation of the text structure code sequence;

(2) The Haar wavelet transform is applied to the square-wave signal of the text structure code sequence to obtain the corresponding wavelet coefficient vector;

(3) The wavelet coefficient vector is compressed: the absolute value of each wavelet coefficient is compared with a specified threshold; a point less than or equal to the threshold becomes 0, and a point greater than the threshold becomes the difference between the point's value and the threshold, giving a sparse coefficient sequence filled with zeros;

(4) The non-zero coefficients are squared and sorted in descending order, the m largest values are taken, and the position of each of these coefficients is recorded, giving a two-dimensional structural feature vector, that is, the structural features of the semi-structured text.

4. The copy detection method according to claim 1, characterized in that: when structural similarity is calculated, the structural feature vectors are first normalized, and the structural distance between two semi-structured texts is then calculated according to the following formula:

$Dist(F^{a}, F^{b}) = \frac{1}{2} \sum_{i=1}^{m} \sqrt{(\tilde{e}_{i}^{a} - \tilde{e}_{i}^{b})^{2}} + \frac{1}{2} \sum_{i=1}^{m} \sqrt{(\tilde{n}_{i}^{a} - \tilde{n}_{i}^{b})^{2}}$

where $Dist(F^{a}, F^{b})$ denotes the structural distance between semi-structured text a and semi-structured text b, $\tilde{e}_{i}^{a}$ denotes the normalized value of the i-th squared wavelet coefficient of semi-structured text a, $\tilde{n}_{i}^{a}$ denotes the normalized value of the position of that coefficient, and $\tilde{e}_{i}^{b}$ denotes the normalized value of the i-th squared wavelet coefficient of semi-structured text b.

5. The copy detection method according to claim 1, characterized in that: the semi-structured text includes eXtensible Markup Language (XML) text and HyperText Markup Language (HTML) text.