CN108764352B

CN108764352B - Method and device for detecting repeated page content

Info

Publication number: CN108764352B
Application number: CN201810545595.7A
Authority: CN
Inventors: 柏馨; 张婷; 崔一; 项金鑫; 尹飞; 刘盼盼; 薛大伟; 魏晨辉; 邢潘红
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2018-05-25
Filing date: 2018-05-25
Publication date: 2022-09-27
Anticipated expiration: 2038-05-25
Also published as: CN108764352A

Abstract

The invention provides a method and a device for detecting repeated page content, wherein the method comprises the following steps: performing interface screenshot on a page to be detected to obtain an image of the page; identifying a segmentation region from the image according to the gray information of the image; segmenting the image according to the segmentation area to obtain a plurality of image blocks; clustering is carried out according to the image characteristics of the image blocks, and the similarity degree of the image blocks belonging to the same cluster is determined; and determining whether the repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster. According to the method, the image obtained by the interface screenshot is segmented through an image technology, so that the image blocks are compared, repeated detection of the page content can be completed, the technical problems that in the prior art, automatic verification needs to prepare a large number of additional reference objects, and the testing cost is high are solved, and the efficiency of detecting the repeated page content is improved.

Description

Method and device for detecting repeated page content

Technical Field

The invention relates to the technical field of mobile terminals, in particular to a method and a device for detecting repeated page content.

Background

Repeated page content can occur in front-end pages such as applications (APPs) and web pages, so testing for repeated page content has become an important test step. In actual use, repeated page content not only seriously degrades the user experience but also wastes network resources.

In the prior art, repeated page content is still mostly detected manually, so the detection accuracy depends on the tester's experience. Automatic verification techniques have also been disclosed, but they require a reference object to be prepared in advance so that the page content can be compared against it. This approach is constrained by the reference object and is weak at finding repeated page content; improving its accuracy requires preparing a large number of additional reference objects, which makes testing costly. In summary, repeated page content detection in the prior art is inefficient.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

To this end, the invention provides a method for detecting repeated page content that completes the detection efficiently. The method solves the technical problems in the prior art that manual detection depends on manual experience and that automatic verification requires a large number of additional reference objects, which makes testing costly, and it improves the efficiency of repeated page content detection.

The invention provides a device for detecting repeated page content.

The invention provides a computer device.

The invention provides a computer readable storage medium.

An embodiment of a first aspect of the present invention provides a method for detecting content of a duplicate page, including:

carrying out interface screenshot on a page to be detected to obtain an image of the page;

identifying a segmentation region from the image according to the gray information of the image;

segmenting the image according to the segmentation area to obtain a plurality of image blocks;

clustering is carried out according to the image characteristics of the image blocks, and the similarity degree of the image blocks belonging to the same cluster is determined;

and determining whether the repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster.

According to the method for detecting the content of the repeated page, disclosed by the embodiment of the invention, the image of the page is obtained by carrying out interface screenshot on the page to be detected; identifying a segmentation region from the image according to the gray information of the image; segmenting the image according to the segmentation area to obtain a plurality of image blocks; clustering is carried out according to the image characteristics of the image blocks, and the similarity degree of the image blocks belonging to the same cluster is determined; and determining whether the repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster. According to the method, the image obtained by the interface screenshot is segmented through an image technology, so that each image block is compared, detection of the content of the repeated page can be completed without additionally preparing a reference object, the technical problems that manual detection depends on manual experience, a large number of reference objects are additionally prepared for automatic verification in the prior art, and the testing cost is high are solved, and the efficiency of content detection of the repeated page is improved.

An embodiment of a second aspect of the present invention provides an apparatus for detecting content of a duplicate page, including:

the acquisition module is used for carrying out interface screenshot on a page to be detected to obtain an image of the page;

the identification module is used for identifying a segmentation region from the image according to the gray information of the image;

the segmentation module is used for segmenting the image according to the segmentation region to obtain a plurality of image blocks;

the clustering module is used for clustering according to the image characteristics of the image blocks and determining the similarity degree of the image blocks belonging to the same cluster;

and the detection module is used for determining whether the repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster.

According to the repeated page content detection device, the interface screenshot is carried out on the page to be detected, so that the image of the page is obtained; identifying a segmentation region from the image according to the gray information of the image; segmenting the image according to the segmentation area to obtain a plurality of image blocks; clustering is carried out according to the image characteristics of the image blocks, and the similarity degree of the image blocks belonging to the same cluster is determined; and determining whether the repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster. According to the method, the image obtained by the interface screenshot is segmented through an image technology, so that all the image blocks are compared, the detection of the content of the repeated page can be completed without additionally preparing a reference object, the technical problems that in the prior art, manual detection depends on manual experience and automatic verification needs to additionally prepare a large number of reference objects, the test cost is high are solved, and the efficiency of detecting the content of the repeated page is improved.

An embodiment of a third aspect of the present invention provides a computer device, including: a processor; a memory for storing the processor-executable instructions; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, and is used for executing the method for detecting the content of the duplicate page described in the embodiment of the first aspect.

A fourth aspect of the present invention provides a computer-readable storage medium, where instructions of the storage medium, when executed by a processor, are configured to perform the method for detecting content of a duplicate page described in the first aspect.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flowchart of a method for detecting repeated page content according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a method for detecting repeated page content according to a second embodiment of the present invention;

FIG. 3 is a flowchart illustrating another method for detecting repeated page content according to an embodiment of the present invention;

FIG. 4 is an exemplary diagram of a valid problem found in a mobile product using the repeated page content detection method of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus for detecting repeated page content according to an embodiment of the present invention; and

FIG. 6 illustrates a block diagram of an exemplary computer device suitable for use to implement embodiments of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.

At present, mobile terminal devices are used ever more widely in daily life, and users may find repeated page content in front-end pages such as APPs and web pages. This leads to a poor user experience, so testing for repeated page content has become an important test step. In the prior art, repeated page content is detected by reference-based image comparison: precise reference screenshots are captured manually for each step of a preset automation path, and the reference screenshot of each step must be strictly consistent with the corresponding step during automated execution. In actual detection, however, this traditional technique has obvious limitations and cannot find repeated page content effectively.

The embodiments of the invention address the problems that the prior-art techniques for detecting repeated page content are limited and that their detection results are unsatisfactory. In the embodiments, detection of repeated page content is completed through image techniques, and no additional reference object is needed during detection. After the mobile phone screenshot is obtained, the page is first segmented to obtain a plurality of image blocks; the image blocks are then clustered according to their image features; the similarity degree of the image blocks belonging to the same cluster is determined; and finally it is determined whether repeated page content exists. For ease of understanding, two terms used in the present invention are introduced first:

First similarity: the picture-similarity threshold used when determining that repeated page content exists between two image blocks.

Second similarity: the text-similarity threshold used when determining that repeated page content exists between two image blocks.

The method and apparatus of embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a method for detecting content of a duplicate page according to an embodiment of the present invention.

As shown in fig. 1, the method for detecting the content of the repeated page includes the following steps:

Step 101, performing interface screenshot on a page to be detected to obtain an image of the page.

Specifically, before the page content is detected, an image of the page to be detected is obtained. In this embodiment, an interface screenshot of the page to be detected is taken through the mobile device; mobile devices running different operating systems take screenshots through different operations, and the image of the page to be detected is finally obtained.

Step 102, identifying a segmentation region from the image according to the gray information of the image.

Gray scale means that an image is displayed in tones of black: black is the reference color, and the object is represented by black of different saturations. Each gray object has a value from 0% (white) to 100% (black) on the gray bar. The gray information of the image refers to the gray value of each pixel point in the image obtained after the image is grayed. A pixel is the smallest unit of an image; an image is composed of a plurality of pixel points arranged in an array.

For illustration, each pixel in a color image has three color components, R, G, and B, and each component can take 256 values (0 to 255), so a pixel can take more than 16 million colors (256 × 256 × 256 = 16,777,216). A gray image is a special color image whose R, G, and B components are equal, so one pixel point can take only 256 values. In digital image processing, images of various formats are therefore generally converted into gray images to reduce the amount of subsequent computation. Like a color image, a gray image still reflects the global and local distribution and characteristics of the chrominance and luminance levels of the whole image.

Since the image of the page to be detected is a color image in this embodiment, the image is grayed after it is obtained so as to obtain the gray information of the image, and the segmentation region is then identified from the image. Image graying is the process of converting a color image into a gray image.
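The graying step can be sketched as follows. The source does not state how the gray value is computed from R, G, and B, so the weights below (the ITU-R BT.601 luma coefficients) are an assumption, as are the function name `to_gray` and the representation of an image as rows of (R, G, B) tuples.

```python
def to_gray(image):
    """Convert an RGB image (rows of (R, G, B) tuples, 0-255) to gray values."""
    # Assumed weighting: ITU-R BT.601 luma coefficients. The source text does
    # not specify which graying formula is used.
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


# A tiny illustrative "page": one white row, then a red and a blue pixel.
page = [
    [(255, 255, 255), (255, 255, 255)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = to_gray(page)
```

After this step each pixel is a single number in the range 0 to 255, which is the form the later row and column search operates on.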

Optionally, a specific method for identifying a segmentation region from an image of a page to be detected is as follows: searching along the row or the column of each pixel point array to obtain at least one row of pixel points or at least one column of pixel points with the same gray scale, taking the searched at least one row of pixel points or at least one column of pixel points as a segmentation area, and merging adjacent segmentation areas.
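As a minimal sketch of this search, the snippet below scans the rows of a gray image, treats every row whose pixels all share one gray value as part of a segmentation area, and merges rows that are directly adjacent. The function name `find_separator_bands`, the list-of-rows representation, and the choice to merge purely on adjacency are assumptions for illustration.

```python
def find_separator_bands(gray):
    """Return (start, end) row ranges, end exclusive, of merged uniform rows."""
    bands = []
    for y, row in enumerate(gray):
        if len(set(row)) != 1:
            continue  # the row contains more than one gray value
        if bands and bands[-1][1] == y:
            # This uniform row touches the previous band: merge them.
            bands[-1] = (bands[-1][0], y + 1)
        else:
            bands.append((y, y + 1))
    return bands


gray_rows = [
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [10, 200, 30, 40],
    [12, 180, 33, 41],
    [255, 255, 255, 255],
]
bands = find_separator_bands(gray_rows)
```

The same function can be applied to columns by transposing the image first, for example with `[list(col) for col in zip(*gray_rows)]`.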

Step 103, segmenting the image according to the segmentation areas to obtain a plurality of image blocks.

Specifically, image segmentation is the technique and process of dividing an image into several specific regions with unique properties and extracting the objects of interest. At present, the main image segmentation methods include threshold-based segmentation, region-based segmentation, and edge-based segmentation. This embodiment adopts segmentation based on edge detection, whose basic idea is to determine the edge pixels in an image and then connect those pixels to form the required region boundaries. The segmentation problem is solved by detecting the edges that bound different regions: a place where the gray level or structure changes abruptly indicates the end of one region and the start of another, and such a discontinuity is called an edge. Different regions have different gray levels and their boundaries generally show obvious edges, so this characteristic can be used to segment the image.

As a possible implementation manner, the image to be detected may be segmented along the edge of the segmented area, so as to obtain a plurality of image blocks. In particular, the gray values of the pixels at the edges in the image are discontinuous, which discontinuity can be detected by taking the derivative.
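Detecting the discontinuity "by taking the derivative" can be illustrated on a single line of gray values, where the first-order difference between neighboring pixels plays the role of the derivative. This is a sketch only: the function name `edge_positions` and the jump threshold `min_jump` are hypothetical, and the source does not say which derivative operator is used.

```python
def edge_positions(values, min_jump):
    """Indices i where |values[i + 1] - values[i]| is at least min_jump."""
    # The first-order difference is the discrete analogue of the derivative.
    diffs = [b - a for a, b in zip(values, values[1:])]
    return [i for i, d in enumerate(diffs) if abs(d) >= min_jump]


# Gray values down one column: light background, a dark block, background again.
column = [255, 255, 254, 40, 42, 41, 255, 255]
edges = edge_positions(column, min_jump=100)
```

Small fluctuations inside a region (254 to 255, 40 to 42) stay below the threshold, while the two abrupt changes at the borders of the dark block are reported as edges.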

As another possible implementation manner, a segmentation line may also be determined inside the segmented region, and the image to be detected is segmented along the segmentation line to obtain a plurality of image blocks.
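One way to picture this second variant is to place the segmentation line at the middle row of each segmentation region and cut the page there. The source does not say where inside the region the line lies, so the midline is an assumption, as are the function name `split_rows` and the (start, end) representation of a region with an exclusive end.

```python
def split_rows(height, bands):
    """Cut rows [0, height) at the midline of each band.

    Returns a list of (top, bottom) row ranges, bottom exclusive.
    """
    cuts = [0] + [(start + end) // 2 for start, end in bands] + [height]
    return [(top, bottom) for top, bottom in zip(cuts, cuts[1:]) if bottom > top]


# Three horizontal segmentation regions on a page that is 100 rows tall.
tiles = split_rows(height=100, bands=[(0, 4), (40, 44), (96, 100)])
```

The thin strips produced at the top and bottom margins are the kind of image block that a later area-ratio filter would discard.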

It should be noted that, after the image to be detected is segmented into a plurality of image blocks, the area ratio of each image block in the image is determined, and image blocks whose area ratio is below the threshold ratio are deleted. These image blocks are deleted because blocks with too small an area ratio have little influence on the clustering of the image blocks.

Step 104, clustering according to the image characteristics of the image blocks, and determining the similarity of the image blocks belonging to the same cluster.

It should be noted that, after the image to be detected is segmented to obtain a plurality of image blocks, for each image block, color space conversion is performed first, the RGB space is converted into HSV space, then color feature extraction is performed on the image block after color space conversion, and finally the extracted color feature and the area ratio of the image block in the image are used as image features. And clustering and grouping the image blocks according to the image characteristics of the image blocks, wherein the image blocks in the same group have similar image characteristics.
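The grouping step can be sketched with a simple greedy clustering over feature vectors. The source does not name a clustering algorithm, so the "leader" scheme below, the Euclidean distance, the threshold `max_distance`, and the four-component feature layout are all assumptions chosen only to make the idea concrete.

```python
import math


def cluster_tiles(features, max_distance):
    """Greedy 'leader' clustering: a tile joins the first cluster whose
    leader (first member) lies within max_distance, else starts a new one."""
    clusters = []  # each cluster is a list of tile indices
    for i, feat in enumerate(features):
        for members in clusters:
            if math.dist(feat, features[members[0]]) <= max_distance:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical features per image block: (mean H, mean S, mean V, area ratio),
# each already scaled to [0, 1].
features = [
    (0.60, 0.80, 0.90, 0.20),
    (0.10, 0.20, 0.95, 0.05),
    (0.61, 0.79, 0.90, 0.20),
    (0.11, 0.21, 0.94, 0.05),
]
groups = cluster_tiles(features, max_distance=0.05)
```

Only image blocks that land in the same group are compared in detail afterwards, which keeps the number of pairwise comparisons small.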

Specifically, there are various color space models for color images, and in image processing, an RGB model and an HSV model are often used. The RGB model is a color space established based on the theory of three primary colors of human vision, red (R), green (G) and blue (B), i.e. it is believed that the appropriate mixing of 3 colors of red (R), green (G) and blue (B) can cause the perception of any color on the electromagnetic spectrum. HSV models are color spaces created based on the human visual perception properties, where the hue (H) represents different colors, such as red, green, blue; saturation (S) represents the shade of the color, e.g. dark blue, light blue; the brightness (V) represents the degree of brightness of the color, such as very bright (bright white) and very dark (dark). The conversion from the RGB space to the HSV space is realized by a conversion formula, and there are various expression forms for the conversion formula, and the principle is the same, and the following conversion formula is introduced in this embodiment:

In the above formula, the value ranges of R, G, and B are [0, 255]; the value range of H is [0, 360]; the value range of S is [0, 1]; and the value range of V is [0, 255]. In actual image processing, the value ranges of H, S, and V are often normalized to [0, 1].
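The conversion formula itself is not reproduced in this text. For orientation, the sketch below implements one widely used form of the RGB-to-HSV conversion that matches the value ranges stated above; it should not be read as the exact formula of the embodiment. The function name `rgb_to_hsv` and the convention of returning hue 0 for gray pixels are assumptions.

```python
def rgb_to_hsv(r, g, b):
    """Map R, G, B in [0, 255] to H in [0, 360), S in [0, 1], V in [0, 255]."""
    v = max(r, g, b)
    delta = v - min(r, g, b)
    s = delta / v if v else 0.0
    if delta == 0:
        h = 0.0  # gray pixel: hue is undefined, 0 is used by convention
    elif v == r:
        h = 60 * (((g - b) / delta) % 6)
    elif v == g:
        h = 60 * ((b - r) / delta + 2)
    else:
        h = 60 * ((r - g) / delta + 4)
    return h, s, v


red = rgb_to_hsv(255, 0, 0)
magenta = rgb_to_hsv(255, 0, 255)
mid_gray = rgb_to_hsv(128, 128, 128)
```

Dividing H by 360 and V by 255 gives the normalized [0, 1] ranges mentioned above; Python's standard `colorsys.rgb_to_hsv` works directly on such normalized values.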

It should be noted that, after the color space conversion, color features are first extracted from the converted image block. For the HSV color space, a color histogram is usually used for feature extraction, and the matching methods include the histogram intersection method, the distance method, the center distance method, the reference color table method, and the cumulative color histogram method. Converting the RGB color space into the HSV color space separates hue, saturation, and brightness, so image similarity can be identified more accurately from the color features.
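Of the matching methods listed, histogram intersection is the simplest to show. The sketch below builds a normalized hue histogram for each image block and sums the bin-wise minima. The bin count of 8, the use of hue alone, and the function names are assumptions; the source does not fix these details.

```python
def hue_histogram(hues, bins=8):
    """Normalized histogram of hue values in degrees (expects a non-empty list)."""
    counts = [0] * bins
    for h in hues:
        index = min(bins - 1, int((h % 360) / 360 * bins))
        counts[index] += 1
    return [c / len(hues) for c in counts]


def histogram_intersection(p, q):
    """Sum of bin-wise minima; 1.0 means the normalized histograms coincide."""
    return sum(min(a, b) for a, b in zip(p, q))


block_a = [10, 12, 200, 210]  # two reddish and two bluish pixels
block_b = [5, 20, 205, 100]   # two reddish, one bluish, one greenish pixel
score = histogram_intersection(hue_histogram(block_a), hue_histogram(block_b))
```

Here three of the four pixels fall into matching bins, so the intersection is 0.75; identical color distributions would score 1.0.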

Further, according to the extracted image features of the plurality of image blocks, clustering and grouping the image blocks, and determining the similarity degree of the image blocks belonging to the same cluster. Specifically, for two image blocks belonging to the same cluster, the picture similarity between the two image blocks is determined first. And if the picture similarity between the two image blocks is greater than the first similarity, respectively carrying out character recognition on the two image blocks to obtain character content and character positions. And finally, determining the character similarity degree of the two image blocks according to the character content and the character position.

The picture similarity between two image blocks is determined by template matching and correlation coefficients. Template matching means finding the position of a target template within an image: the position in the image that most resembles the template is the target. It suffices to compare every sub-region of the whole image with the target template; the sub-region most similar to the target template is the position of the target. A correlation coefficient between the target template and the sub-region is then calculated to measure how similar they are. In this embodiment, a matrix is first generated for each of the two image blocks from the pixel value of each pixel point, and the picture similarity between the two image blocks is then determined from the matrix correlation coefficient between the two matrices.
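A matrix correlation coefficient of this kind can be sketched as the Pearson correlation between the pixel values of two equally sized image blocks. The source does not say how blocks of different sizes are aligned or how flat (single-valued) blocks are scored, so the equal-size requirement and the flat-block convention below are assumptions, as is the function name `correlation`.

```python
import math


def correlation(block_a, block_b):
    """Pearson correlation of two equally sized gray-value matrices."""
    a = [v for row in block_a for v in row]
    b = [v for row in block_b for v in row]
    if len(a) != len(b):
        raise ValueError("blocks must contain the same number of pixels")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumed convention for flat blocks, where the coefficient is undefined.
        return 1.0 if a == b else 0.0
    return cov / math.sqrt(var_a * var_b)


checker = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
same = correlation(checker, checker)
opposite = correlation(checker, inverted)
```

A value close to 1 indicates nearly identical pixel patterns, and it is this value that would be compared with the first similarity.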

Step 105, determining whether repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster.

Specifically, for the image blocks belonging to the same cluster, the similarity degree of the image blocks is determined through image similarity comparison and character similarity comparison, and whether the repeated page content exists is further determined.

The picture similarity between two image blocks is determined by template matching and correlation coefficients. Template matching means finding the position of a target template within an image: the position in the image that most resembles the template is the target. It suffices to compare every sub-region of the whole image with the target template; the sub-region most similar to the target template is the position of the target. A correlation coefficient between the target template and the sub-region is then calculated to measure how similar they are.

Further, the text content in the region is recognized by optical character recognition (OCR); the character recognition module completes the recognition by extracting features of different sample Chinese characters.

As a possible implementation situation, for two image blocks belonging to the same cluster, if the picture similarity between the two image blocks is greater than the first similarity, and the text similarity between the two image blocks is greater than the second similarity, it is determined that there is a repeated page content.

As another possible implementation case, for two blocks belonging to the same cluster, if the picture similarity between the two blocks is not greater than the first similarity, or the text similarity between the two blocks is not greater than the second similarity, it is determined that there is no repeated page content.
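The two cases above amount to a two-stage check, sketched below. The threshold values and the use of `difflib.SequenceMatcher` as the text-similarity measure are assumptions; the source determines text similarity from both text content and text position, whereas this sketch compares content only.

```python
from difflib import SequenceMatcher

FIRST_SIMILARITY = 0.9   # hypothetical picture-similarity threshold
SECOND_SIMILARITY = 0.8  # hypothetical text-similarity threshold


def text_similarity(text_a, text_b):
    """Content-only similarity in [0, 1]; text position is ignored here."""
    return SequenceMatcher(None, text_a, text_b).ratio()


def is_duplicate(picture_similarity, text_a, text_b):
    """True only if both the picture and the text similarity exceed their thresholds."""
    if picture_similarity <= FIRST_SIMILARITY:
        return False  # text is not compared when the pictures differ
    return text_similarity(text_a, text_b) > SECOND_SIMILARITY


repeated = is_duplicate(0.95, "Top stories today", "Top stories today")
different_text = is_duplicate(0.95, "Top stories today", "Weather forecast")
different_picture = is_duplicate(0.50, "Top stories today", "Top stories today")
```

Checking the picture similarity first means that character recognition, the more expensive step, is only needed for pairs that already look alike.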

According to the method for detecting the content of the repeated page, disclosed by the embodiment of the invention, the image of the page is obtained by carrying out interface screenshot on the page to be detected; identifying a segmentation region from the image according to the gray information of the image; segmenting the image according to the segmentation area to obtain a plurality of image blocks; clustering is carried out according to the image characteristics of the image blocks, and the similarity degree of the image blocks belonging to the same cluster is determined; and determining whether the repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster. According to the method, the image obtained by the interface screenshot is segmented through an image technology, so that each image block is compared, repeated detection of the page content can be completed without additionally preparing a reference object, the technical problems that manual detection depends on manual experience and a large number of reference objects are additionally prepared for automatic verification in the prior art, the testing cost is high are solved, and the efficiency of repeated page content detection is improved.

To illustrate the previous embodiment clearly and to explain more specifically how repeated content detection is carried out after the image to be detected is obtained, this embodiment provides another method for detecting repeated page content. FIG. 2 is a schematic flowchart of a method for detecting repeated page content according to a second embodiment of the present invention.

As shown in fig. 2, the method for detecting content of a duplicate page may include the following steps:

Step 201, performing interface screenshot on a page to be detected to obtain an image of the page.

Specifically, the image of the page to be detected is obtained by performing different screenshot operations on different pages to be detected.

As a possible situation, when a page of an APP is detected, a screenshot of the page to be detected is taken through the mobile device; devices running different operating systems take screenshots through different operations, and the image of the page to be detected is finally obtained.

As another possible situation, when a web page is detected, a screenshot of the page to be detected is taken with a screenshot tool, screenshot software, or the like on the computer device to obtain the image of the page to be detected.

Step 202, determining the gray level of each pixel point according to the gray level information of the image, identifying the segmentation areas from the image, and merging the adjacent segmentation areas.

Specifically, the gray information of the image refers to the gray value of each pixel point in the image obtained after the image is grayed. A pixel is the smallest unit of an image; an image is composed of a plurality of pixel points arranged in an array. The detection terminal determines the gray value of each pixel point from the gray information of the image, and the identification module then identifies the segmentation regions from the image.

Further, a specific method for identifying the segmentation region from the image of the page to be detected is as follows: and searching along the row or the column of each pixel point array to obtain at least one row of pixel points or at least one column of pixel points with the same gray level, and taking the searched at least one row of pixel points or at least one column of pixel points as a segmentation area. And then searching whether the adjacent areas have similar features, and if so, merging the adjacent divided areas.

Step 203, segmenting the image according to the segmented regions to obtain a plurality of image blocks.

Specifically, image segmentation is the technique and process of dividing an image into several specific regions with unique properties and extracting the objects of interest. At present, the main image segmentation methods include threshold-based segmentation, region-based segmentation, and edge-based segmentation. The basic idea of segmentation based on edge detection is to determine the edge pixels in an image and then connect those pixels to form the required region boundaries. The segmentation problem is solved by detecting the edges that bound different regions: a place where the gray level or structure changes abruptly indicates the end of one region and the start of another, and such a discontinuity is called an edge. Different regions have different gray levels and their boundaries generally show obvious edges, so this characteristic can be used to segment the image.

As a possible implementation manner, the image to be detected may be segmented along the edge of the segmented region, so as to obtain a plurality of image blocks. In particular, the gray values of the pixels at the edges in the image are discontinuous, which discontinuity can be detected by taking the derivative.

As another possible implementation manner, a segmentation line may be determined inside the segmented region, and the image to be detected is segmented along the segmentation line to obtain a plurality of image blocks.

Step 204, determining the area ratio of each image block in the image, and deleting the image blocks with the area ratio lower than the threshold ratio.

Specifically, after the image to be detected is segmented into a plurality of image blocks, the area ratio of each image block in the image is determined, and image blocks whose area ratio is below the threshold ratio are deleted. For example, regions whose height ratio in the page is less than 2 percent, or whose area ratio is less than 2 per thousand, are deleted.
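The filtering in this step can be sketched as below, assuming the example means a height ratio below 2 percent or an area ratio below 2 per thousand (0.2 percent) of the page. The function name `keep_block`, its parameters, and the 1080 × 1920 page size are hypothetical.

```python
def keep_block(block_w, block_h, page_w, page_h,
               min_height_ratio=0.02, min_area_ratio=0.002):
    """Return False for blocks that are too short or too small to matter."""
    height_ratio = block_h / page_h
    area_ratio = (block_w * block_h) / (page_w * page_h)
    return height_ratio >= min_height_ratio and area_ratio >= min_area_ratio


PAGE_W, PAGE_H = 1080, 1920  # illustrative screenshot size
blocks = [(1080, 400), (1080, 30), (40, 60)]  # (width, height) of each block
kept = [b for b in blocks if keep_block(b[0], b[1], PAGE_W, PAGE_H)]
```

The full-width strip that is only 30 pixels tall fails the height test, and the 40 × 60 block fails the area test, so only the first block goes on to clustering.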

Step 205, clustering according to the image characteristics of the multiple image blocks, and determining the similarity of the image blocks belonging to the same cluster.

It should be noted that, after an image to be detected is segmented to obtain a plurality of image blocks, for each image block, color space conversion is performed first, conversion from an RGB space to an HSV space is performed, then color feature extraction is performed on the image block after color space conversion, and finally, the extracted color feature and the area ratio of the image block in the image are used as image features. The conversion from the RGB space to the HSV space is realized by a conversion formula, and the conversion method is the same as that in the first embodiment, which is not described again in this embodiment.

Specifically, after color space conversion is performed on the image blocks, color feature extraction is performed on the converted image blocks first. For HSV color space, a color histogram is usually used for feature extraction, and the matching method includes: histogram intersection method, distance method, center distance method, reference color table method, cumulative color histogram method.

Further, according to the extracted image features of the plurality of image blocks, clustering and grouping the image blocks, and determining the similarity degree of the image blocks belonging to the same cluster. Specifically, firstly, for two image blocks belonging to the same cluster, the picture similarity between the two image blocks is determined. And if the image similarity between the two image blocks is greater than the first similarity, respectively carrying out character recognition on the two image blocks to obtain character content and character positions. And finally, determining the character similarity degree of the two image blocks according to the character content and the character position.

The picture similarity between two image blocks is determined by template matching and correlation coefficients. Template matching means finding the position of a target template within an image: the position in the image that most resembles the template is the target. It suffices to compare every sub-region of the whole image with the target template; the sub-region most similar to the target template is the position of the target. The similarity between the target template and the sub-region is then measured by calculating a correlation coefficient between them. In this embodiment, a matrix is first generated for each of the two image blocks from the pixel value of each pixel point, and the picture similarity between the two image blocks is then determined from the matrix correlation coefficient between the two matrices.

Step 206, determining whether repeated page content exists by judging the picture similarity and the character similarity of two image blocks in the same cluster.

Specifically, for the preprocessed image blocks belonging to the same cluster, the similarity degree of the image blocks is determined through image similarity comparison and character similarity comparison, and whether repeated page content exists is further determined.

Further, the character content in the area is recognized by optical character recognition (OCR), and the character recognition module completes the recognition by extracting the features of the Chinese characters in different samples.

As a possible application scenario, when the method provided in this embodiment is executed, the detection method itself is the same, but the detection steps may differ: several steps may be combined and executed together, or one step may be divided into more steps. This embodiment therefore further provides the method flow shown in fig. 3, which is a schematic flow diagram of another method for detecting repeated page content provided in an embodiment of the present invention.

As shown in fig. 3, step 301 is executed to input a picture to be detected to a detection terminal, where the picture to be detected is obtained by performing an interface screenshot on a page to be detected.

Next, step 302 is executed to perform page segmentation on the picture to be detected. This step includes marking the split areas, merging the split areas, and determining the split lines.

Specifically, after the image to be detected is grayed, the gray level of each pixel point is determined according to the gray level information of the image, and the segmentation regions are then identified from the image by the identification module. Adjacent segmentation regions are checked for similar characteristics and, if they have similar characteristics, are merged. Finally, the image to be detected is segmented along the edges of the segmentation regions to obtain a plurality of image blocks, or a segmentation line is determined within each segmentation region and the image to be detected is segmented along the segmentation line to obtain a plurality of image blocks.
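The page segmentation of step 302 can be sketched for the row direction as follows. This is an illustrative sketch only: the function names are hypothetical, the grayed image is assumed to be a 2-D list of gray levels, and adjacent separator rows are merged simply because they are consecutive, which is a simplification of the similar-characteristics check described above.

```python
def uniform_rows(gray):
    """Indices of rows whose pixel points all share one gray level."""
    return [i for i, row in enumerate(gray) if len(set(row)) == 1]


def merge_adjacent(indices):
    """Merge consecutive row indices into (start, end) regions, end inclusive."""
    regions = []
    for i in indices:
        if regions and i == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], i)
        else:
            regions.append((i, i))
    return regions


def split_rows(gray):
    """Cut the image along the edges of the merged segmentation regions.

    Returns the (first_row, last_row) range of each resulting image block.
    """
    blocks, start = [], 0
    for top, bottom in merge_adjacent(uniform_rows(gray)):
        if top > start:
            blocks.append((start, top - 1))
        start = bottom + 1
    if start < len(gray):
        blocks.append((start, len(gray) - 1))
    return blocks
```

The same search can be repeated along the columns of each resulting block to split it further.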

Further, step 303 is executed to pre-process the segmented image. The step includes filtering the segmented images, extracting regional features and clustering according to the extracted features.

Specifically, after the image to be detected is segmented to obtain a plurality of image blocks, the area ratio of each image block in the image is determined, and the image blocks whose area ratio is lower than the threshold ratio are deleted. Then, color space conversion from the RGB space to the HSV space is performed on each image block, color features are extracted from the converted image blocks, and the extracted color features together with the area ratio of each image block in the image are taken as the image features. Finally, the image blocks are clustered and grouped according to the extracted image features of the plurality of image blocks.

Then, step 304 is executed to perform comparison processing on the preprocessed pictures. Finally, step 305 is executed to output the repeated region.
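The area-ratio filtering performed during the preprocessing of step 303 can be sketched as follows. The function name, the (x, y, width, height) block representation, and the default threshold of 0.01 are assumptions for illustration; the embodiment does not fix a threshold value.

```python
def filter_small_blocks(blocks, image_area, threshold_ratio=0.01):
    """Drop image blocks whose share of the image area is below threshold_ratio.

    blocks: list of hypothetical (x, y, width, height) tuples.
    Returns a list of (block, area_ratio) pairs for the blocks that are kept.
    """
    kept = []
    for x, y, w, h in blocks:
        ratio = (w * h) / float(image_area)
        if ratio >= threshold_ratio:
            kept.append(((x, y, w, h), ratio))
    return kept
```

The returned area ratio can be reused later as part of the image features of each block.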

Specifically, the comparison processing is performed on the preprocessed pictures, and the similarity degree of the picture blocks is determined by performing picture similarity comparison and character similarity comparison on the preprocessed picture blocks belonging to the same cluster, so as to further determine whether the repeated page content exists. And finally, outputting the repeated area of the page content through the detection terminal.

The above only briefly describes the detection steps shown in fig. 3, and the specific detection method is the same as the detection method shown in this embodiment, and is not described here again.

According to the method for detecting repeated page content of the embodiment of the invention, an image of the page is obtained by performing an interface screenshot on the page to be detected; the gray level of each pixel point is determined according to the gray level information of the image, the segmentation regions are identified from the image, and adjacent segmentation regions are merged; the image is segmented according to the segmentation regions to obtain a plurality of image blocks; the area ratio of each image block in the image is determined, and the image blocks whose area ratio is lower than a threshold ratio are deleted; clustering is performed according to the image features of the plurality of image blocks, and the similarity degree of the image blocks belonging to the same cluster is determined; and whether repeated page content exists is determined by judging the picture similarity and the character similarity of two image blocks in the same cluster. In this method, the image obtained by the interface screenshot is segmented by image processing so that the image blocks can be compared with one another, and repeated page content can be detected without preparing any additional reference object. This solves the technical problems in the prior art that manual detection depends on manual experience, that automatic verification requires a large number of additionally prepared reference objects, and that the testing cost is high, and it improves the efficiency of repeated page content detection.

By testing with the repeated content detection method explained in the above embodiment, the application result shown in fig. 4 can be obtained; fig. 4 is an exemplary diagram of repeated page content discovered by applying the repeated content detection method of the present invention. With this detection method, repeated content can be found in the three pictures of fig. 4: the character content in the two rectangular frames at the upper left corner of the left picture is repeated, and the picture content in the two pictures on the right is repeated, which indicates that an abnormality exists.

In order to implement the above embodiment, the present invention further provides a device for detecting content of duplicate pages.

Fig. 5 is a schematic structural diagram of an apparatus for detecting content of a repeated page according to an embodiment of the present invention.

As shown in fig. 5, the repeated page content detection apparatus includes: an obtaining module 110, an identification module 120, a segmentation module 130, a clustering module 140 and a detection module 150.

The obtaining module 110 is configured to perform interface screenshot on a page to be detected to obtain an image of the page.

Specifically, the obtaining module 110 obtains an image of the page by performing an interface screenshot on the page to be detected. Mobile devices of different brands take screenshots through different operations; the resulting picture to be detected is then input into the detection terminal.

An identifying module 120, configured to identify a segmentation region from the image according to the grayscale information of the image.

Specifically, after graying processing is performed on the picture to be detected, the gray value of each pixel point in the picture is obtained. And searching along the row or the column of each pixel point array after graying to obtain at least one row of pixel points or at least one column of pixel points with the same gray level, and marking the searched at least one row of pixel points or at least one column of pixel points as a segmentation area. The identification module 120 identifies the segmentation region from the image according to the gray value of each pixel point.

And the segmentation module 130 is configured to segment the image according to the segmented region to obtain a plurality of image blocks.

Specifically, after the identification module 120 identifies the segmentation regions from the image, the segmentation module 130 segments the image. Image segmentation is the technique and process of dividing an image into a number of specific regions with unique properties and extracting the objects of interest.

The basic idea of the image segmentation method based on edge detection is to determine the edge pixels in an image and then connect these pixels to form the required region boundaries. The segmentation problem is solved by detecting the edges between different regions, that is, the places where the gray level or the structure changes abruptly, indicating that one region ends and another region begins; such discontinuities are called edges. Different regions of an image have different gray levels, and their boundaries generally show obvious edges, so the image can be segmented by using this characteristic to obtain a plurality of image blocks.
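The notion of an edge as an abrupt change in gray level can be illustrated on a single row of pixels. This is a deliberately simplified sketch: the function name and the jump threshold of 50 are assumptions, and a practical edge detector would operate on the two-dimensional neighborhood of each pixel.

```python
def edge_positions(row, min_jump=50):
    """Indices at which the gray level differs from the previous pixel by at least min_jump."""
    return [i for i in range(1, len(row)) if abs(row[i] - row[i - 1]) >= min_jump]
```

For the row `[10, 12, 11, 200, 198, 20]`, the sketch reports indices 3 and 5, the two places where one region ends and another begins.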

And the clustering module 140 is configured to perform clustering according to the image features of the multiple image blocks, and determine a similarity degree for the image blocks belonging to the same cluster.

Specifically, the image features of the image blocks are obtained as follows: after the image to be detected is segmented to obtain a plurality of image blocks, color space conversion from the RGB space to the HSV space is first performed on each image block, color features are then extracted from the converted image blocks, and finally the extracted color features together with the area ratio of each image block in the image are taken as the image features. The clustering module 140 clusters and groups the image blocks according to their image features, so that the image blocks in the same group have similar image features.
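The feature extraction and grouping can be sketched as follows. This is not the clustering algorithm of the embodiment, which is not specified here. The sketch assumes a simple feature vector (mean H, S, V plus the area ratio) in place of a full color histogram, and a greedy threshold-based grouping; all names and the distance threshold of 0.1 are hypothetical.

```python
import colorsys


def block_features(pixels, block_area, image_area):
    """Mean H, S, V of a block plus its area ratio, returned as a 4-tuple.

    pixels: non-empty list of (r, g, b) values in 0..255.
    Note: hue is circular, so a plain mean is only a rough summary.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for r, g, b in pixels]
    n = float(len(hsv))
    mean = tuple(sum(c[i] for c in hsv) / n for i in range(3))
    return mean + (block_area / float(image_area),)


def group_blocks(features, max_distance=0.1):
    """Greedy grouping: a block joins the first group whose seed block is close enough.

    Returns a list of groups; each group is a list of block indices,
    and the first index in a group is its seed.
    """
    groups = []
    for idx, feat in enumerate(features):
        for group in groups:
            seed = features[group[0]]
            # Largest per-component difference between the block and the seed.
            dist = max(abs(a - b) for a, b in zip(feat, seed))
            if dist <= max_distance:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups
```

Only blocks that end up in the same group are compared pairwise afterwards, which keeps the number of picture and character comparisons small.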

To explain further, for two image blocks belonging to the same cluster, the clustering module 140 first determines the picture similarity between the two image blocks. If the picture similarity between the two image blocks is greater than the first similarity, character recognition is performed on each of the two image blocks to obtain the character content and the character positions. Finally, the character similarity of the two image blocks is determined according to the character content and the character positions.
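The two-stage comparison can be sketched as follows. The sketch assumes that character recognition has already produced, for each image block, a list of (text, x, y) items with positions relative to the block. How content and position are combined into one character similarity is not specified here, so the scoring rule, the function names, and all threshold values below are assumptions for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(items_a, items_b, max_offset=10):
    """Character similarity of two blocks from recognized (text, x, y) items.

    Items are paired in order. A pair whose positions differ by more than
    max_offset pixels scores 0; otherwise it scores the content similarity.
    """
    if not items_a or len(items_a) != len(items_b):
        return 0.0
    scores = []
    for (text_a, xa, ya), (text_b, xb, yb) in zip(items_a, items_b):
        if abs(xa - xb) > max_offset or abs(ya - yb) > max_offset:
            scores.append(0.0)
        else:
            scores.append(SequenceMatcher(None, text_a, text_b).ratio())
    return sum(scores) / len(scores)


def is_repeated(picture_sim, items_a, items_b, first_sim=0.9, second_sim=0.9):
    """Characters are compared only when the picture similarity exceeds first_sim."""
    if picture_sim <= first_sim:
        return False
    return text_similarity(items_a, items_b) > second_sim
```

Skipping character recognition for pairs that already fail the picture comparison avoids running the comparatively expensive recognition step on blocks that cannot be repeats.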

And the detection module 150 is configured to determine whether repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster.

Specifically, for the image blocks belonging to the same cluster, the detection module 150 determines the similarity degree of the image blocks through picture similarity comparison and character similarity comparison, and further determines whether repeated page content exists.

According to the repeated page content detection device of the embodiment of the invention, an image of the page is obtained by performing an interface screenshot on the page to be detected; the segmentation regions are identified from the image according to the gray information of the image; the image is segmented according to the segmentation regions to obtain a plurality of image blocks; clustering is performed according to the image features of the image blocks, and the similarity degree of the image blocks belonging to the same cluster is determined; and whether repeated page content exists is determined according to the similarity degree between the image blocks belonging to the same cluster. In this way, the image obtained by the interface screenshot is segmented by image processing so that the image blocks can be compared with one another, and repeated page content can be detected without preparing any additional reference object. This solves the technical problems in the prior art that manual detection depends on manual experience, that automatic verification requires a large number of additionally prepared reference objects, and that the testing cost is high, and it improves the efficiency of repeated page content detection.

It should be noted that the foregoing explanation on the embodiment of the method for detecting content of a duplicate page is also applicable to the device for detecting content of a duplicate page of the embodiment, and is not repeated here.

In order to implement the above embodiment, the present invention further provides a computer device, including: a processor and a memory for storing processor-executable instructions.

Wherein, the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the duplicate page content detection method proposed by the foregoing embodiment of the present invention.

In order to implement the foregoing embodiments, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method for detecting content of a duplicate page set forth in the foregoing first aspect.

FIG. 6 illustrates a block diagram of an exemplary computer device suitable for use to implement embodiments of the present application. The computer device 12 shown in fig. 6 is only an example and should not bring any limitation to the function and scope of use of the embodiments of the present application.

As shown in FIG. 6, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, to name a few.

Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, and commonly referred to as a "hard drive"). Although not shown in FIG. 6, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.

Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any device (e.g., network card, modem, etc.) that enables computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.

The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method for content detection of a repeated page, the method comprising the steps of:

determining whether repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster; the determining the similarity degree of the image blocks belonging to the same cluster comprises the following steps:

determining the picture similarity degree between two image blocks belonging to the same cluster; wherein, include:

respectively generating a matrix corresponding to each image block according to the pixel value of each pixel point in the two image blocks;

determining the picture similarity degree between the two image blocks according to the matrix correlation coefficient between the corresponding matrixes of the two image blocks;

if the picture similarity between the two image blocks is greater than the first similarity, respectively carrying out character recognition on the two image blocks to obtain character content and character positions;

and determining the character similarity degree of the two image blocks according to the character content and the character position.

2. The method for detecting content of a repeated page according to claim 1, wherein the identifying a segmentation region from the image according to the gray scale information of the image comprises:

determining the gray level of each pixel point in the image according to the gray level information; wherein, in the image, each pixel point is arranged in an array;

and searching along the rows or columns of the array in the image to obtain at least one row of pixel points or at least one column of pixel points with the same gray level, and taking the searched at least one row of pixel points or at least one column of pixel points as a segmentation area.

3. The method for detecting content of repeated pages according to claim 2, wherein after the step of using at least one row of searched pixels or at least one column of searched pixels as a partition region, the method further comprises:

and merging the adjacent divided areas.

4. The method for detecting content of repeated pages according to claim 2, wherein said segmenting the image according to the segmentation region to obtain a plurality of segments comprises:

segmenting the image along the edge of the segmentation area to obtain a plurality of image blocks;

or determining a dividing line in the dividing region, and dividing the image along the dividing line to obtain a plurality of image blocks.

5. The method for detecting content of repeated pages according to any one of claims 1 to 4, wherein after segmenting the image according to the segmentation region to obtain a plurality of segments, the method further comprises:

determining the area proportion of each image block in the image;

and deleting the image blocks whose area ratio is lower than a threshold ratio.

6. The method for detecting content of repeated pages according to any one of claims 1 to 4, wherein after segmenting the image according to the segmentation region to obtain a plurality of segments, the method further comprises:

performing color space conversion on each image block, and converting from an RGB space to an HSV space;

extracting color features of the image blocks after color space conversion;

and taking the extracted color features and the area ratio of the image blocks in the image as image features.

7. The method for detecting content of repeated pages according to claim 6, wherein said determining whether content of repeated pages exists according to similarity degree between tiles belonging to the same cluster comprises:

for two image blocks belonging to the same cluster, if the image similarity between the two image blocks is greater than a first similarity and the character similarity of the two image blocks is greater than a second similarity, determining that the repeated page content exists;

and if the picture similarity between the two image blocks is not greater than the first similarity, or the character similarity of the two image blocks is not greater than the second similarity, determining that the repeated page content does not exist.

8. An apparatus for duplicate page content detection, the apparatus comprising:

the detection module is used for determining whether repeated page content exists according to the similarity degree between the image blocks belonging to the same cluster;

the detection module is also used for determining the picture similarity degree between two image blocks belonging to the same cluster; wherein, include:

9. A computer device, comprising: memory, processor and computer program stored on the memory and executable on the processor, the processor implementing a method of duplicate page content detection as claimed in any of claims 1 to 7 when executing the program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for duplicate page content detection as claimed in any one of claims 1 to 7.