CN113239256B - Method for generating website signature, method and device for identifying website - Google Patents

Info

Publication number
CN113239256B
Authority
CN
China
Prior art keywords
website
signature
page
pages
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110531404.3A
Other languages
Chinese (zh)
Other versions
CN113239256A (en
Inventor
刘伟
余文利
张博
林赛群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110531404.3A
Publication of CN113239256A
Application granted
Publication of CN113239256B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The disclosure provides a method for generating a website signature, and a method and a device for identifying websites. It relates to the field of computer technology, and in particular to website identification. In one implementation, a method of generating a website signature comprises: determining page types of a plurality of pages of a target website according to the Uniform Resource Locators (URLs) of the pages, wherein the page types comprise at least an index type and a content type; determining topic information for the index-type pages among the plurality of pages; determining a page template for the content-type pages among the plurality of pages; and generating a website signature of the target website according to the page types, the topic information and the page template.

Description

Method for generating website signature, method and device for identifying website
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to the field of website identification technologies.
Background
The internet produces a huge amount of data every day, a large share of which is junk data, and cheating websites contain a particularly high proportion of it. To raise the ranking of their web pages in a search engine, designers of cheating websites use a range of techniques to deceive the search engine. As a result, these websites contain a great deal of meaningless junk data, and the pages users see are inconsistent with the content crawled by the search engine. Such cheating websites seriously harm the internet ecosystem and infringe on the rights and interests of users.
Disclosure of Invention
The present disclosure provides a method for generating a website signature, and a method, an apparatus, a device and a storage medium for identifying websites.
According to an aspect of the present disclosure, there is provided a method of generating a website signature, including: determining page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the pages, wherein the page types comprise at least an index type and a content type; determining topic information for the index-type pages among the plurality of pages; determining a page template for the content-type pages among the plurality of pages; and generating a website signature of the target website according to the page types, the topic information and the page template.
According to another aspect of the present disclosure, there is provided a method of identifying a website, comprising: acquiring a website signature of a first website and a website signature of a second website; determining whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website; wherein the website signature of the first website and the website signature of the second website are generated according to the method for generating the website signature of the embodiment of the disclosure.
Another aspect of the present disclosure provides a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods shown in the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided an apparatus for generating a website signature, including: a page type determining module for determining page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the pages, wherein the page types comprise at least an index type and a content type; a topic information determining module for determining topic information for the index-type pages among the plurality of pages; a page template determining module for determining a page template for the content-type pages among the plurality of pages; and a website signature generation module for generating the website signature of the target website according to the page types, the topic information and the page template.
According to another aspect of the present disclosure, there is provided an apparatus for identifying a website, including: the acquisition module is used for acquiring the website signature of the first website and the website signature of the second website; the identification module is used for determining whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website; wherein the website signature of the first website and the website signature of the second website are generated according to the method for generating the website signature of the embodiment of the disclosure.
According to another aspect of the disclosed embodiments, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the methods shown in the disclosed embodiments.
According to another aspect of the disclosed embodiments, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method shown in the disclosed embodiments.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a flow diagram of a method of generating a website signature in accordance with an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of determining topic information in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a page of content types according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow diagram of a method of determining a page template in accordance with an embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of an index-type page according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow diagram of a method of generating a website signature for a target website from a page type, topic information, and page templates, in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of a method of identifying websites in accordance with an embodiment of the present disclosure;
FIG. 8 schematically illustrates a schematic block diagram of an apparatus for generating website signatures in accordance with an embodiment of the present disclosure;
FIG. 9 schematically illustrates a schematic block diagram of an apparatus for identifying websites in accordance with an embodiment of the present disclosure; and
FIG. 10 illustrates a schematic block diagram of an example electronic device that may be used to implement the method of generating a website signature and/or the method of identifying a website of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The method of generating the website signature will be described in detail with reference to fig. 1.
Fig. 1 schematically illustrates a flow diagram of a method of generating a website signature according to an embodiment of the present disclosure.
As shown in FIG. 1, the method 100 of generating a website signature may include operations S110-S140.
In operation S110, page types of a plurality of pages of a target web site are determined according to URLs (uniform resource locators) of the plurality of pages.
According to an embodiment of the present disclosure, the page type includes at least an index type and a content type. Wherein, the index type web pages can be used for indexing other web pages, and the content type web pages can be used for displaying specific content information.
According to embodiments of the present disclosure, an initial URL of the target website may be obtained; the initial URL may include, for example, the URL of the home page of the target website. A web crawler (web spider) then traverses the website starting from the initial URL to obtain the URLs of multiple pages of the target website.
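The traversal described above can be sketched as a breadth-first walk from the initial URL. This is only an illustration: the function name `collect_urls`, the `fetch_links` callback (which stands in for downloading a page and extracting its links) and the `max_pages` limit are assumptions, not part of the disclosure.

```python
from collections import deque
from typing import Callable, Iterable, List


def collect_urls(initial_url: str,
                 fetch_links: Callable[[str], Iterable[str]],
                 max_pages: int = 100) -> List[str]:
    """Breadth-first traversal starting at initial_url.

    fetch_links is a hypothetical callback returning the URLs linked
    from a page; a real crawler would download and parse the page here.
    """
    seen = {initial_url}
    ordered = [initial_url]
    queue = deque([initial_url])
    while queue and len(ordered) < max_pages:
        current = queue.popleft()
        for link in fetch_links(current):
            # Skip pages already discovered and respect the page limit.
            if link not in seen and len(ordered) < max_pages:
                seen.add(link)
                ordered.append(link)
                queue.append(link)
    return ordered


# Usage with an in-memory link graph instead of real network access.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
print(collect_urls("a", lambda url: graph.get(url, [])))
```

Passing the link extractor as a callback keeps the sketch runnable without network access and leaves the fetching policy open.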
According to embodiments of the present disclosure, one or more feature strings are pre-configured for each page type, where the feature strings may be used to match structural features of the URL. The feature string that matches the URL of each of the plurality of pages may be determined. And then determining the page type of each page according to the characteristic character string.
Illustratively, in the present embodiment, the feature strings corresponding to the index type may include, for example: news.*.com.cn. The feature strings corresponding to the content type may include, for example: news.*.com.cn/pages/page= and finish.*.com.cn/pages/page=. Here * represents any character or combination of characters.
For example, for a page with the URL news.xxx.com.cn, it may be determined that the URL matches the feature string news.*.com.cn, and accordingly that the page is of the index type. For a page with the URL news.xxx.com.cn/pages/page=1, it may be determined that the URL matches the feature string news.*.com.cn/pages/page=, and accordingly that the page is of the content type.
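A minimal sketch of this matching step follows, assuming the feature strings are glob-style patterns in which * matches any run of characters and that content-type feature strings act as prefixes (hence the trailing * added here). The table `FEATURE_STRINGS` and the function `classify_url` are illustrative names only.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical feature-string table; "*" is assumed to be the wildcard.
# A trailing "*" is added to the content-type pattern so that anything
# after "page=" (for example "page=1") is accepted.
FEATURE_STRINGS = {
    "content": ["news.*.com.cn/pages/page=*"],
    "index": ["news.*.com.cn"],
}


def classify_url(url: str) -> Optional[str]:
    """Return the page type whose feature string matches the URL, if any."""
    # Drop the scheme and a trailing slash so only the structure is matched.
    bare = url.split("://", 1)[-1].rstrip("/")
    for page_type, patterns in FEATURE_STRINGS.items():
        if any(fnmatchcase(bare, pattern) for pattern in patterns):
            return page_type
    return None


print(classify_url("news.xxx.com.cn"))               # index
print(classify_url("news.xxx.com.cn/pages/page=1"))  # content
```

Because `fnmatchcase` matches the whole string, the index pattern does not accidentally match the longer content URL.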
In operation S120, topic information is determined for pages of an index type (or directory type) among a plurality of pages.
According to embodiments of the present disclosure, topic information may be used to represent the topic of a website. Illustratively, in the present embodiment, the topic information may be determined according to the meta information in the index-type pages.
In operation S130, a page template is determined for a page of a content type among a plurality of pages.
According to embodiments of the present disclosure, a page template may be used to represent a standardized format of a page. Illustratively, in this embodiment, key elements and paths of the key elements in the DOM node tree may be extracted from the content type page, and the page template may be determined according to the paths of the key elements in the DOM node tree. The key elements may include, for example, elements in the page for representing key information, where the key information may include, for example, title, content, source, time, and the like.
In operation S140, a website signature of the target website is generated according to the page type, the topic information and the page template.
According to the embodiment of the disclosure, generating the website signature of the target website from the page type, the topic information and the page template keeps the generation method simple. In addition, because the website is signed at the granularity of the whole website, the website signature is not invalidated by a change in the content of a single page.
According to other embodiments of the present disclosure, the web page types may include other page types in addition to the index type and the content type.
Illustratively, in another embodiment of the present disclosure, the page types may also include a function type and an auxiliary type. Wherein, the feature character string corresponding to the function type may include: login. The feature string corresponding to the auxiliary type may include: abate. Accordingly, when the first signature information is generated, the characteristic strings corresponding to the index type, the content type, the function type and the auxiliary type can be spliced to obtain a first string, and then the first string is signed to obtain the first signature information.
A method of determining the subject information will be described in detail with reference to fig. 2.
Fig. 2 schematically illustrates a flow diagram of a method of determining topic information in accordance with an embodiment of the present disclosure.
As shown in fig. 2, the method 220 of determining theme information may include operations S221 to S223.
In operation S221, meta information in a page of the index type is acquired.
According to the embodiment of the disclosure, the information in the meta tag in the page can be acquired as meta information.
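Collecting the information in the meta tags can be sketched with the standard-library HTML parser. The class `MetaCollector` and the function `extract_meta_information` are illustrative names, and reading only the `content` attribute of each meta tag is an assumption about what counts as meta information.

```python
from html.parser import HTMLParser
from typing import List


class MetaCollector(HTMLParser):
    """Collects the content attribute of every <meta> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.contents: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            content = dict(attrs).get("content")
            if content:  # skip tags such as <meta charset="utf-8">
                self.contents.append(content)


def extract_meta_information(html: str) -> List[str]:
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.contents


page = ('<html><head><meta charset="utf-8">'
        '<meta name="keywords" content="news, sports">'
        '<meta name="description" content="Daily news"/></head></html>')
print(extract_meta_information(page))
```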
In operation S222, a semantic extraction operation is performed with respect to the meta information, resulting in original text data.
According to the embodiment of the disclosure, performing semantic extraction on the meta information can at least partially avoid the weak compatibility of a whole-site signature caused by changes in page content.
In operation S223, the original text data is normalized to obtain at least one subject word as the topic information.
According to an embodiment of the present disclosure, the normalization processing may include, for example, bigram (two-character segmentation) processing. In this embodiment, performing bigram processing on the original text data splits it into a plurality of terms of two characters each, i.e., the subject words (keywords). It should be noted that, in practical applications, other ways of normalizing the original text data may also be adopted, which is not specifically limited in the present disclosure.
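As a rough illustration of bigram normalization, the sketch below splits the original text data on separators and emits overlapping two-character terms for each segment. The function name, the separator set, the use of overlapping windows and the handling of single-character segments are all assumptions; the disclosure does not fix these details.

```python
import re
from typing import List


def bigram_subject_words(raw_text: str) -> List[str]:
    """Split text on separators, then cut each segment into 2-character terms."""
    words: List[str] = []
    # Whitespace plus common ASCII and CJK separators (an assumed set).
    for segment in re.split(r"[\s,，、;；]+", raw_text):
        if not segment:
            continue
        if len(segment) < 2:
            words.append(segment)  # assumption: keep short segments as-is
            continue
        # Overlapping windows of two characters.
        words.extend(segment[i:i + 2] for i in range(len(segment) - 1))
    return words


print(bigram_subject_words("新闻,时政,国际新闻"))
```

A segment that is already two characters long passes through unchanged, while a longer segment yields several overlapping terms.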
The method of determining subject information is further described below in conjunction with fig. 3 and the embodiments.
Fig. 3 schematically shows a schematic diagram of a page of content types according to an embodiment of the present disclosure. As shown in fig. 3, the page of the content type includes tags of meta, title, etc. Information in all meta tags can be obtained to serve as meta information, and then semantic extraction is carried out on the meta information to obtain the following original text data: "news, current affairs, time, administration, international, domestic, social, legal, focus, comments, culture, education, depth, web comments, topics, world wide, spread, forum, pictures, military, focus, ranking, environmental protection, campus legal, wonder, true emotion". Next, binary word segmentation is performed on the original text data, and theme information is obtained as follows: "news fashion politics and international and domestic social methods are used for focusing comments on cultural education deep network comment special topic world transmission forum picture military focus ranking environmental protection campus method is used for treating singular true feeling".
The method of determining the page template will be described in detail with reference to fig. 4.
Fig. 4 schematically illustrates a flow diagram of a method of determining a page template in accordance with an embodiment of the present disclosure.
As shown in fig. 4, the method 430 for determining a page template may include operations S431 to S432.
In operation S431, at least one key element in the page of the content type and a path of the at least one key element in a DOM (document object model) node tree are determined.
According to embodiments of the present disclosure, elements, i.e., key elements, respectively corresponding to each key information may be determined according to positions (e.g., including relative positions, absolute positions, etc.), styles (e.g., including fonts, colors, etc.) and the like of elements in a page. The path of the key element in the DOM node tree (DOM-tree) may then be obtained.
In operation S432, a page template is determined from the path of at least one key element in the DOM node tree.
According to embodiments of the present disclosure, after obtaining the paths of the key elements in the DOM node tree, all paths may be stored as page templates.
The method of determining the path of the key element is further described below in conjunction with fig. 5 and the embodiments.
FIG. 5 schematically illustrates a schematic diagram of an index-type page according to an embodiment of the present disclosure. As shown in fig. 5, it can be recognized that the page includes key information such as an article title (title) 51, article content (content) 52, article source (source) 53, and release time (time) 54, based on information such as the position, style, and the like of elements in the index type page.
Correspondingly, the path of the key element corresponding to the key information title 51 can be determined through the DOM node tree as follows:
/html/body/main/section[1]/section[1]/article/h1
The path of the key element corresponding to the key information content 52 is:
/html/body/main/section[1]/section[1]/article/section[2]
The path of the key element corresponding to the key information source 53 is:
/html/body/main/section[1]/section[1]/article/section[1]/div/div/figcaption/figure/h2
The path of the key element corresponding to the key information time 54 is:
/html/body/main/section[1]/section[1]/article/section[1]/div/div/figcaption/figure/time/span
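Paths of this kind can be derived while parsing the page. The sketch below makes several simplifying assumptions: the markup is well formed, every path step carries an index (so it prints `h1[1]` where the example above prints `h1`), and key elements are recognized by a hypothetical CSS class name instead of by position and style as the disclosure describes. `KeyPathCollector` and `extract_page_template` are illustrative names.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable

# Elements that never have a closing tag and so never become parents.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class KeyPathCollector(HTMLParser):
    """Records an indexed DOM path for each element carrying a key class."""

    def __init__(self, key_classes: Iterable[str]) -> None:
        super().__init__()
        self.key_classes = set(key_classes)
        self.stack = []           # path steps of the currently open elements
        self.child_counts = [{}]  # per open element: tag -> children seen
        self.paths: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        step = f"{tag}[{counts[tag]}]"
        for name in (dict(attrs).get("class") or "").split():
            if name in self.key_classes and name not in self.paths:
                self.paths[name] = "/" + "/".join(self.stack + [step])
        if tag not in VOID_TAGS:
            self.stack.append(step)
            self.child_counts.append({})

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.child_counts.pop()


def extract_page_template(html: str,
                          key_classes=("title", "content", "source", "time")):
    collector = KeyPathCollector(key_classes)
    collector.feed(html)
    collector.close()
    return collector.paths


page = ('<html><body><main><section><article><h1 class="title">T</h1>'
        '<section><span class="source">S</span></section>'
        '<section class="content">C</section>'
        '</article></section></main></body></html>')
for key, path in extract_page_template(page).items():
    print(key, path)
```

The returned mapping of key information to path plays the role of the stored page template.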
A method of generating a website signature of a target website according to the page type, the topic information, and the page template will be described in detail below with reference to fig. 6.
FIG. 6 schematically illustrates a flow diagram of a method of generating a website signature for a target website based on page type, topic information, and page templates, in accordance with an embodiment of the present disclosure.
As shown in fig. 6, the method 640 of generating the website signature of the target website according to the page type, the topic information and the page template may include operations S641 to S644.
In operation S641, first signature information is generated according to page types of the plurality of pages.
According to the embodiment of the disclosure, the characteristic strings are spliced to obtain a first string. A signature operation is then performed on the first string to obtain first signature information.
In operation S642, second signature information is generated according to the subject information.
According to an embodiment of the present disclosure, a signature operation may be performed for each subject word in the subject information, respectively. And splicing the signed at least one subject term to obtain a second character string. And then, performing signature operation on the second character string to obtain second signature information.
In operation S643, third signature information is generated according to the page template.
According to the embodiment of the disclosure, the paths of each key element in the page template in the DOM node tree can be spliced to obtain the third character string. And then, performing signature operation on the third character string to obtain third signature information.
According to the embodiment of the disclosure, signing the feature strings, the topic information and the page template separately allows any information in them that contains invisible characters to be represented with visible characters, which makes the signatures easy to copy, paste or transmit.
In operation S644, a website signature of the target website is generated according to the first signature information, the second signature information, and the third signature information.
According to the embodiment of the disclosure, the first signature information, the second signature information and the third signature information can be spliced to obtain a fourth character string. And then, performing signature operation on the fourth character string to obtain a website signature.
According to the embodiment of the disclosure, the first signature information, the second signature information, the third signature information and the website signature may be generated with the same encoding scheme or with different encoding schemes. Illustratively, in this embodiment, base64 encoding is used to generate the first signature information, the second signature information, the third signature information and the website signature. It will be appreciated that encoding schemes other than base64 may be used in practical applications, which is not specifically limited in the present disclosure.
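Operations S641 to S644 can be sketched as follows, using base64 as the signature operation in every step as in this embodiment. The function names `sign` and `website_signature` are illustrative, and plain concatenation without a separator is an assumption about how the strings are spliced.

```python
import base64
from typing import Iterable


def sign(text: str) -> str:
    """Signature operation of this sketch: base64 over the UTF-8 bytes."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def website_signature(feature_strings: Iterable[str],
                      subject_words: Iterable[str],
                      template_paths: Iterable[str]) -> str:
    # S641: splice the feature strings, then sign the first string.
    first = sign("".join(feature_strings))
    # S642: sign each subject word, splice the results, then sign again.
    second = sign("".join(sign(word) for word in subject_words))
    # S643: splice the DOM paths of the key elements, then sign.
    third = sign("".join(template_paths))
    # S644: splice the three partial signatures and sign the result.
    return sign(first + second + third)


signature = website_signature(
    ["news.*.com.cn", "news.*.com.cn/pages/page="],
    ["news", "sports"],
    ["/html/body/main/section[1]/section[1]/article/h1"],
)
print(signature)
```

Since base64 output contains only printable ASCII characters, the resulting signature can be copied or transmitted as ordinary text.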
A method of identifying a web site will be described in detail below in conjunction with fig. 7.
Fig. 7 schematically illustrates a flow diagram of a method of identifying websites in accordance with an embodiment of the present disclosure.
As shown in FIG. 7, the method 700 of identifying a website may include operations S710-S720.
In operation S710, a website signature of a first website and a website signature of a second website are acquired.
According to an embodiment of the present disclosure, the website signatures of the first website and the second website are generated by the method of generating website signatures shown above. The specific generation method of the website signatures of the first website and the second website may refer to the above, and will not be described herein.
In operation S720, it is determined whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website.
According to embodiments of the present disclosure, a similarity between a website signature of a first website and a website signature of a second website may be calculated. And then determining whether the two websites are the same website according to the similarity. For example, the spatial distance between the website signature of the first website and the website signature of the second website may be calculated as a parameter that measures the similarity between the two website signatures. If the spatial distance is smaller than the threshold value, the first website and the second website are determined to be the same website. Otherwise, the first website and the second website are confirmed not to be the same website.
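The threshold comparison can be illustrated as below. The disclosure does not specify the distance measure, so this sketch substitutes a string dissimilarity from Python's `difflib` for the spatial distance; the function names and the threshold value of 0.2 are likewise assumptions.

```python
from difflib import SequenceMatcher


def signature_distance(sig_a: str, sig_b: str) -> float:
    """Distance in [0, 1]: 0 for identical signatures, 1 for nothing shared."""
    matcher = SequenceMatcher(None, sig_a, sig_b, autojunk=False)
    return 1.0 - matcher.ratio()


def is_same_website(sig_a: str, sig_b: str, threshold: float = 0.2) -> bool:
    # The two websites are treated as the same when the distance is below
    # the (hypothetical) threshold.
    return signature_distance(sig_a, sig_b) < threshold


print(is_same_website("bmV3cw==", "bmV3cw=="))  # True
print(is_same_website("aaaa", "zzzz"))          # False
```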
The method of identifying websites according to embodiments of the present disclosure may be used to identify cheating websites. Because the website signature is generated at the granularity of the whole website, it is highly robust: changes to the content of some pages of a website do not cause identification to fail, which makes it more effective to crack down on cheating websites.
The apparatus for generating a website signature will be described in detail with reference to fig. 8.
Fig. 8 schematically illustrates a schematic block diagram of an apparatus for generating website signatures according to an embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for generating a website signature may include a page type determining module 810, a topic information determining module 820, a page template determining module 830, and a website signature generating module 840.
The page type determining module 810 may be configured to determine page types of a plurality of pages of the target website according to the Uniform Resource Locators (URLs) of the plurality of pages, where the page types include at least an index type and a content type.
The topic information determination module 820 may be configured to determine topic information for the index-type pages among the plurality of pages.
The page template determination module 830 may be configured to determine a page template for the content-type pages among the plurality of pages.
The website signature generation module 840 may be configured to generate a website signature of the target website according to the page type, the topic information and the page template.
The apparatus for identifying a web site will be described in detail with reference to fig. 9.
Fig. 9 schematically illustrates a schematic block diagram of an apparatus for identifying websites according to an embodiment of the present disclosure.
As shown in fig. 9, the apparatus 900 for identifying a website may include an acquisition module 910 and an identification module 920.
The acquiring module 910 may be configured to acquire a website signature of a first website and a website signature of a second website.
The identification module 920 may be configured to determine whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website.
Wherein the website signature of the first website and the website signature of the second website are generated according to the method for generating the website signature of the embodiment of the disclosure.
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage and application of the website information and user personal information involved all comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as a method of generating a website signature and/or a method of identifying a website. For example, in some embodiments, the method of generating a website signature and/or the method of identifying a website may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the method of generating a website signature and/or the method of identifying a website described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the method of generating a website signature and/or the method of identifying a website in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (13)

1. A method of generating a website signature, comprising:
determining page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the pages, wherein the page types at least comprise an index type and a content type;
determining theme information for pages of the index type among the plurality of pages;
determining a page template for pages of the content type among the plurality of pages;
generating first signature information according to the page types of the plurality of pages;
generating second signature information according to the theme information;
generating third signature information according to the page template;
splicing the first signature information, the second signature information and the third signature information to obtain a fourth character string; and
executing a signature operation on the fourth character string to obtain the website signature of the target website.
2. The method of claim 1, wherein each of the page types corresponds to one or more feature strings, and the determining of the page types of the plurality of pages according to the URLs of the plurality of pages of the target website comprises:
determining a feature string matching the URL of each of the plurality of pages; and
determining the page type of each page according to the matching feature string.
3. The method of claim 2, wherein the generating of the first signature information according to the page types of the plurality of pages comprises:
splicing the feature strings corresponding to the page types of the plurality of pages to obtain a first character string; and
executing a signature operation on the first character string to obtain the first signature information.
4. The method of claim 1, wherein the determining of the theme information comprises:
acquiring meta information in the page of the content type;
executing a semantic extraction operation on the meta information to obtain original text data; and
normalizing the original text data to obtain at least one subject word serving as the theme information.
5. The method of claim 4, wherein the generating of the second signature information according to the theme information comprises:
performing a signature operation on each of the at least one subject word, respectively;
splicing the at least one signed subject word to obtain a second character string; and
executing a signature operation on the second character string to obtain the second signature information.
6. The method of claim 2, wherein the determining of the page template comprises:
determining at least one key element in a page of the content type and a path of the at least one key element in a Document Object Model (DOM) node tree; and
determining the page template according to the path of the at least one key element in the DOM node tree.
7. The method of claim 6, wherein the generating of the third signature information according to the page template comprises:
splicing, according to the page template, the paths of each of the at least one key element in the DOM node tree to obtain a third character string; and
executing a signature operation on the third character string to obtain the third signature information.
8. The method of any of claims 1-7, further comprising:
acquiring an initial URL of the target website; and
traversing the initial URL with a crawler to obtain the URLs of the plurality of pages of the target website.
9. A method of identifying a web site, comprising:
acquiring a website signature of a first website and a website signature of a second website; and
determining whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website;
wherein the website signature of the first website and the website signature of the second website are generated according to the method of any one of claims 1-8.
10. An apparatus for generating a website signature, comprising:
a page type determining module configured to determine page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the plurality of pages, wherein the page types at least comprise an index type and a content type;
a theme information determining module configured to determine theme information for pages of the index type among the plurality of pages;
a page template determining module configured to determine a page template for pages of the content type among the plurality of pages; and
a website signature generation module configured to generate a website signature of the target website according to the page types, the theme information and the page template,
wherein the website signature generation module is further configured to: generate first signature information according to the page types of the plurality of pages; generate second signature information according to the theme information; generate third signature information according to the page template; splice the first signature information, the second signature information and the third signature information to obtain a fourth character string; and execute a signature operation on the fourth character string to obtain the website signature of the target website.
11. An apparatus for identifying a web site, comprising:
an acquisition module configured to acquire a website signature of a first website and a website signature of a second website; and
an identification module configured to determine whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website;
wherein the website signature of the first website and the website signature of the second website are generated according to the method of any one of claims 1-8.
12. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
13. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
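
For illustration only, the signature flow recited in claims 1-9 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not name a hash function, so MD5 stands in for the "signature operation"; the feature strings in FEATURE_STRINGS, the deduplication and sorting before splicing, the lowercase normalization, and exact equality as the identification test are all hypothetical choices. The subject words and DOM paths are taken as ready-made inputs, since the semantic extraction and key-element selection steps are not specified in the claims.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical feature strings per page type (claim 2); the source gives none.
FEATURE_STRINGS = {
    "index": ["/list", "/category", "/index"],
    "content": ["/article", "/post", "/detail"],
}


def sign(text: str) -> str:
    """Signature operation. MD5 is an assumption; the claims name no algorithm."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def page_type(url: str):
    """Return (page type, matched feature string) for a URL, or (None, None)."""
    path = urlparse(url).path
    for ptype, features in FEATURE_STRINGS.items():
        for feature in features:
            if feature in path:
                return ptype, feature
    return None, None


def first_signature(urls):
    """Claim 3: splice the matched feature strings, then sign the result."""
    # Deduplicating and sorting is an assumption, so crawl order does not matter.
    features = sorted({f for _, f in map(page_type, urls) if f})
    return sign("".join(features))


def second_signature(subject_words):
    """Claim 5: sign each subject word, splice the results, then sign again."""
    # Lowercasing and trimming stand in for the normalization of claim 4.
    normalized = sorted({w.strip().lower() for w in subject_words if w.strip()})
    return sign("".join(sign(w) for w in normalized))


def third_signature(dom_paths):
    """Claim 7: splice the DOM paths of the key elements, then sign."""
    return sign("".join(sorted(set(dom_paths))))


def website_signature(urls, subject_words, dom_paths):
    """Claim 1: splice the three partial signatures and sign the fourth string."""
    fourth = (
        first_signature(urls)
        + second_signature(subject_words)
        + third_signature(dom_paths)
    )
    return sign(fourth)


def same_website(sig_a: str, sig_b: str) -> bool:
    """Claim 9, assuming exact equality of signatures as the test."""
    return sig_a == sig_b


# Two hypothetical mirror sites: different hosts, same structure and themes.
site_a = website_signature(
    ["https://a.example/list/tech", "https://a.example/article/1.html"],
    ["Tech News", "Gadgets"],
    ["html/body/div/h1", "html/body/div/article"],
)
site_b = website_signature(
    ["https://b.example/list/tech", "https://b.example/article/9.html"],
    ["gadgets", "tech news"],
    ["html/body/div/article", "html/body/div/h1"],
)
print(same_website(site_a, site_b))
```

Because only the URL path features, the normalized subject words, and the DOM paths feed the signature, the two hypothetical sites above compare as the same website even though their hosts and article identifiers differ.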
CN202110531404.3A 2021-05-14 2021-05-14 Method for generating website signature, method and device for identifying website Active CN113239256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110531404.3A CN113239256B (en) 2021-05-14 2021-05-14 Method for generating website signature, method and device for identifying website

Publications (2)

Publication Number Publication Date
CN113239256A (en) 2021-08-10
CN113239256B (en) 2024-02-23

Family

ID=77134531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110531404.3A Active CN113239256B (en) 2021-05-14 2021-05-14 Method for generating website signature, method and device for identifying website

Country Status (1)

Country Link
CN (1) CN113239256B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792232B (en) * 2021-09-13 2024-02-27 北京百度网讯科技有限公司 Page feature calculation method, page feature calculation device, electronic equipment, page feature calculation medium and page feature calculation program product
CN114925308B (en) * 2022-04-29 2023-10-03 北京百度网讯科技有限公司 Webpage processing method and device of website, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777053A (en) * 2009-01-08 2010-07-14 北京搜狗科技发展有限公司 Method and system for identifying cheating webpages
CN103684896A (en) * 2012-09-07 2014-03-26 中国科学院计算机网络信息中心 Method of detecting website cheating based on domain name resolution characteristics
CN103685308A (en) * 2013-12-25 2014-03-26 北京奇虎科技有限公司 Detection method and system of phishing web pages, client and server
US9386037B1 (en) * 2015-09-16 2016-07-05 RiskIQ Inc. Using hash signatures of DOM objects to identify website similarity
CN108304584A (en) * 2018-03-06 2018-07-20 百度在线网络技术(北京)有限公司 Illegal page detection method, apparatus, intruding detection system and storage medium
CN109347786A (en) * 2018-08-14 2019-02-15 国家计算机网络与信息安全管理中心 Detection method for phishing site
CN111753171A (en) * 2020-06-09 2020-10-09 北京天空卫士网络安全技术有限公司 Malicious website identification method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100485321B1 (en) * 2003-04-04 2005-04-27 엔에이치엔(주) A method of managing web sites registered in search engine and a system thereof
US8701185B2 (en) * 2008-10-14 2014-04-15 At&T Intellectual Property I, L.P. Method for locating fraudulent replicas of web sites
US9553783B2 (en) * 2012-09-14 2017-01-24 Salesforce.Com, Inc. Spam flood detection methodologies

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
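Research on a phishing webpage detection method based on SVM-RFE; Wang Ting; Peng Yong; Dai Zhonghua; Yi Shengwei; Han Lansheng; Journal of Huazhong University of Science and Technology (Natural Science Edition) (S2); 150-153 *
Research on search engine anti-cheating methods; Wang Qingfu; Wang Xingguo; Computer Knowledge and Technology (15); 208-209 *
Research on identification methods and prevention strategies for phishing websites; Guo Ping; Journal of Zhejiang Textile and Fashion Vocational and Technical College (03); 70-75 *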

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant