CN113239256A - Method for generating website signature and method and device for identifying website - Google Patents

Publication number
CN113239256A
Authority
CN
China
Prior art keywords
website
signature
page
information
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110531404.3A
Other languages
Chinese (zh)
Other versions
CN113239256B (en)
Inventor
刘伟
余文利
张博
林赛群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110531404.3A
Publication of CN113239256A
Application granted
Publication of CN113239256B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure provides a method for generating a website signature and a method and a device for identifying a website, and relates to the technical field of computers, in particular to the technical field of website identification. The specific implementation scheme is as follows: a method of generating a website signature comprises: determining page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the pages, wherein the page types at least comprise an index type and a content type; determining subject information for an index-type page of the plurality of pages; determining a page template for a content-type page of the plurality of pages; and generating the website signature of the target website according to the page type, the subject information and the page template.

Description

Method for generating website signature and method and device for identifying website
Technical Field
The present disclosure relates to the field of computer technology, and more particularly, to the field of website identification technology.
Background
The internet produces a large amount of data every day, much of which is junk data. Cheating websites contain a particularly large share of it. To raise the ranking of their pages in search engines, designers of cheating websites use a range of methods to deceive the search engine. As a result, cheating websites contain a great deal of meaningless junk data, and the pages seen by users are inconsistent with the content crawled by the search engine. These cheating websites seriously harm the internet ecosystem and infringe on the rights and interests of users.
Disclosure of Invention
The present disclosure provides a method for generating a website signature, and a method, an apparatus, a device and a storage medium for identifying a website.
According to an aspect of the present disclosure, there is provided a method of generating a website signature, including: determining page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the pages, wherein the page types at least comprise an index type and a content type; determining subject information for an index type page of the plurality of pages; determining a page template aiming at the page with the content type in the plurality of pages; and generating the website signature of the target website according to the page type, the subject information and the page template.
According to another aspect of the present disclosure, there is provided a method of identifying a website, including: acquiring a website signature of a first website and a website signature of a second website; determining whether the first website and the second website are the same website or not according to the website signature of the first website and the website signature of the second website; the website signature of the first website and the website signature of the second website are generated according to the method for generating the website signature of the embodiment of the disclosure.
Another aspect of the disclosure provides a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided an apparatus for generating a website signature, including: a page type determining module, configured to determine page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the pages, wherein the page types at least comprise an index type and a content type; a topic information determining module, configured to determine topic information for an index-type page of the plurality of pages; a page template determining module, configured to determine a page template for a content-type page of the plurality of pages; and a website signature generating module, configured to generate the website signature of the target website according to the page type, the topic information and the page template.
According to another aspect of the present disclosure, there is provided an apparatus for identifying a website, including: the acquisition module is used for acquiring a website signature of a first website and a website signature of a second website; the identification module is used for determining whether the first website and the second website are the same website or not according to the website signature of the first website and the website signature of the second website; the website signature of the first website and the website signature of the second website are generated according to the method for generating the website signature of the embodiment of the disclosure.
According to another aspect of the disclosed embodiments, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method shown in the disclosed embodiments.
According to another aspect of the embodiments of the present disclosure, there is provided a computer program product including a computer program which, when executed by a processor, implements the method shown in the embodiments of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically shows a flow diagram of a method of generating a website signature according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of determining topic information in accordance with an embodiment of the present disclosure;
FIG. 3 schematically shows a schematic diagram of a page of content types according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow diagram of a method of determining a page template according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a diagram of an index-type page according to an embodiment of the disclosure;
FIG. 6 is a schematic flow chart diagram illustrating a method for generating a website signature of a target website according to a page type, topic information and a page template according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of a method of identifying a website according to an embodiment of the present disclosure;
FIG. 8 schematically shows a schematic block diagram of an apparatus for generating a website signature according to an embodiment of the present disclosure;
FIG. 9 schematically shows a schematic block diagram of an apparatus for identifying a website according to an embodiment of the present disclosure; and
FIG. 10 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The method of generating web site signatures is described in detail below in conjunction with fig. 1.
Fig. 1 schematically shows a flow diagram of a method of generating a website signature according to an embodiment of the present disclosure.
As shown in fig. 1, the method 100 of generating a website signature may include operations S110 to S140.
In operation S110, a page type of a plurality of pages is determined according to URLs (uniform resource locators) of the plurality of pages of the target website.
According to an embodiment of the present disclosure, the page types include at least an index type and a content type. The index-type web page may be used to index other web pages, and the content-type web page may be used to display specific content information.
According to an embodiment of the present disclosure, an initial URL of the target website may be obtained, where the initial URL may include, for example, the URL of the home page of the target website. A web crawler (web spider) then traverses the site starting from the initial URL to obtain the URLs of a plurality of pages of the target website.
According to an embodiment of the present disclosure, one or more characteristic character strings are configured in advance for each page type, where the characteristic character strings can be used to match the structural characteristics of a URL. For each of the plurality of pages, the characteristic character string that matches the URL of the page may be determined. The page type of each page is then determined according to the matched characteristic character string.
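The traversal described above can be sketched as a breadth-first walk over links. This is only an illustration under stated assumptions: the function name `collect_urls`, the `get_links` callback (which stands in for fetching a page and extracting its links), the `max_pages` limit, and the toy `SITE` link graph are all hypothetical and are not part of the disclosure.

```python
from collections import deque
from typing import Callable, Iterable, List


def collect_urls(initial_url: str,
                 get_links: Callable[[str], Iterable[str]],
                 max_pages: int = 1000) -> List[str]:
    """Breadth-first traversal starting at initial_url.

    get_links(url) is assumed to return the URLs linked from that page.
    Returns the discovered URLs in visit order.
    """
    seen = {initial_url}
    queue = deque([initial_url])
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # visit each URL at most once
                seen.add(link)
                queue.append(link)
    return order


# Toy in-memory link graph standing in for real HTTP fetches (assumption).
SITE = {
    "http://example.test/": ["http://example.test/a", "http://example.test/b"],
    "http://example.test/a": ["http://example.test/b", "http://example.test/"],
    "http://example.test/b": [],
}
urls = collect_urls("http://example.test/", lambda u: SITE.get(u, []))
```

Passing the link extractor as a callback keeps the sketch runnable without network access; a real crawler would also need to restrict links to the target website's own domain.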
In this embodiment, for example, the characteristic character string corresponding to the index type may include: com.cn, finish, com.cn, sports, com.cn, and the like. The characteristic character string corresponding to the content type may include, for example: com.cn/pages/page id ═ fine. Where denotes any character or combination of characters.
For example, for a page whose URL is news. For another page with URL of news, xxx, com, pages/pageid 1, it may be determined that the URL matches the characteristic string news, com, cn, pages/pageid, and accordingly, the page may be determined to be of the content type.
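As a rough sketch of this matching step, the snippet below classifies a URL by testing it against wildcard patterns configured per page type. The patterns, the `.example` domain, and the names `FEATURE_STRINGS` and `classify` are hypothetical placeholders chosen for illustration; they are not the characteristic strings of the disclosure. The sketch assumes that `*` stands for any character or combination of characters.

```python
import re
from typing import Dict, List, Optional

# Hypothetical characteristic strings per page type; "*" is a wildcard.
FEATURE_STRINGS: Dict[str, List[str]] = {
    "content": ["news.*.example/pages/pageid=*"],
    "index": ["news.*.example", "sports.*.example"],
}


def _to_regex(feature: str) -> "re.Pattern[str]":
    # Escape the literal parts, then turn each escaped "*" back into ".*".
    return re.compile("^" + re.escape(feature).replace(r"\*", ".*") + "$")


def classify(url: str) -> Optional[str]:
    """Return the first page type with a characteristic string matching url."""
    for page_type, features in FEATURE_STRINGS.items():
        for feature in features:
            if _to_regex(feature).match(url):
                return page_type
    return None


index_type = classify("news.xxx.example")
content_type = classify("news.xxx.example/pages/pageid=1")
```

Anchoring each pattern at both ends keeps an index pattern from also matching the longer content-type URLs.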
In operation S120, topic information is determined for a page of an index type (or directory type) in a plurality of pages.
According to an embodiment of the present disclosure, the topic information may be used to represent a topic of a website. Exemplarily, in the present embodiment, the topic information may be determined according to meta information in a page of a content type.
In operation S130, a page template is determined for a page of a content type among the plurality of pages.
According to embodiments of the present disclosure, a page template may be used to represent a standardized format of a page. For example, in this embodiment, the key elements and the paths of the key elements in the DOM node tree may be extracted from the page of the content type, and the page template may be determined according to the paths of the key elements in the DOM node tree. The key elements may include, for example, elements in a page used for representing key information, where the key information may include, for example, title (title), content (content), source (source), time (time), and other information.
In operation S140, a website signature of the target website is generated according to the page type, the topic information, and the page template.
According to the embodiment of the disclosure, the website signature of the target website is generated according to the page type, the subject information and the page template, so the method for generating the website signature is simple. In addition, the website is signed at the granularity of the whole website, which avoids the website signature becoming invalid when the content of a single page changes.
According to other embodiments of the present disclosure, the web page types may include other page types in addition to the index type and the content type.
Illustratively, in another embodiment of the present disclosure, the page type may further include a function type and an auxiliary type. Wherein, the characteristic character string corresponding to the function type may include: com. The feature string corresponding to the auxiliary type may include: com. Correspondingly, when the first signature information is generated, the feature character strings corresponding to the index type, the content type, the function type and the auxiliary type may be spliced to obtain a first character string, and then the first character string is signed to obtain the first signature information.
The method of determining the subject information will be described in detail below with reference to fig. 2.
Fig. 2 schematically shows a flow diagram of a method of determining topic information according to an embodiment of the present disclosure.
As shown in fig. 2, the method 220 of determining topic information may include operations S221 to S223.
In operation S221, meta information in a page of a content type is acquired.
According to an embodiment of the present disclosure, information within a meta tag in a page may be acquired as meta information.
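One possible way to collect the information inside meta tags is shown below, using Python's standard `html.parser`. The class `MetaCollector`, the function `extract_meta`, and the `SAMPLE` page are illustrative assumptions; the disclosure does not prescribe a parser or say which meta attributes are used, so this sketch simply keeps every non-empty `content` attribute.

```python
from html.parser import HTMLParser
from typing import List


class MetaCollector(HTMLParser):
    """Collects the content attribute of every <meta> tag that has one."""

    def __init__(self) -> None:
        super().__init__()
        self.contents: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            content = dict(attrs).get("content")
            if content:
                self.contents.append(content)


def extract_meta(html: str) -> List[str]:
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.contents


# Hypothetical sample page used only to exercise the sketch.
SAMPLE = """<html><head>
<meta charset="utf-8">
<meta name="keywords" content="news, politics, culture">
<meta name="description" content="A sample news portal">
<title>Sample</title></head><body></body></html>"""
meta = extract_meta(SAMPLE)
```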
In operation S222, a semantic extraction operation is performed on the meta information, resulting in original text data.
According to the embodiment of the disclosure, performing semantic extraction on the meta information can at least partially avoid the poor compatibility of a whole-site signature that is caused by changes in page content.
In operation S223, the raw text data is normalized to obtain at least one topic word as topic information.
According to an embodiment of the present disclosure, the normalization process may include, for example, bigram (binary word segmentation) processing. In this embodiment, bigram processing may be performed on the original text data to divide it into a plurality of two-character words, that is, keywords. It should be noted that, in practical applications, the original text data may also be normalized in other ways, which is not specifically limited in this disclosure.
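A minimal sketch of bigram processing follows. It assumes that the original text data is a list of terms separated by commas or whitespace and that each term is cut into overlapping two-character units; the separator set, the overlap, the de-duplication, and the function name `bigrams` are assumptions made for illustration and are not specified by the disclosure.

```python
import re
from typing import List


def bigrams(raw_text: str) -> List[str]:
    """Split raw_text into terms, then cut each term into two-character units."""
    topic_words: List[str] = []
    for term in re.split(r"[\s,，、;；]+", raw_text):
        if not term:
            continue
        # A one-character term is kept as is; longer terms use a sliding window.
        units = [term] if len(term) < 2 else [term[i:i + 2] for i in range(len(term) - 1)]
        for unit in units:
            if unit not in topic_words:  # keep first occurrence only
                topic_words.append(unit)
    return topic_words


words = bigrams("新闻,时政,互联网")
```

In this sketch the three-character term 互联网 yields the two units 互联 and 联网, while two-character terms pass through unchanged.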
The method for determining the subject information is further described below with reference to fig. 3 and the specific embodiment.
FIG. 3 schematically shows a schematic diagram of a page of content types according to an embodiment of the disclosure. As shown in fig. 3, the page of the content type includes tags of meta, title, and the like. All information in the meta tags can be acquired as meta information, and then the meta information is subjected to semantic extraction to obtain the following original text data: "news, current affairs, politics, international, domestic, social, treatment, focus, comment, culture, education, depth, online comment, topical, global, dissemination, forum, picture, military, focus, platoon, environmental protection, campus treatment, fantasy, true feelings". Then, binary word segmentation is performed on the original text data to obtain the subject information as follows: "national and national sociological regulations focus on commenting cultural education deep online assessment special topic environmental protection forum picture military focus arrange line environmental protection campus law on true feeling of news at the time of news".
The method of determining the page template will be described in detail below with reference to fig. 4.
Fig. 4 schematically shows a flow diagram of a method of determining a page template according to an embodiment of the present disclosure.
As shown in fig. 4, the method 430 of determining a page template may include operations S431 to S432.
In operation S431, at least one key element and a path of the at least one key element in the page of the content type in a DOM (document object model) node tree are determined.
According to an embodiment of the present disclosure, elements respectively corresponding to each key information, i.e., key elements, may be determined according to positions (including, for example, relative positions, absolute positions, etc.), styles (including, for example, fonts, colors, etc.) and the like of the elements in the page. The path of the key element in the DOM node tree (DOM-tree) may then be obtained.
In operation S432, a page template is determined according to a path of at least one key element in the DOM node tree.
According to an embodiment of the present disclosure, after obtaining the paths of the key elements in the DOM node tree, all the paths may be stored as page templates.
The method for determining the path of the critical element is further described below with reference to fig. 5 and the specific embodiment.
FIG. 5 schematically shows a schematic diagram of an index-type page according to an embodiment of the disclosure. As shown in fig. 5, it can be identified that the page includes key information such as article title (title)51, article content (content)52, article source (source)53, and publication time (time)54 according to information such as position, style, etc. of elements in the index-type page.
Correspondingly, the path of the key element corresponding to the key information title 51 can be determined through the DOM node tree as:
/html/body/main/section[1]/section[1]/article/h1
The path of the key element corresponding to the key information content 52 is:
/html/body/main/section[1]/section[1]/article/section[2]
The path of the key element corresponding to the key information source 53 is:
/html/body/main/section[1]/section[1]/article/section[1]/div/div/figcaption/figure/h2
The path of the key element corresponding to the key information time 54 is:
/html/body/main/section[1]/section[1]/article/section[1]/div/div/figcaption/figure/time/span
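Paths of the kind listed above can be derived by walking the node tree from the root to the key element. The sketch below does this with `xml.etree.ElementTree` on a small well-formed page. It is an illustration only: the `PAGE` markup and the function `element_path` are hypothetical, the key elements are picked by hand rather than by position or style, and an index such as `[2]` is added only when a tag has several same-named siblings, which may differ from the indexing convention used in the paths above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def element_path(root: ET.Element, target: ET.Element) -> Optional[str]:
    """Return an XPath-like path from root to target, or None if not found."""

    def walk(node: ET.Element, prefix: str) -> Optional[str]:
        if node is target:
            return prefix
        totals: Dict[str, int] = {}
        for child in node:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen: Dict[str, int] = {}
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if totals[child.tag] > 1:  # disambiguate same-named siblings (1-based)
                step += f"[{seen[child.tag]}]"
            found = walk(child, f"{prefix}/{step}")
            if found:
                return found
        return None

    return walk(root, "/" + root.tag)


# Hypothetical well-formed page; real HTML would need a tolerant parser.
PAGE = """<html><body><main><section><article>
<h1>Title</h1>
<section><time>publication time</time></section>
<section><p>Body text</p></section>
</article></section></main></body></html>"""
root = ET.fromstring(PAGE)
title = root.find(".//h1")
content = root.findall(".//article/section")[1]
template = {"title": element_path(root, title),
            "content": element_path(root, content)}
```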
A method for generating a website signature of a target website according to a page type, topic information, and a page template will be described in detail below with reference to fig. 6.
Fig. 6 is a flowchart schematically illustrating a method for generating a website signature of a target website according to a page type, topic information and a page template according to an embodiment of the present disclosure.
As shown in fig. 6, the method 640 of generating a website signature of a target website according to a page type, topic information, and a page template may include operations S641 through S644.
In operation S641, first signature information is generated according to the page types of the plurality of pages.
According to the embodiment of the disclosure, the characteristic character strings are spliced to obtain the first character string. Then, a signature operation is performed on the first character string to obtain first signature information.
In operation S642, second signature information is generated according to the topic information.
According to an embodiment of the present disclosure, a signature operation may be performed separately for each topic word in the topic information. And splicing at least one signed subject word to obtain a second character string. And then executing signature operation on the second character string to obtain second signature information.
In operation S643, third signature information is generated according to the page template.
According to the embodiment of the disclosure, the paths of each key element in the page template in the DOM node tree can be spliced to obtain the third character string. And then executing signature operation on the third character string to obtain third signature information.
According to the embodiment of the disclosure, the characteristic character strings, the subject information and the page template are signed separately, so that any invisible characters they contain are represented by visible characters, which makes the information convenient to copy, paste or transmit.
In operation S644, a website signature of the target website is generated according to the first signature information, the second signature information, and the third signature information.
According to the embodiment of the disclosure, the first signature information, the second signature information and the third signature information may be spliced to obtain the fourth character string. And then executing signature operation on the fourth character string to obtain a website signature.
According to the embodiments of the present disclosure, the same encoding method may be used in the process of generating the first signature information, the second signature information, the third signature information, and the website signature, or different encoding methods may be used. In this embodiment, the encoding method of base64 is used for generating the first signature information, the second signature information, the third signature information, and the website signature. It is understood that, in the practical application, other coding manners besides base64 may be adopted, and the present disclosure is not limited thereto.
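Operations S641 to S644 can be sketched as follows, with base64 encoding playing the role of the signature operation as in this embodiment. The function names `sign` and `website_signature`, the UTF-8 text encoding, and the use of plain concatenation without a separator when splicing are assumptions for illustration; the disclosure does not fix these details.

```python
import base64
from typing import Iterable


def sign(text: str) -> str:
    """Signature operation of this sketch: base64 over the UTF-8 bytes."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def website_signature(feature_strings: Iterable[str],
                      topic_words: Iterable[str],
                      template_paths: Iterable[str]) -> str:
    # S641: splice the characteristic strings, then sign.
    first = sign("".join(feature_strings))
    # S642: sign each topic word, splice the signed words, then sign again.
    second = sign("".join(sign(word) for word in topic_words))
    # S643: splice the DOM paths of the page template, then sign.
    third = sign("".join(template_paths))
    # S644: splice the three signatures and sign the result.
    return sign(first + second + third)


signature = website_signature(["a"], ["b"], ["/c"])
```

Because base64 is a reversible encoding rather than a hash, the resulting signature grows with its inputs and can be decoded back to the spliced parts.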
The method of identifying a website will be described in detail below with reference to fig. 7.
Fig. 7 schematically shows a flowchart of a method of identifying a website according to an embodiment of the present disclosure.
As shown in fig. 7, the method 700 of identifying a website may include operations S710 to S720.
In operation S710, a website signature of a first website and a website signature of a second website are acquired.
According to an embodiment of the present disclosure, the website signatures of the first website and the second website are generated by the method for generating website signatures shown above. For a specific generation method of the website signatures of the first website and the second website, reference may be made to the above description, which is not repeated herein.
In operation S720, it is determined whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website.
According to an embodiment of the present disclosure, a similarity between a website signature of a first website and a website signature of a second website may be calculated. And then determining whether the two websites are the same website according to the similarity. For example, a spatial distance between a website signature of a first website and a website signature of a second website may be calculated as a parameter for measuring similarity between the two website signatures. And if the spatial distance is smaller than the threshold value, determining that the first website and the second website are the same website. Otherwise, the first website and the second website are not the same website.
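The comparison step might look like the sketch below. The disclosure does not say which spatial distance is used, so this example substitutes the Levenshtein edit distance between the two signature strings as a stand-in. The function names and the threshold value of 4 are likewise assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def is_same_website(sig_a: str, sig_b: str, threshold: int = 4) -> bool:
    """Treat two websites as the same if their signatures are close enough."""
    return edit_distance(sig_a, sig_b) < threshold


same = is_same_website("abcd", "abcd")
different = is_same_website("abcd", "wxyz")
```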
The method of identifying a website according to an embodiment of the present disclosure may be used to identify cheating websites. Because the website signature is based on whole-website granularity, it is highly compatible: a change in the content of some pages of the website does not cause identification to fail, and the effectiveness of combating cheating websites can be improved.
The means for generating the web site signatures will be described in detail below in connection with fig. 8.
Fig. 8 schematically shows a schematic block diagram of an apparatus for generating a website signature according to an embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for generating a website signature may include a page type determining module 810, a topic information determining module 820, a page template determining module 830, and a website signature generating module 840.
The page type determining module 810 may be configured to determine page types of a plurality of pages of the target website according to Uniform Resource Locators (URLs) of the plurality of pages, where the page types include at least an index type and a content type.
The topic information determining module 820 may be configured to determine topic information for an index type page of the plurality of pages.
The page template determining module 830 may be configured to determine a page template for a page of a content type in a plurality of pages.
The website signature generating module 840 may be configured to generate a website signature of a target website according to the page type, the topic information, and the page template.
The means for identifying a website will be described in detail below with reference to fig. 9.
Fig. 9 schematically shows a schematic block diagram of an apparatus for identifying a website according to an embodiment of the present disclosure.
As shown in fig. 9, the apparatus 900 for identifying a website may include an obtaining module 910 and an identifying module 920.
The obtaining module 910 may be configured to obtain a website signature of a first website and a website signature of a second website.
The identifying module 920 may be configured to determine whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website.
The website signature of the first website and the website signature of the second website are generated according to the method for generating the website signature of the embodiment of the disclosure.
In the technical solution of the present disclosure, the acquisition, storage, application, and the like of the related website information or the personal information of users all comply with relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Computing unit 1001 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1001 executes the respective methods and processes described above, such as the method of generating a website signature and/or the method of identifying a website. For example, in some embodiments, the method of generating a website signature and/or the method of identifying a website may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communications unit 1009. When the computer program is loaded into RAM 1003 and executed by the computing unit 1001, one or more steps of the method of generating a website signature and/or the method of identifying a website described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of generating a website signature and/or the method of identifying a website.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
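As an illustration of steps being executed in parallel rather than sequentially, the following is a minimal sketch and not part of the disclosed method. It assumes that the steps are independent of one another and can each be wrapped as a zero-argument callable; the function name `run_independent_steps` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_independent_steps(steps):
    """Run zero-argument callables concurrently.

    Results are returned in the same order as the input list, so the caller
    gets the same output as a sequential loop would produce.
    """
    # Assumption: the steps do not depend on each other's results.
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        futures = [pool.submit(step) for step in steps]
        return [future.result() for future in futures]
```

For example, three independent computations could be submitted together and their results consumed in a fixed order once all have finished.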
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (16)

1. A method of generating a website signature, comprising:
determining page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the pages, wherein the page types at least comprise an index type and a content type;
determining subject information for an index type page of the plurality of pages;
determining a page template for a content type page of the plurality of pages; and
generating the website signature of the target website according to the page type, the subject information and the page template.
2. The method of claim 1, wherein generating a website signature for the target website based on the page type, the subject information, and the page template comprises:
generating first signature information according to the page types of the multiple pages;
generating second signature information according to the subject information;
generating third signature information according to the page template; and
generating a website signature of the target website according to the first signature information, the second signature information and the third signature information.
3. The method of claim 2, wherein each of the page types corresponds to one or more characteristic character strings, and the determining of the page types of the plurality of pages according to the URLs of the plurality of pages of the target website comprises:
determining a characteristic character string matching the URL of each of the plurality of pages; and
determining the page type of each page according to the matched characteristic character string.
4. The method of claim 3, wherein the generating first signature information according to the page types of the plurality of pages comprises:
splicing the characteristic character strings corresponding to the page types of the plurality of pages to obtain a first character string; and
executing a signature operation on the first character string to obtain the first signature information.
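The URL matching and splicing recited in claims 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the characteristic character strings in `CHARACTERISTIC_STRINGS` are hypothetical, SHA-256 stands in for the unspecified signature operation, and sorting and de-duplicating the matched strings before splicing is an added choice that makes the result independent of crawl order.

```python
import hashlib

# Hypothetical characteristic character strings; the claims do not list any.
CHARACTERISTIC_STRINGS = {
    "index": ["/list", "/category", "/index"],
    "content": ["/article", "/detail", "/post"],
}


def match_characteristic_string(url):
    """Return (page_type, characteristic_string) for the first match, else None."""
    for page_type, strings in CHARACTERISTIC_STRINGS.items():
        for candidate in strings:
            if candidate in url:
                return page_type, candidate
    return None


def first_signature(urls):
    """Splice the matched characteristic strings and hash the spliced string."""
    matched = []
    for url in urls:
        hit = match_characteristic_string(url)
        if hit is not None:
            matched.append(hit[1])
    # Assumption: sort and de-duplicate so that crawl order does not matter.
    first_string = "|".join(sorted(set(matched)))
    # Assumption: SHA-256 is used as the signature operation.
    return hashlib.sha256(first_string.encode("utf-8")).hexdigest()
```

Simple substring matching is used here for brevity; a real classifier might instead rely on regular expressions or path-segment rules.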
5. The method of claim 2, wherein the determining subject information comprises:
acquiring meta information in the page of the content type;
performing a semantic extraction operation on the meta information to obtain original text data; and
normalizing the original text data to obtain at least one subject term as the subject information.
6. The method of claim 5, wherein the generating second signature information from the subject information comprises:
executing a signature operation on each subject term of the at least one subject term;
splicing the signed subject terms to obtain a second character string; and
executing a signature operation on the second character string to obtain the second signature information.
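The flow of claims 5 and 6 (meta information, text extraction, normalization, per-term signing, splicing, final signing) might look like the sketch below. Several simplifications are assumed: the "semantic extraction" is reduced to reading the `content` of `<meta name="keywords">` tags, normalization is lowercasing, trimming and de-duplication, and SHA-256 stands in for the unspecified signature operation. All function and class names are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the content attribute of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.raw_text = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords" and attr.get("content"):
            self.raw_text.append(attr["content"])


def subject_terms(html):
    """Extract and normalize subject terms from a page's meta information."""
    collector = MetaCollector()
    collector.feed(html)
    terms = set()
    for text in collector.raw_text:
        # Treat the full-width comma as a separator as well.
        for piece in text.replace("，", ",").split(","):
            term = piece.strip().lower()
            if term:
                terms.add(term)
    return sorted(terms)


def sign(text):
    """Assumed signature operation: SHA-256 hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def second_signature(terms):
    """Sign each term, splice the signed terms, then sign the spliced string."""
    signed_terms = [sign(term) for term in terms]
    return sign("".join(signed_terms))
```

Because each signed term has a fixed length, the spliced string can be concatenated without a separator and still be unambiguous.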
7. The method of claim 2, wherein the determining a page template comprises:
determining at least one key element in the page of the content type and a path of the at least one key element in a Document Object Model (DOM) node tree; and
determining the page template according to the path of the at least one key element in the DOM node tree.
8. The method of claim 7, wherein the generating third signature information from the page template comprises:
splicing, according to the page template, the path of each key element of the at least one key element in the DOM node tree to obtain a third character string; and
executing a signature operation on the third character string to obtain the third signature information.
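A rough sketch of claims 7 and 8 follows: it records the tag path from the document root to each key element, treats the set of paths as the page template, and hashes the spliced paths. The choice of key elements in `KEY_TAGS` is hypothetical, the path format (`html/body/...`) is an assumption, and SHA-256 stands in for the unspecified signature operation. The sketch expects reasonably well-formed HTML.

```python
import hashlib
from html.parser import HTMLParser

# Void elements have no end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Hypothetical key elements; the claims do not say which elements are "key".
KEY_TAGS = {"h1", "article", "time"}


class PathCollector(HTMLParser):
    """Tracks open tags and records the path of every key element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in KEY_TAGS:
            self.paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def page_template(html):
    """Return the sorted, de-duplicated key-element paths of one page."""
    collector = PathCollector()
    collector.feed(html)
    return sorted(set(collector.paths))


def third_signature(template):
    """Splice the key-element paths and hash the spliced string."""
    third_string = "|".join(template)
    return hashlib.sha256(third_string.encode("utf-8")).hexdigest()
```

Two pages with different text but the same layout yield the same template, which is the property that makes the template useful as a site-level feature.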
9. The method of claim 2, wherein generating a website signature of the target website according to the first signature information, the second signature information, and the third signature information comprises:
splicing the first signature information, the second signature information and the third signature information to obtain a fourth character string; and
executing a signature operation on the fourth character string to obtain the website signature.
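The final combination step of claim 9 is small enough to sketch directly. As before, SHA-256 is an assumed stand-in for the signature operation, and the function name `website_signature` is hypothetical.

```python
import hashlib


def website_signature(first, second, third):
    """Splice the three pieces of signature information and sign the result."""
    # The splicing order is fixed (first, second, third); changing the order
    # changes the signature.
    fourth_string = first + second + third
    # Assumption: SHA-256 is used as the signature operation.
    return hashlib.sha256(fourth_string.encode("utf-8")).hexdigest()
```

If the three inputs are fixed-length digests, plain concatenation is unambiguous; with variable-length inputs a separator would be a safer choice.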
10. The method according to any one of claims 1-9, further comprising:
acquiring an initial URL of the target website; and
traversing from the initial URL using a crawler to obtain the URLs of the plurality of pages of the target website.
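The traversal in claim 10 could be sketched as a breadth-first crawl restricted to the host of the initial URL. The claim does not specify a traversal order, a page limit, or a same-host restriction, so all three are assumptions here. To keep the sketch runnable without network access, link extraction is delegated to a caller-supplied `fetch_links` function, which is hypothetical.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(initial_url, fetch_links, max_pages=100):
    """Breadth-first traversal that stays on the host of the initial URL.

    fetch_links(url) is a caller-supplied function returning the href values
    found on the page at url.
    """
    host = urlparse(initial_url).netloc
    seen = {initial_url}
    queue = deque([initial_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

In a test, `fetch_links` can be backed by a dictionary that maps each URL to its outgoing links; in practice it would download and parse the page.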
11. A method of identifying a website, comprising:
acquiring a website signature of a first website and a website signature of a second website; and
determining whether the first website and the second website are the same website or not according to the website signature of the first website and the website signature of the second website;
wherein the website signature of the first website and the website signature of the second website are generated according to the method of any one of claims 1-10.
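The comparison in claim 11 can be illustrated with a short sketch. It assumes that two websites are treated as the same website when their signatures are exactly equal; the claim itself only says the determination is made "according to" the two signatures. Grouping several hosts by signature is an added convenience, not something the claim recites, and both function names are hypothetical.

```python
from collections import defaultdict


def is_same_website(signature_a, signature_b):
    """Assumption: identical signatures indicate the same website."""
    return signature_a == signature_b


def group_by_signature(signatures):
    """Group hosts that share a signature.

    signatures maps a host name to its website signature.
    """
    groups = defaultdict(list)
    for host, signature in signatures.items():
        groups[signature].append(host)
    return [sorted(hosts) for hosts in groups.values()]
```

Grouping by signature avoids pairwise comparison when many candidate sites need to be checked at once.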
12. An apparatus for generating a website signature, comprising:
a page type determining module configured to determine page types of a plurality of pages of a target website according to Uniform Resource Locators (URLs) of the pages, wherein the page types at least comprise an index type and a content type;
a subject information determining module configured to determine subject information for an index type page of the plurality of pages;
a page template determining module configured to determine a page template for a content type page of the plurality of pages; and
a website signature generating module configured to generate the website signature of the target website according to the page type, the subject information and the page template.
13. An apparatus for identifying a website, comprising:
an acquisition module configured to acquire a website signature of a first website and a website signature of a second website; and
an identification module configured to determine whether the first website and the second website are the same website according to the website signature of the first website and the website signature of the second website;
wherein the website signature of the first website and the website signature of the second website are generated according to the method of any one of claims 1-11.
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
15. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-11.
16. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-11.
CN202110531404.3A 2021-05-14 2021-05-14 Method for generating website signature, method and device for identifying website Active CN113239256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110531404.3A CN113239256B (en) 2021-05-14 2021-05-14 Method for generating website signature, method and device for identifying website

Publications (2)

Publication Number Publication Date
CN113239256A true CN113239256A (en) 2021-08-10
CN113239256B CN113239256B (en) 2024-02-23

Family

ID=77134531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110531404.3A Active CN113239256B (en) 2021-05-14 2021-05-14 Method for generating website signature, method and device for identifying website

Country Status (1)

Country Link
CN (1) CN113239256B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060218168A1 (en) * 2003-04-04 2006-09-28 Na Sun K Method of managing web sites registered in search engine and a system thereof
US20100095375A1 (en) * 2008-10-14 2010-04-15 Balachander Krishnamurthy Method for locating fraudulent replicas of web sites
CN101777053A (en) * 2009-01-08 2010-07-14 北京搜狗科技发展有限公司 Method and system for identifying cheating webpages
US20140082182A1 (en) * 2012-09-14 2014-03-20 Salesforce.Com, Inc. Spam flood detection methodologies
CN103684896A (en) * 2012-09-07 2014-03-26 中国科学院计算机网络信息中心 Method of detecting website cheating based on domain name resolution characteristics
CN103685308A (en) * 2013-12-25 2014-03-26 北京奇虎科技有限公司 Detection method and system of phishing web pages, client and server
US9386037B1 (en) * 2015-09-16 2016-07-05 RiskIQ Inc. Using hash signatures of DOM objects to identify website similarity
CN108304584A (en) * 2018-03-06 2018-07-20 百度在线网络技术(北京)有限公司 Illegal page detection method, apparatus, intruding detection system and storage medium
CN109347786A (en) * 2018-08-14 2019-02-15 国家计算机网络与信息安全管理中心 Detection method for phishing site
CN111753171A (en) * 2020-06-09 2020-10-09 北京天空卫士网络安全技术有限公司 Malicious website identification method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
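王婷; 彭勇; 戴忠华; 伊胜伟; 韩兰胜: "Research on a phishing web page detection method based on SVM-RFE", 华中科技大学学报(自然科学版), no. 2, pages 150 - 153 *
王庆福; 王兴国: "Research on search engine anti-cheating methods", 电脑知识与技术, no. 15, pages 208 - 209 *
郭萍: "Research on identification methods and prevention strategies for phishing websites", 浙江纺织服装职业技术学院学报, no. 03, pages 70 - 75 *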

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792232A (en) * 2021-09-13 2021-12-14 北京百度网讯科技有限公司 Page feature calculation method, device, electronic equipment, medium and program product
CN113792232B (en) * 2021-09-13 2024-02-27 北京百度网讯科技有限公司 Page feature calculation method, page feature calculation device, electronic equipment, page feature calculation medium and page feature calculation program product
CN114925308A (en) * 2022-04-29 2022-08-19 北京百度网讯科技有限公司 Website webpage processing method and device, electronic equipment and storage medium
CN114925308B (en) * 2022-04-29 2023-10-03 北京百度网讯科技有限公司 Webpage processing method and device of website, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113239256B (en) 2024-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant