CN112434250A - CMS (content management system) identification feature rule extraction method based on online website

Info

Publication number
CN112434250A
Authority
CN
China
Prior art keywords
file
site
page
static
cms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011473245.8A
Other languages
Chinese (zh)
Other versions
CN112434250B (en)
Inventor
徐振标
杨彬彬
郝强健
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui San Shi Software Technology Co ltd
Original Assignee
Anhui Sanshi Information Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Sanshi Information Technology Service Co ltd
Priority to CN202011473245.8A
Publication of CN112434250A
Application granted
Publication of CN112434250B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a CMS (content management system) identification feature rule extraction method based on online websites, comprising the following steps. S1: obtain demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and obtain a CMS keyword list. S2: crawl the static file addresses in the page content of each demonstration site by accessing its pages and links, which include but are not limited to the site home page, a 404 page produced by requesting a random character string, and the pages or links formed by appending /admin, /robots.txt, /README.md, /LICENSE.txt and the like to the site domain name, and extract from the page content the in-site page links and the static file link addresses that contain keywords. Without access to the CMS source code, representative rule files are extracted by crawling the page links of one or more demonstration sites, so that fingerprint identification rules are collected automatically, collection efficiency is improved, and rule identification accuracy is improved.

Description

CMS (content management system) identification feature rule extraction method based on online website
Technical Field
The invention relates to the field of content management, in particular to a CMS (content management system) identification feature rule extraction method based on an online website.
Background
A content management system (CMS) is a system, written in a programming language that runs on the server side, for managing and maintaining the columns, content, and templates of a website. As the Internet has developed, the number of CMS types has kept growing. Developers no longer need to build a website from scratch; they can set one up quickly by downloading a suitable open-source site-building program from the Internet, so a large number of websites on the Internet are built with a CMS. In network security, identifying which CMS program a website uses has an important bearing on security testing, and accurate CMS identification can greatly reduce the workload of the security testing stage. Common methods for identifying the web fingerprint of a website include: checking whether the home page content contains a certain keyword, checking whether a certain page contains a certain keyword, checking whether the MD5 value of a certain static file of the website equals an expected value, and so on. The traditional way of collecting web fingerprint rules is as follows: confirm that a website is built with a particular open-source site-building program; find a static file specific to that program, such as a logo picture, or a js or css file containing the brand name of the program; define the absolute URL path of that file as the feature file path of the program and the MD5 value of the file as its feature value. The feature file path, the feature value, and the brand of the site-building program together form one web fingerprint identification rule.
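As a non-limiting illustration, the rule structure described above (feature file path, feature value, program brand) might be represented as follows. This is a minimal Python sketch; the names `FingerprintRule`, `md5_hex`, and `matches`, the CMS name "ExampleCMS", and the sample path are hypothetical and do not come from the original text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FingerprintRule:
    cms_name: str  # brand of the site-building program
    path: str      # feature file path on the site
    md5: str       # feature value: expected MD5 hex digest of that file


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def matches(rule: FingerprintRule, fetched: bytes) -> bool:
    """True when the bytes fetched from rule.path have the expected MD5."""
    return md5_hex(fetched) == rule.md5


# Hypothetical sample data: the bytes stand in for a downloaded logo file.
logo = b"\x89PNG example logo bytes"
rule = FingerprintRule("ExampleCMS", "/static/images/logo.png", md5_hex(logo))

same_file = matches(rule, logo)            # identical bytes
other_file = matches(rule, b"other file")  # different bytes
```

In such a scheme a site would be attributed to the CMS only when the file at `rule.path` exists and its digest equals `rule.md5`.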
With the growing variety of CMSs on the Internet, quickly enriching the rule base used for web fingerprint identification has become key to improving identification efficiency. The traditional approach is to find feature file paths manually, which is inefficient: feature files have to be located in web page source code on the basis of experience, most feature files sit in a directory specific to the website program, and the page being viewed may not load them. It is also imprecise: the feature files that are easy to find are often not unique to the site-building program, which leads to a high false identification rate.
Therefore, as the number of CMSs keeps increasing, existing methods for extracting a CMS fingerprint identification rule base cannot meet demand, and how to efficiently collect web fingerprint rules that accurately identify a website's web fingerprint information is a problem in urgent need of a solution.
Disclosure of Invention
The technical problem to be solved by the invention is that existing methods for extracting a CMS fingerprint identification rule base cannot meet demand, and that web fingerprint rules capable of accurately identifying a website's web fingerprint information need to be collected efficiently. To this end, a CMS identification feature rule extraction method based on online websites is provided.
The invention solves the above technical problem through the following technical solution, which comprises the following steps:
S1: obtain demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and obtain a CMS keyword list;
S2: crawl the static file addresses in the page content of each demonstration site by accessing its pages and links, which include but are not limited to the site home page, a 404 page produced by requesting a random character string, and the pages or links formed by appending /admin, /robots.txt, /README.md, /LICENSE.txt and the like to the site domain name, and extract from the page content the in-site page links and the static file link addresses that contain keywords;
S3: from the static files obtained from the multiple demonstration sites, extract the pictures that are present in common with the same MD5 value and the files that contain keywords: compare the static links obtained from the demonstration sites in step S2, find the pictures located at the same path with the same MD5 value and the js and css files located at the same path that contain keyword information, and form result set R1;
S4: the demonstration sites check one another for the static file addresses of the other demonstration sites: check in turn whether the CMS feature set of each demonstration site also exists on the other demonstration sites, forming result set R2; return result sets R1 and R2, from which representative rule files and paths are then selected manually.
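As a non-limiting illustration of the addresses accessed in step S2, the following Python sketch builds the probe URLs for one demonstration site. The function name `build_probe_urls`, the dictionary layout, and the 16-character random string length are assumptions made for this example only.

```python
import secrets
from urllib.parse import urljoin

# Fixed paths named in step S2; the method says the list is not exhaustive.
PROBE_PATHS = ["/admin", "/robots.txt", "/README.md", "/LICENSE.txt"]


def build_probe_urls(site: str) -> dict:
    """Return the home page, a random-string 404 probe, and the fixed probes."""
    base = site.rstrip("/") + "/"
    probes = {"home": base}
    # A random path is assumed not to exist, so it should yield the 404 page.
    probes["404"] = urljoin(base, secrets.token_hex(8))
    for path in PROBE_PATHS:
        probes[path] = urljoin(base, path.lstrip("/"))
    return probes


# Hypothetical demonstration site address.
urls = build_probe_urls("http://demo.example")
for label, url in urls.items():
    print(label, url)
```

Each URL would then be requested and the response inspected as the method describes; the sketch only constructs the addresses and performs no network access.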
Preferably, the home page information in step S2 is acquired as follows: access the home page of the input demonstration site and obtain the page content; extract the next-level page link addresses and all js, css, and image link addresses on the home page; determine whether the js and css files contain keyword information; and determine the type of each picture and obtain its MD5 value;
the 404 page in step S2 is obtained as follows: append a random character string to the input demonstration site, domain name, or IP to form a page address, access it, and obtain the content of the returned page, the returned HTTP status code being 404;
the "/admin" page in step S2 is obtained as follows: append "/admin" to the input demonstration site to form a new page address, access it, and determine whether it opens normally with a returned HTTP status code of 200;
the "/robots.txt" file link in step S2 is obtained as follows: append "/robots.txt" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code of 200, and determine whether the file content contains keyword information;
the "/README.md" file link in step S2 is obtained as follows: append "/README.md" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code of 200, and determine whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 is obtained as follows: append "/LICENSE.txt" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code of 200, and determine whether the file content contains keyword information;
the js links (or links with the suffix .js), the css links (or links with the suffix .css), and the picture links, whose suffixes include but are not limited to jpg, png, jpeg, ico, and gif, found in the above pages are gathered into a result set of static files and stored in a first database. The stored data comprises the relative address of each static file; for image files, the MD5 value is calculated; for files such as js and css, the text content is searched for the keywords input in step S1 and, if a keyword is found, the content of the line containing the keyword is stored.
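As a non-limiting illustration of the collection described above, the following Python sketch extracts in-site static file paths from a page and builds the record to be stored for each file. The class `StaticLinkParser`, the function `make_record`, the record field names, and the same-host filter are assumptions of this example; the database itself is omitted.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

IMAGE_SUFFIXES = (".jpg", ".png", ".jpeg", ".ico", ".gif")
TEXT_SUFFIXES = (".js", ".css")


class StaticLinkParser(HTMLParser):
    """Collects in-site src/href paths that point at js, css, or image files."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            url = urlparse(urljoin(self.page_url, value))
            # Assumption: only files served by the same host are kept.
            if url.netloc == self.host and url.path.lower().endswith(
                IMAGE_SUFFIXES + TEXT_SUFFIXES
            ):
                self.paths.add(url.path)


def make_record(path, body, keywords):
    """MD5 for image files; matching lines for text files, or None if no hit."""
    if path.lower().endswith(IMAGE_SUFFIXES):
        return {"path": path, "md5": hashlib.md5(body).hexdigest()}
    text = body.decode("utf-8", errors="replace")
    hits = [line.strip() for line in text.splitlines()
            if any(k.lower() in line.lower() for k in keywords)]
    return {"path": path, "keyword_lines": hits} if hits else None


# Hypothetical page content and site address.
html = ('<link href="/static/css/main.css"><script src="js/app.js"></script>'
        '<img src="/logo.png"><a href="/about">About</a>')
parser = StaticLinkParser("http://demo.example/")
parser.feed(html)
found = sorted(parser.paths)
```

Keyword matching is case-insensitive here, which is a design choice of the sketch rather than something the method specifies.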
Preferably, the comparison in step S3 is performed as follows: each demonstration site is polled in turn, and the static files in the result set it produced in S2 are compared with the static files in the result sets the other demonstration sites produced in S2. For picture-type static files, it is determined whether the MD5 values are the same; for static text files such as js, css, and txt, it is determined whether keyword information is present. Finally, the pictures that have the same file path and the same MD5 value, and the text files that have the same file path and contain keyword information, are output.
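As a non-limiting illustration of the comparison in step S3, the following Python sketch forms result set R1 from per-site records. It assumes that a file qualifies only when it appears at the same path on every demonstration site, which is one possible reading of the text; the function name `build_r1` and the record layout are hypothetical.

```python
def build_r1(site_records):
    """site_records maps site URL -> {path: record} as collected in S2."""
    per_site = list(site_records.values())
    if len(per_site) < 2:
        return []  # nothing to compare against
    # Paths found on every site (assumption: "same path" means all sites).
    common_paths = set(per_site[0]).intersection(*per_site[1:])
    r1 = []
    for path in sorted(common_paths):
        records = [site[path] for site in per_site]
        if all("md5" in r for r in records):
            # Pictures: keep only when every site reports the same MD5.
            if len({r["md5"] for r in records}) == 1:
                r1.append({"path": path, "md5": records[0]["md5"]})
        elif all(r.get("keyword_lines") for r in records):
            # Text files: keep when every site's copy contains a keyword.
            r1.append({"path": path,
                       "keyword_lines": records[0]["keyword_lines"]})
    return r1


# Hypothetical records for two demonstration sites.
site_records = {
    "http://a.example": {
        "/logo.png": {"md5": "aaa"},
        "/js/core.js": {"keyword_lines": ["// ExampleCMS core"]},
        "/banner.jpg": {"md5": "111"},
    },
    "http://b.example": {
        "/logo.png": {"md5": "aaa"},
        "/js/core.js": {"keyword_lines": ["// ExampleCMS core"]},
        "/banner.jpg": {"md5": "222"},
    },
}
r1 = build_r1(site_records)
```

In this sample, `/banner.jpg` is dropped because its MD5 differs between the two sites, which is the kind of site-specific file the comparison is meant to filter out.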
Preferably, step S4 proceeds as follows:
SS1: aggregate the CMS feature set file paths of all demonstration sites and de-duplicate them to form set D1;
SS2: traverse each site in turn; remove from the set D1 formed in SS1 all feature set file paths belonging to the site currently being traversed; then append each of the remaining file paths in the set to the URL of that site and request access;
SS3: examine the value returned when each spliced URL is accessed, and record the URLs that can be accessed normally with a returned status code of 200, these URLs comprising picture addresses and static files such as js, css, and txt;
for a picture address, the MD5 value of the picture is calculated and compared with the MD5 value at the other sites; if the MD5 values are the same, the relative address of the picture file and the MD5 value are recorded;
for static files such as js, css, and txt, it is determined whether the file content contains keyword information; if so, the relative address of the text file, the keyword, and the content of the line containing the keyword are recorded;
the results recorded in step SS3 are merged to form result set R2; result sets R1 and R2 are returned, and representative rule files and paths are then selected from them.
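As a non-limiting illustration of steps SS1 to SS3, the following Python sketch forms result set R2. The HTTP request is passed in as a function so that the sketch runs offline; `build_r2`, `fake_fetch`, the record layout, and the choice to record only the first matching line are assumptions of this example.

```python
import hashlib

IMAGE_SUFFIXES = (".jpg", ".png", ".jpeg", ".ico", ".gif")


def build_r2(site_records, fetch, keywords):
    """fetch(url) -> (status_code, body_bytes); site_records as built in S2."""
    d1 = set()                                  # SS1: de-duplicated path union
    for records in site_records.values():
        d1.update(records)
    r2 = []
    for site, records in site_records.items():  # SS2: traverse each site
        for path in sorted(d1 - set(records)):  # paths not yet seen on it
            status, body = fetch(site.rstrip("/") + path)
            if status != 200:                   # SS3: keep normal responses
                continue
            if path.lower().endswith(IMAGE_SUFFIXES):
                digest = hashlib.md5(body).hexdigest()
                known = {recs[path]["md5"] for recs in site_records.values()
                         if path in recs and "md5" in recs[path]}
                if digest in known:             # same MD5 as on another site
                    r2.append({"site": site, "path": path, "md5": digest})
                continue
            text = body.decode("utf-8", errors="replace")
            for line in text.splitlines():
                hit = next((k for k in keywords
                            if k.lower() in line.lower()), None)
                if hit:                         # record first matching line
                    r2.append({"site": site, "path": path,
                               "keyword": hit, "line": line.strip()})
                    break
    return r2


# Hypothetical offline responses standing in for real HTTP requests.
logo = b"logo-bytes"
pages = {
    "http://b.example/logo.png": (200, logo),
    "http://a.example/js/core.js": (200, b"/* ExampleCMS core */\nvar x = 1;"),
}


def fake_fetch(url):
    return pages.get(url, (404, b""))


site_records = {
    "http://a.example": {"/logo.png": {"md5": hashlib.md5(logo).hexdigest()}},
    "http://b.example": {"/js/core.js": {"keyword_lines": ["/* ExampleCMS core */"]}},
}
r2 = build_r2(site_records, fake_fetch, ["examplecms"])
```

Here each site is probed only for the paths that the other site contributed, so files that a crawled page happened not to load can still be discovered.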
Preferably, the keyword information is the CMS name or another keyword associated with the CMS.
Compared with the prior art, the invention has the following advantages: without access to the CMS source code, the CMS identification feature rule extraction method based on online websites extracts representative rule files by crawling the page links of one or more demonstration sites, so that fingerprint identification rules are collected automatically, collection efficiency is improved, and rule identification accuracy is improved.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a flowchart of step S2 of the present invention;
FIG. 3 is a flowchart of step S3 of the present invention;
FIG. 4 is a flowchart of step S4 of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
As shown in FIG. 1 to FIG. 4, this embodiment provides a technical solution: a CMS identification feature rule extraction method based on online websites, comprising the following steps:
S1: obtain demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and obtain a CMS keyword list;
S2: crawl the static file addresses in the page content of each demonstration site by accessing its pages and links, which include but are not limited to the site home page, a 404 page produced by requesting a random character string, and the pages or links formed by appending /admin, /robots.txt, /README.md, /LICENSE.txt and the like to the site domain name, and extract from the page content the in-site page links and the static file link addresses that contain keywords;
S3: from the static files obtained from the multiple demonstration sites, extract the pictures that are present in common with the same MD5 value and the files that contain keywords: compare the static links obtained from the demonstration sites in step S2, find the pictures located at the same path with the same MD5 value and the js and css files located at the same path that contain keyword information, and form result set R1;
S4: the demonstration sites check one another for the static file addresses of the other demonstration sites: check in turn whether the CMS feature set of each demonstration site also exists on the other demonstration sites, forming result set R2; return result sets R1 and R2, from which representative rule files and paths are then selected manually.
The home page information in step S2 is acquired as follows: access the home page of the input demonstration site and obtain the page content; extract the next-level page link addresses and all js, css, and image link addresses on the home page; determine whether the js and css files contain keyword information; and determine the type of each picture and obtain its MD5 value;
the 404 page in step S2 is obtained as follows: append a random character string to the input demonstration site, domain name, or IP to form a page address, access it, and obtain the content of the returned page, the returned HTTP status code being 404;
the "/admin" page in step S2 is obtained as follows: append "/admin" to the input demonstration site to form a new page address, access it, and determine whether it opens normally with a returned HTTP status code of 200;
the "/robots.txt" file link in step S2 is obtained as follows: append "/robots.txt" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code of 200, and determine whether the file content contains keyword information;
the "/README.md" file link in step S2 is obtained as follows: append "/README.md" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code of 200, and determine whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 is obtained as follows: append "/LICENSE.txt" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code of 200, and determine whether the file content contains keyword information;
the js links (or links with the suffix .js), the css links (or links with the suffix .css), and the picture links, whose suffixes include but are not limited to jpg, png, jpeg, ico, and gif, found in the above pages are gathered into a result set of static files and stored in a first database. The stored data comprises the relative address of each static file; for image files, the MD5 value is calculated; for files such as js and css, the text content is searched for the keywords input in step S1 and, if a keyword is found, the content of the line containing the keyword is stored.
The comparison in step S3 is performed as follows: each demonstration site is polled in turn, and the static files in the result set it produced in S2 are compared with the static files in the result sets the other demonstration sites produced in S2. For picture-type static files, it is determined whether the MD5 values are the same; for static text files such as js, css, and txt, it is determined whether keyword information is present. Finally, the pictures that have the same file path and the same MD5 value, and the text files that have the same file path and contain keyword information, are output.
Step S4 proceeds as follows:
SS1: aggregate the CMS feature set file paths of all demonstration sites and de-duplicate them to form set D1;
SS2: traverse each site in turn; remove from the set D1 formed in SS1 all feature set file paths belonging to the site currently being traversed; then append each of the remaining file paths in the set to the URL of that site and request access;
SS3: examine the value returned when each spliced URL is accessed, and record the URLs that can be accessed normally with a returned status code of 200, these URLs comprising picture addresses and static files such as js, css, and txt;
for a picture address, the MD5 value of the picture is calculated and compared with the MD5 value at the other sites; if the MD5 values are the same, the relative address of the picture file and the MD5 value are recorded;
for static files such as js, css, and txt, it is determined whether the file content contains keyword information; if so, the relative address of the text file, the keyword, and the content of the line containing the keyword are recorded;
the results recorded in step SS3 are merged to form result set R2; result sets R1 and R2 are returned, and representative rule files and paths are then selected from them.
The keyword information is the CMS name or another keyword associated with the CMS.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (5)

1. A CMS identification feature rule extraction method based on online websites, characterized by comprising the following steps:
S1: obtain demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and obtain a CMS keyword list;
S2: crawl the static file addresses in the page content of each demonstration site by accessing its pages and links, which include but are not limited to the site home page, a 404 page produced by requesting a random character string, and the pages or links formed by appending /admin, /robots.txt, /README.md, /LICENSE.txt and the like to the site domain name, and extract from the page content the in-site page links and the static file link addresses that contain keywords;
S3: from the static files obtained from the multiple demonstration sites, extract the pictures that are present in common with the same MD5 value and the files that contain keywords: compare the static links obtained from the demonstration sites in step S2, find the pictures located at the same path with the same MD5 value and the js and css files located at the same path that contain keyword information, and form result set R1;
S4: the demonstration sites check one another for the static file addresses of the other demonstration sites: check in turn whether the CMS feature set of each demonstration site also exists on the other demonstration sites, forming result set R2; return result sets R1 and R2, from which representative rule files and paths are then selected manually.
2. The CMS identification feature rule extraction method based on online websites according to claim 1, characterized in that: the home page information in step S2 is acquired as follows: access the home page of the input demonstration site and obtain the page content; extract the next-level page link addresses and all js, css, and image link addresses on the home page; determine whether the js and css files contain keyword information; and determine the type of each picture and obtain its MD5 value;
the 404 page in step S2 is obtained as follows: append a random character string to the input demonstration site, domain name, or IP to form a page address, access it, and obtain the content of the returned page, the returned HTTP status code being a preset value;
the "/admin" page in step S2 is obtained as follows: append "/admin" to the input demonstration site to form a new page address, access it, and determine whether it opens normally with a returned HTTP status code equal to a preset value;
the "/robots.txt" file link in step S2 is obtained as follows: append "/robots.txt" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code equal to a preset value, and determine whether the file content contains keyword information;
the "/README.md" file link in step S2 is obtained as follows: append "/README.md" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code equal to a preset value, and determine whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 is obtained as follows: append "/LICENSE.txt" to the input demonstration site, domain name, or IP to form a static file link address, access it, determine whether the file opens normally with a returned HTTP status code equal to a preset value, and determine whether the file content contains keyword information;
the js links (or links with the suffix .js), the css links (or links with the suffix .css), and the picture links, whose suffixes include but are not limited to jpg, png, jpeg, ico, and gif, found in the above pages are gathered into a result set of static files and stored in a first database, wherein the stored data comprises the relative address of each static file; for image files, the MD5 value is calculated; for files such as js and css, the text content is searched for the keywords input in step S1 and, if a keyword is found, the content of the line containing the keyword is stored.
3. The CMS identification feature rule extraction method based on online websites according to claim 1, characterized in that: the comparison in step S3 is performed as follows: each demonstration site is polled in turn, and the static files in the result set it produced in S2 are compared with the static files in the result sets the other demonstration sites produced in S2, wherein for picture-type static files it is determined whether the MD5 values are the same, and for static text files such as js, css, and txt it is determined whether keyword information is present; finally, the pictures that have the same file path and the same MD5 value, and the text files that have the same file path and contain keyword information, are output.
4. The CMS identification feature rule extraction method based on online websites according to claim 1, characterized in that: step S4 proceeds as follows:
SS1: aggregate the CMS feature set file paths of all demonstration sites and de-duplicate them to form set D1;
SS2: traverse each site in turn; remove from the set D1 formed in SS1 all feature set file paths belonging to the site currently being traversed; then append each of the remaining file paths in the set to the URL of that site and request access;
SS3: examine the value returned when each spliced URL is accessed, and record the URLs that can be accessed normally with a returned status code equal to a preset value, these URLs comprising picture addresses and static files such as js, css, and txt;
for a picture address, the MD5 value of the picture is calculated and compared with the MD5 value at the other sites; if the MD5 values are the same, the relative address of the picture file and the MD5 value are recorded;
for static files such as js, css, and txt, it is determined whether the file content contains keyword information; if so, the relative address of the text file, the keyword, and the content of the line containing the keyword are recorded;
the results recorded in step SS3 are merged to form result set R2; result sets R1 and R2 are returned, and representative rule files and paths are then selected from them.
5. The CMS identification feature rule extraction method based on online websites according to claim 1, characterized in that: the keyword information is the CMS name or another keyword associated with the CMS.
CN202011473245.8A 2020-12-15 2020-12-15 CMS (content management system) identification feature rule extraction method based on online website Active CN112434250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011473245.8A CN112434250B (en) 2020-12-15 2020-12-15 CMS (content management system) identification feature rule extraction method based on online website

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011473245.8A CN112434250B (en) 2020-12-15 2020-12-15 CMS (content management system) identification feature rule extraction method based on online website

Publications (2)

Publication Number Publication Date
CN112434250A (en) 2021-03-02
CN112434250B (en) 2022-07-12

Family

ID=74691629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011473245.8A Active CN112434250B (en) 2020-12-15 2020-12-15 CMS (content management system) identification feature rule extraction method based on online website

Country Status (1)

Country Link
CN (1) CN112434250B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420818A (en) * 2021-06-27 2021-09-21 杭州迪普科技股份有限公司 Content management system identification method and device
CN114969603A (en) * 2022-05-27 2022-08-30 中移互联网有限公司 5G message-based picture acquisition and picture generation method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1549012A1 (en) * 2003-12-24 2005-06-29 DataCenterTechnologies N.V. Method and system for identifying the content of files in a network
US20080133540A1 (en) * 2006-12-01 2008-06-05 Websense, Inc. System and method of analyzing web addresses
CN104978423A (en) * 2015-06-30 2015-10-14 北京奇虎科技有限公司 Website type detection method and apparatus
US20170277804A1 (en) * 2016-03-23 2017-09-28 Tata Consultancy Services Limited Method and system for selecting sample set for assessing the accessibility of a website
CN109376291A (en) * 2018-11-08 2019-02-22 杭州安恒信息技术股份有限公司 A kind of method and device of the website fingerprint information scanning based on web crawlers
CN110489701A (en) * 2019-08-19 2019-11-22 安徽三实信息技术服务有限公司 Extract the method, apparatus and CMS recognition methods of CMS identification feature
US20200004792A1 (en) * 2018-06-29 2020-01-02 National Taiwan Normal University Automated website data collection method
CN110825941A (en) * 2019-10-17 2020-02-21 北京天融信网络安全技术有限公司 Content management system identification method, device and storage medium
CN111597490A (en) * 2020-05-21 2020-08-28 深圳前海微众银行股份有限公司 Web fingerprint identification method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN112434250B (en) 2022-07-12

Similar Documents

Publication Publication Date Title
US7139756B2 (en) System and method for detecting duplicate and similar documents
JP4097602B2 (en) Information analysis method and apparatus
US8645385B2 (en) System and method for automating categorization and aggregation of content from network sites
US20070143317A1 (en) Mechanism for managing facts in a fact repository
US20100169311A1 (en) Approaches for the unsupervised creation of structural templates for electronic documents
US20180060415A1 (en) Language tag management on international data storage
US8631097B1 (en) Methods and systems for finding a mobile and non-mobile page pair
US8560518B2 (en) Method and apparatus for building sales tools by mining data from websites
WO2006103392A1 (en) Content adaptation
CN110969022B (en) Semantic determining method and related equipment
US10372718B2 (en) Systems and methods for enterprise data search and analysis
CN112434250B (en) CMS (content management system) identification feature rule extraction method based on online website
CN112445997A (en) Method and device for extracting CMS multi-version identification feature rule
TW201415254A (en) Method and system for recommending semantic annotations
CN110795397B (en) Automatic identification method for catalogue and file type of geological data packet
CN111224923A (en) Detection method, device and system for counterfeit websites
JPWO2003060764A1 (en) Information retrieval system
KR19990070968A (en) How to Search and Database Your Internet Resources
CN113626558A (en) Intelligent recommendation-based field standardization method and system
US20110022563A1 (en) Document display system, related document display method, and program
CN109948015B (en) Meta search list result extraction method and system
JP4649036B2 (en) Category reporting method, record reporting method, search service device by search server
JP2010272006A (en) Relation extraction apparatus, relation extraction method and program
JP5380874B2 (en) Information retrieval method, program and apparatus
KR100659370B1 (en) Method for constructing a document database and method for searching information by matching thesaurus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240229

Address after: 6/F, Building F2, Xingmengyuan Scientific Research, No. 198 Mingzhu Road, High tech Zone, Hefei City, Anhui Province, 230000

Patentee after: ANHUI SAN SHI SOFTWARE TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: Room 408, building a, 5F Pioneer Park, 118 science Avenue, high tech Zone, Hefei, Anhui 230000

Patentee before: ANHUI SANSHI INFORMATION TECHNOLOGY SERVICE CO.,LTD.

Country or region before: China
