CN112434250B - CMS (content management system) identification feature rule extraction method based on online website - Google Patents

Publication number
CN112434250B
Authority
CN
China
Prior art keywords
file
site
page
static
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011473245.8A
Other languages
Chinese (zh)
Other versions
CN112434250A (en)
Inventor
徐振标
杨彬彬
郝强健
王超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui San Shi Software Technology Co ltd
Original Assignee
Anhui Sanshi Information Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Sanshi Information Technology Service Co ltd
Priority to CN202011473245.8A
Publication of CN112434250A
Application granted
Publication of CN112434250B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a CMS (content management system) identification feature rule extraction method based on online websites, comprising the following steps. S1: obtain demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and obtain a CMS keyword list. S2: crawl the static file addresses in the page content of each demonstration site by accessing the pages and links of the site, which include but are not limited to the site home page, a 404 page requested with a random string, and the pages or links formed by appending /admin, /robots.txt, /README.md, /LICENSE.txt and the like to the site domain name; then extract from the page content the in-site page links and the static file link addresses containing the keywords. Without access to the CMS source code, representative rule files are extracted by crawling the page links of one or more demonstration sites, which automates the collection of fingerprint identification rules and improves both collection efficiency and rule identification accuracy.

Description

CMS (content management system) recognition feature rule extraction method based on online website
Technical Field
The invention relates to the field of content management, in particular to a CMS (content management system) identification feature rule extraction method based on an online website.
Background
A content management system (CMS) is a system, written in a programming language that runs on the server side, for managing and maintaining the columns, content, and templates of a website. As the Internet has developed, the number of CMS types has kept growing. Developers no longer need to build a website from scratch: downloading a suitable open-source website-building program from the Internet is enough to set up a site quickly, so a large number of websites on the Internet are built with a CMS. In network security, knowing which CMS program a website uses has an important influence on security testing, and accurate CMS identification can greatly reduce the workload of the security testing stage.
Common methods for identifying the web fingerprint of a website include: checking whether the home page content contains a certain keyword, checking whether a certain page contains a certain keyword, checking whether the md5 of a certain static file of the website equals an expected value, and the like.
The traditional method for collecting web fingerprint rules is as follows: determine that a certain website runs a certain open-source website-building program, and find a static file that is specific to that program, for example a logo picture, or a js or css file that contains the brand name. The absolute URL path of that file is defined as the feature file path of the program, and the md5 value of the file as its feature value. The feature file path, the feature value, and the brand of the website-building program together form one web fingerprint identification rule.
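The rule triple described above (feature file path, feature value, brand of the website-building program) can be pictured as a small record plus an MD5 comparison. The following is a minimal sketch only; the class name `FingerprintRule`, the helper names, the brand "ExampleCMS", and the path are hypothetical and do not come from the patent.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FingerprintRule:
    cms_name: str      # brand of the website-building program (hypothetical value below)
    feature_path: str  # absolute URL path of the feature file
    md5: str           # expected MD5 hex digest of that file (the feature value)


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of a downloaded file body."""
    return hashlib.md5(data).hexdigest()


def matches(rule: FingerprintRule, fetched_body: bytes) -> bool:
    """True when the body fetched from rule.feature_path has the expected MD5."""
    return md5_hex(fetched_body) == rule.md5


# Stand-in bytes for a logo downloaded from a known site of the CMS.
logo = b"fake-logo-bytes"
rule = FingerprintRule("ExampleCMS", "/static/img/logo.png", md5_hex(logo))

same_file = matches(rule, logo)            # identical bytes
other_file = matches(rule, b"other-bytes")  # a different file at the same path
```

In practice the body would be fetched from the target site at `rule.feature_path`; the fetch is left out here so the sketch stays self-contained.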
As the variety of CMSs on the Internet grows, quickly enriching the rule base used for web fingerprint identification becomes the key to improving identification efficiency. The traditional method is to find feature file paths manually, which has two drawbacks. Efficiency is very low: feature files must be located in the webpage source code based on experience, most of them sit in a program-specific directory of the website, and the page being viewed may not load them at all. Precision is poor: the feature files that are easy to find are not unique to the website-building program, so the false identification rate is high.
Therefore, as the number of CMSs keeps growing, existing methods for extracting a CMS fingerprint identification rule base can no longer meet demand, and collecting web fingerprint rules efficiently, so that the web fingerprint information of a website can be identified accurately, is a problem that urgently needs to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is that existing methods for extracting a CMS fingerprint identification rule base cannot meet demand, so that an efficient way of collecting web fingerprint rules for accurately identifying the web fingerprint information of a website is urgently needed. To this end, a CMS identification feature rule extraction method based on online websites is provided.
The invention solves the above technical problem through the following technical scheme, which comprises the following steps:
S1: obtain demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and obtain a CMS keyword list;
S2: crawl the static file addresses in the page content of each demonstration site: access the pages and links of the demonstration site, which include but are not limited to the site home page, a page requested with a random string (404), and the pages or links formed by appending /admin, /robots.txt, /README.md, and /LICENSE.txt to the site domain name, and extract from the page content the in-site page links and the static file link addresses containing the keywords;
S3: from the static files obtained from the multiple demonstration sites, extract the pictures that co-exist with the same md5 value and the files that contain keywords: compare the static links obtained in step S2 for the demonstration sites passed in from step S1, find the picture files with the same md5 value and the js and css files that have the same path and contain keyword information, and form result set R1;
S4: the demonstration sites check one another for the static file addresses of the other demonstration sites: check in turn whether the CMS feature set of each demonstration site exists under the other demonstration sites, forming result set R2; return result sets R1 and R2, in which representative rule files and paths are then identified manually.
Preferably, the home page information in step S2 is obtained as follows: access the home page of the input demonstration site and obtain the page content; extract the next-level page link addresses under the home page and all js, css, and image link addresses; determine whether the js and css files contain keyword information; and determine the picture type and obtain the md5 value;
the 404 page in step S2 is obtained as follows: append a string of random characters to the input demonstration site, domain name, or IP to form a page address, and access it to obtain the content of the returned page, the network status code being 404;
the "/admin" page in step S2 is obtained as follows: append "/admin" to the input demonstration site to form a new page address and access it, determining whether the page opens normally with a returned network status code of 200;
the "/robots.txt" file link in step S2 is obtained as follows: append "/robots.txt" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code of 200 and whether the file content contains keyword information;
the "/README.md" file link in step S2 is obtained as follows: append "/README.md" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code of 200 and whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 is obtained as follows: append "/LICENSE.txt" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code of 200 and whether the file content contains keyword information;
the js links (or links with the suffix .js), the css links (or links with the suffix .css), and the picture links whose suffixes include but are not limited to .jpg, .png, .jpeg, .ico, and .gif found in the above pages are gathered into a result set of static files, which is stored in a first database. The stored data includes the relative address of each static file; for image files the MD5 value is calculated; for js and css files the text content is searched for the keywords input in step S1, and, if a keyword is found, the content of the line containing it is stored.
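The link extraction and record building of step S2 can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the patented implementation: the record layout (`path`, `kind`, `md5`, `keyword_lines`), the helper names, the case-insensitive keyword match, and the sample HTML are all hypothetical, and the first database is replaced by plain dictionaries.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urlparse

IMAGE_SUFFIXES = (".jpg", ".png", ".jpeg", ".ico", ".gif")
TEXT_SUFFIXES = (".js", ".css")


class StaticLinkParser(HTMLParser):
    """Collects every href/src attribute value found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def classify(link):
    """Return 'image', 'text' (js/css) or 'page' based on the path suffix."""
    path = urlparse(link).path.lower()  # ignores any ?query part
    if path.endswith(IMAGE_SUFFIXES):
        return "image"
    if path.endswith(TEXT_SUFFIXES):
        return "text"
    return "page"


def build_record(path, body, keywords):
    """Build the stored record for one static file, or None for ordinary pages."""
    kind = classify(path)
    if kind == "image":
        return {"path": path, "kind": kind, "md5": hashlib.md5(body).hexdigest()}
    if kind == "text":
        text = body.decode("utf-8", errors="replace")
        # Assumption: keywords are matched case-insensitively, line by line.
        hits = [
            (kw, line.strip())
            for line in text.splitlines()
            for kw in keywords
            if kw.lower() in line.lower()
        ]
        return {"path": path, "kind": kind, "keyword_lines": hits}
    return None


html = (
    '<link href="/static/css/main.css">'
    '<script src="/static/js/app.js?v=2"></script>'
    '<img src="/static/img/logo.png"><a href="/news/">News</a>'
)
parser = StaticLinkParser()
parser.feed(html)
kinds = {link: classify(link) for link in parser.links}
record = build_record(
    "/static/js/app.js", b"// ExampleCMS core\nvar a = 1;\n", ["examplecms"]
)
```

Links classified as `page` would be queued as next-level pages to crawl, while `image` and `text` links would be downloaded and passed to `build_record`.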
Preferably, the comparison in step S3 proceeds as follows: poll each demonstration site in turn, compare the static files in the result set it produced in S2 with the static files in the result sets that the other demonstration sites produced in S2, determine whether the MD5 values of picture-type static files are the same and whether the js, css, and txt static text files contain keyword information, and finally output the pictures with the same MD5 value and the text files that share the same file path and contain keyword information.
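One possible reading of this comparison is a set intersection over the per-site result sets. The sketch below assumes a hypothetical record layout (`path`, `kind`, `md5`, `keyword_lines`) and additionally assumes that a file counts as shared only when it is found on every demonstration site; the text itself only says the files are compared with those of the other sites.

```python
def build_r1(site_results):
    """Form result set R1 from a mapping {site: [records]}."""
    sites = list(site_results.values())
    if not sites:
        return {"images": [], "texts": []}

    # Pictures: MD5 values that occur on every site.
    common_md5 = set.intersection(
        *[{r["md5"] for r in records if r["kind"] == "image"} for records in sites]
    )
    # Text files: paths that occur on every site and contain keyword hits there.
    common_text_paths = set.intersection(
        *[
            {r["path"] for r in records if r["kind"] == "text" and r["keyword_lines"]}
            for records in sites
        ]
    )

    first = sites[0]
    images = sorted(
        (r["path"], r["md5"])
        for r in first
        if r["kind"] == "image" and r["md5"] in common_md5
    )
    return {"images": images, "texts": sorted(common_text_paths)}


# Hypothetical S2 output for two demonstration sites of the same CMS.
site_a = [
    {"path": "/static/img/logo.png", "kind": "image", "md5": "aaa"},
    {"path": "/uploads/banner.jpg", "kind": "image", "md5": "bbb"},
    {"path": "/static/js/core.js", "kind": "text",
     "keyword_lines": [("examplecms", "/* ExampleCMS */")]},
]
site_b = [
    {"path": "/static/img/logo.png", "kind": "image", "md5": "aaa"},
    {"path": "/uploads/banner.jpg", "kind": "image", "md5": "ccc"},
    {"path": "/static/js/core.js", "kind": "text",
     "keyword_lines": [("examplecms", "/* ExampleCMS */")]},
    {"path": "/static/css/theme.css", "kind": "text", "keyword_lines": []},
]
r1 = build_r1({"http://a.example": site_a, "http://b.example": site_b})
```

Here the site-specific banner (different MD5 on each site) and the stylesheet without keyword hits drop out, leaving only the shared logo and the shared script as rule candidates.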
Preferably, the specific process of step S4 is as follows:
SS1: aggregate the CMS feature set file paths of all demonstration sites and de-duplicate them to form set D1;
SS2: traverse each site in turn; remove from the set D1 formed in SS1 all feature set file paths that belong to the site currently being traversed, then append each remaining file path in the set to the URL of that site and request it;
SS3: check the response to each spliced URL and record the URLs that can be accessed normally with a returned status code of 200; these URLs comprise picture addresses and js, css, and txt static files;
for a picture address, calculate the MD5 value of the picture and check whether it is the same as the MD5 value on the other sites; if the MD5 values are all the same, record the relative address of the picture file and the MD5 value;
for js, css, and txt static files, determine whether the file content contains keyword information; if so, record the relative address of the text file, the keyword, and the content of the line where the keyword appears;
the results recorded in step SS3 are merged to form result set R2; result sets R1 and R2 are returned, and representative rule files and paths are then identified in them.
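Steps SS1 to SS3 can be sketched as a set difference followed by probing. The code below is a hedged illustration: the `features` layout (`{site: {path: md5-or-None}}`), the function name `build_r2`, and the injected `fetch` callable returning `(status, body)` are assumptions made so the example runs without network access; a real crawler would substitute an HTTP client.

```python
import hashlib

IMAGE_SUFFIXES = (".jpg", ".png", ".jpeg", ".ico", ".gif")


def build_r2(features, fetch, keywords):
    """features: {site: {path: md5 or None}}; fetch(url) -> (status, body)."""
    # SS1: aggregate and de-duplicate all feature paths into D1.
    d1 = set()
    for paths in features.values():
        d1.update(paths)

    r2 = []
    for site, own in features.items():
        # SS2: probe only the paths this site did not already yield.
        for path in sorted(d1.difference(own)):
            status, body = fetch(site.rstrip("/") + path)
            # SS3: keep only URLs that respond with status code 200.
            if status != 200:
                continue
            if path.lower().endswith(IMAGE_SUFFIXES):
                digest = hashlib.md5(body).hexdigest()
                known = {f[path] for f in features.values() if path in f}
                if known == {digest}:  # same MD5 as on the other sites
                    r2.append({"site": site, "path": path, "md5": digest})
            else:
                text = body.decode("utf-8", errors="replace")
                for line in text.splitlines():
                    for kw in keywords:
                        if kw.lower() in line.lower():
                            r2.append({"site": site, "path": path,
                                       "keyword": kw, "line": line.strip()})
    return r2


# Hypothetical data: each site exposes one file the other did not link to.
logo = b"logo-bytes"
features = {
    "http://a.example": {"/static/img/logo.png": hashlib.md5(logo).hexdigest(),
                         "/static/js/core.js": None},
    "http://b.example": {"/static/js/core.js": None,
                         "/static/css/admin.css": None},
}
pages = {
    "http://b.example/static/img/logo.png": (200, logo),
    "http://a.example/static/css/admin.css": (200, b"/* ExampleCMS admin */"),
}
r2 = build_r2(features, lambda url: pages.get(url, (404, b"")), ["examplecms"])
```

The point of this step is that a feature file may exist on a site even though none of its crawled pages link to it, so probing recovers candidates that the comparison in S3 alone would miss.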
Preferably, the keyword information is the CMS name or another keyword associated with the CMS.
Compared with the prior art, the invention has the following advantages: without access to the CMS source code, the method extracts representative rule files by crawling the page links of one or more demonstration sites, which automates the collection of fingerprint identification rules and improves both collection efficiency and rule identification accuracy.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a flowchart of step S2 of the present invention;
FIG. 3 is a flowchart of step S3 of the present invention;
FIG. 4 is a flowchart of step S4 of the present invention.
Detailed Description
The following embodiment gives a detailed implementation and specific operating procedure of the present invention, but the scope of the present invention is not limited to this embodiment.
As shown in FIGS. 1 to 4, the present embodiment provides a technical solution: a CMS identification feature rule extraction method based on online websites, comprising the following steps:
S1: obtain demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and obtain a CMS keyword list;
S2: crawl the static file addresses in the page content of each demonstration site: access the pages and links of the demonstration site, which include but are not limited to the site home page, a page requested with a random string (404), and the pages or links formed by appending /admin, /robots.txt, /README.md, and /LICENSE.txt to the site domain name, and extract from the page content the in-site page links and the static file link addresses containing the keywords;
S3: from the static files obtained from the multiple demonstration sites, extract the pictures that co-exist with the same md5 value and the files that contain keywords: compare the static links obtained in step S2 for the demonstration sites passed in from step S1, find the picture files with the same md5 value and the js and css files that have the same path and contain keyword information, and form result set R1;
S4: the demonstration sites check one another for the static file addresses of the other demonstration sites: check in turn whether the CMS feature set of each demonstration site exists under the other demonstration sites, forming result set R2; return result sets R1 and R2, in which representative rule files and paths are then identified manually.
The home page information in step S2 is obtained as follows: access the home page of the input demonstration site and obtain the page content; extract the next-level page link addresses under the home page and all js, css, and image link addresses; determine whether the js and css files contain keyword information; and determine the picture type and obtain the md5 value;
the 404 page in step S2 is obtained as follows: append a string of random characters to the input demonstration site, domain name, or IP to form a page address, and access it to obtain the content of the returned page, the network status code being 404;
the "/admin" page in step S2 is obtained as follows: append "/admin" to the input demonstration site to form a new page address and access it, determining whether the page opens normally with a returned network status code of 200;
the "/robots.txt" file link in step S2 is obtained as follows: append "/robots.txt" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code of 200 and whether the file content contains keyword information;
the "/README.md" file link in step S2 is obtained as follows: append "/README.md" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code of 200 and whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 is obtained as follows: append "/LICENSE.txt" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code of 200 and whether the file content contains keyword information;
the js links (or links with the suffix .js), the css links (or links with the suffix .css), and the picture links whose suffixes include but are not limited to .jpg, .png, .jpeg, .ico, and .gif found in the above pages are gathered into a result set of static files, which is stored in a first database. The stored data includes the relative address of each static file; for image files the MD5 value is calculated; for js and css files the text content is searched for the keywords input in step S1, and, if a keyword is found, the content of the line containing it is stored.
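The set of addresses accessed in step S2 (home page, random-string 404 page, and the four fixed paths) might be assembled as in the sketch below. The function names, the probe labels, and the choice of a 16-character random hex string are assumptions for illustration; the expected status codes 404 and 200 follow the description above.

```python
import secrets
from urllib.parse import urljoin

FIXED_PROBES = ("/admin", "/robots.txt", "/README.md", "/LICENSE.txt")
EXPECTED_STATUS = {"not_found": 404}  # every other probe is expected to return 200


def probe_urls(site_root):
    """Return {probe label: URL} for one demonstration site."""
    random_path = "/" + secrets.token_hex(8)  # 16 random hex characters
    probes = {
        "home": urljoin(site_root, "/"),
        "not_found": urljoin(site_root, random_path),
    }
    for path in FIXED_PROBES:
        probes[path] = urljoin(site_root, path)
    return probes


def status_ok(label, status):
    """True when a probe returned the status code the method expects for it."""
    return status == EXPECTED_STATUS.get(label, 200)


urls = probe_urls("http://demo.example")
```

Each URL would then be fetched, and the returned page content passed on to link extraction and keyword matching only when `status_ok` holds for that probe.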
The comparison in step S3 proceeds as follows: poll each demonstration site in turn, compare the static files in the result set it produced in S2 with the static files in the result sets that the other demonstration sites produced in S2, determine whether the MD5 values of picture-type static files are the same and whether the js, css, and txt static text files contain keyword information, and finally output the pictures with the same MD5 value and the text files that share the same file path and contain keyword information.
The specific process of step S4 is as follows:
SS1: aggregate the CMS feature set file paths of all demonstration sites and de-duplicate them to form set D1;
SS2: traverse each site in turn; remove from the set D1 formed in SS1 all feature set file paths that belong to the site currently being traversed, then append each remaining file path in the set to the URL of that site and request it;
SS3: check the response to each spliced URL and record the URLs that can be accessed normally with a returned status code of 200; these URLs comprise picture addresses and js, css, and txt static files;
for a picture address, calculate the MD5 value of the picture and check whether it is the same as the MD5 value on the other sites; if the MD5 values are the same, record the relative address of the picture file and the MD5 value;
for js, css, and txt static files, determine whether the file content contains keyword information; if so, record the relative address of the text file, the keyword, and the content of the line where the keyword appears;
the results recorded in step SS3 are merged to form result set R2; result sets R1 and R2 are returned, and representative rule files and paths are then identified in them.
The keyword information is the CMS name or another keyword associated with the CMS.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (5)

1. A CMS identification feature rule extraction method based on an online website, characterized by comprising the following steps:
S1: obtaining demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and obtaining a CMS keyword list;
S2: crawling the static file addresses in the page content of each demonstration site: accessing the pages and links of the demonstration site, which include but are not limited to the site home page, a page requested with a random string (404), and the pages or links formed by appending /admin, /robots.txt, /README.md, and /LICENSE.txt to the site domain name, and extracting from the page content the in-site page links and the static file link addresses containing the keywords;
S3: from the static files obtained from the multiple demonstration sites, extracting the pictures that co-exist with the same md5 value and the files that contain keywords: comparing the static links obtained in step S2 for the demonstration sites passed in from step S1, finding the picture files with the same md5 value and the js and css files that have the same path and contain keyword information, and forming result set R1;
S4: the demonstration sites checking one another for the static file addresses of the other demonstration sites: checking in turn whether the CMS feature set of each demonstration site exists under the other demonstration sites, forming result set R2; returning result sets R1 and R2, in which representative rule files and paths are then identified manually.
2. The CMS identification feature rule extraction method based on an online website according to claim 1, characterized in that: the home page information in step S2 is obtained as follows: access the home page of the input demonstration site and obtain the page content; extract the next-level page link addresses under the home page and all js, css, and image link addresses; determine whether the js and css files contain keyword information; and determine the picture type and obtain the md5 value;
the 404 page in step S2 is obtained as follows: append a string of random characters to the input demonstration site, domain name, or IP to form a page address, and access it to obtain the content of the returned page, the network status code being a preset value;
the "/admin" page in step S2 is obtained as follows: append "/admin" to the input demonstration site to form a new page address and access it, determining whether the page opens normally with a returned network status code equal to a preset value;
the "/robots.txt" file link in step S2 is obtained as follows: append "/robots.txt" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code equal to a preset value and whether the file content contains keyword information;
the "/README.md" file link in step S2 is obtained as follows: append "/README.md" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code equal to a preset value and whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 is obtained as follows: append "/LICENSE.txt" to the input demonstration site, domain name, or IP to form a static file link address and access it, determining whether the file opens normally with a returned network status code equal to a preset value and whether the file content contains keyword information;
the js links (or links with the suffix .js), the css links (or links with the suffix .css), and the picture links whose suffixes include but are not limited to .jpg, .png, .jpeg, .ico, and .gif found in the above pages are gathered into a result set of static files, which is stored in a first database, wherein the stored data includes the relative address of each static file; for image files the MD5 value is calculated; for js and css files the text content is searched for the keywords input in step S1, and, if a keyword is found, the content of the line containing it is stored.
3. The CMS identification feature rule extraction method based on an online website according to claim 1, characterized in that: the comparison in step S3 proceeds as follows: poll each demonstration site in turn, compare the static files in the result set it produced in S2 with the static files in the result sets that the other demonstration sites produced in S2, determine whether the MD5 values of picture-type static files are the same and whether the js, css, and txt static text files contain keyword information, and finally output the pictures with the same MD5 value and the text files that share the same file path and contain keyword information.
4. The CMS identification feature rule extraction method based on an online website according to claim 1, characterized in that: the specific process of step S4 is as follows:
SS1: aggregate the CMS feature set file paths of all demonstration sites and de-duplicate them to form set D1;
SS2: traverse each site in turn; remove from the set D1 formed in SS1 all feature set file paths that belong to the site currently being traversed, then append each remaining file path in the set to the URL of that site and request it;
SS3: check the response to each spliced URL and record the URLs that can be accessed normally with a returned status code equal to a preset value; these URLs comprise picture addresses and js, css, and txt static files;
for a picture address, calculate the MD5 value of the picture and check whether it is the same as the MD5 value on the other sites; if the MD5 values are the same, record the relative address of the picture file and the MD5 value;
for js, css, and txt static files, determine whether the file content contains keyword information; if so, record the relative address of the text file, the keyword, and the content of the line where the keyword appears;
the results recorded in step SS3 are merged to form result set R2; result sets R1 and R2 are returned, and representative rule files and paths are then identified in them.
5. The CMS identification feature rule extraction method based on an online website according to claim 1, characterized in that: the keyword information is the CMS name or another keyword associated with the CMS.
CN202011473245.8A 2020-12-15 2020-12-15 CMS (content management system) identification feature rule extraction method based on online website Active CN112434250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011473245.8A CN112434250B (en) 2020-12-15 2020-12-15 CMS (content management system) identification feature rule extraction method based on online website

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011473245.8A CN112434250B (en) 2020-12-15 2020-12-15 CMS (content management system) identification feature rule extraction method based on online website

Publications (2)

Publication Number Publication Date
CN112434250A (en) 2021-03-02
CN112434250B (en) 2022-07-12

Family

ID=74691629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011473245.8A Active CN112434250B (en) 2020-12-15 2020-12-15 CMS (content management system) identification feature rule extraction method based on online website

Country Status (1)

Country Link
CN (1) CN112434250B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420818A (en) * 2021-06-27 2021-09-21 杭州迪普科技股份有限公司 Content management system identification method and device
CN114969603A (en) * 2022-05-27 2022-08-30 中移互联网有限公司 5G message-based picture acquisition and picture generation method and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1549012A1 (en) * 2003-12-24 2005-06-29 DataCenterTechnologies N.V. Method and system for identifying the content of files in a network
US9654495B2 (en) * 2006-12-01 2017-05-16 Websense, Llc System and method of analyzing web addresses
CN104978423A (en) * 2015-06-30 2015-10-14 北京奇虎科技有限公司 Website type detection method and apparatus
EP3223174A1 (en) * 2016-03-23 2017-09-27 Tata Consultancy Services Limited Method and system for selecting sample set for assessing the accessibility of a website
TWI695277B (en) * 2018-06-29 2020-06-01 國立臺灣師範大學 Automatic website data collection method
CN109376291B (en) * 2018-11-08 2020-11-24 杭州安恒信息技术股份有限公司 Website fingerprint information scanning method and device based on web crawler
CN110489701A (en) * 2019-08-19 2019-11-22 安徽三实信息技术服务有限公司 Extract the method, apparatus and CMS recognition methods of CMS identification feature
CN110825941A (en) * 2019-10-17 2020-02-21 北京天融信网络安全技术有限公司 Content management system identification method, device and storage medium
CN111597490A (en) * 2020-05-21 2020-08-28 深圳前海微众银行股份有限公司 Web fingerprint identification method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN112434250A (en) 2021-03-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240229

Address after: 6/F, Building F2, Xingmengyuan Scientific Research, No. 198 Mingzhu Road, High tech Zone, Hefei City, Anhui Province, 230000

Patentee after: ANHUI SAN SHI SOFTWARE TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: Room 408, building a, 5F Pioneer Park, 118 science Avenue, high tech Zone, Hefei, Anhui 230000

Patentee before: ANHUI SANSHI INFORMATION TECHNOLOGY SERVICE CO.,LTD.

Country or region before: China