US12050652B2 - Service packaging method based on web page segmentation and search algorithm - Google Patents

Service packaging method based on web page segmentation and search algorithm

Info

Publication number
US12050652B2
Authority
US
United States
Prior art keywords
service
user
information
web page
calling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US17/614,978
Other versions
US20220245203A1 (en)
Inventor
Naibo Wang
Xiya LV
Zitong Yang
Tao Wang
Jianwei Yin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Assigned to ZHEJIANG UNIVERSITY reassignment ZHEJIANG UNIVERSITY ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: LV, Xiya, WANG, Naibo, WANG, TAO, YANG, Zitong, YIN, JIANWEI
Publication of US20220245203A1 publication Critical patent/US20220245203A1/en
Application granted granted Critical
Publication of US12050652B2 publication Critical patent/US12050652B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9538Presentation of query results
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577Optimising the visualization of content, e.g. distillation of HTML documents



Abstract

The present invention provides a service packaging method based on web page segmentation and search algorithm, comprising the following steps: a service extraction stage, comprising dynamic packaging and/or static packaging; for dynamic packaging, parsing a dynamic web page, tagging forms that possibly exist in parsed dynamic form information, and tagging and defining, by a user, desired forms among the forms that possibly exist; for static packaging, parsing a static web page, blocking and tagging parsed static forms, and selecting and defining, by the user, desired blocks, and filling in a name, description information and an extraction rule of a service; and a service calling stage, comprising inputting, by the user, related information for calling a service, and generating, by a back end system, a corresponding service according to the received related information for calling the service and according to the extraction rule, and returning the corresponding service to a front end. The present invention greatly increases the efficiency of acquiring data by a user.

Description

This is a U.S. national stage application of PCT Application No. PCT/CN2019/118991 under 35 U.S.C. 371, filed Nov. 15, 2019 in Chinese, claiming priority to Chinese Patent Application No. 201910447448.0, filed May 27, 2019, all of which are hereby incorporated by reference.
FIELD OF TECHNOLOGY
This present invention relates to the technical field of service computing, and in particular, to a service packaging method based on web page segmentation and search algorithm.
BACKGROUND
With the development of the Internet, service providers tend to display their service data through web pages. However, the various web pages that provide this convenience also restrict developers' use of the underlying source data. A service packaging system is intended to package the data in web pages into a service and provide a RESTful API, so that developers can call and use the service in their development process.
Web page block segmentation technology analyzes and processes existing web page documents: the whole web page is segmented into multiple blocks containing information data, enabling advertisement removal, main-information extraction and other functions. The main approaches include page block segmentation based on node entropy, page block segmentation based on visual features, and web page block segmentation based on content distance. Web page block segmentation technology has been widely used in various fields of the Internet industry.
A service is a collection of APIs with multiple attributes, belonging to a specific service class provided by a developer or a class of developers.
An API is a set of predefined functions designed to give applications and developers the ability to access a set of routines of certain software or hardware without having to access the source code or understand the details of the inner workings. An API has multiple input and output attributes, belongs to a specific developer, and is subordinate to a specific service.
A web crawler (also known as a web spider, web robot, and, in the FOAF community, more commonly as a web chaser) is a program or script that automatically crawls information on the World Wide Web according to certain rules. Other less commonly used names include ant, auto-index, simulator and worm.
SUMMARY OF THE INVENTION
The object of the present invention is to provide a service packaging method based on web page segmentation and search algorithm. The present invention greatly increases the efficiency of acquiring data by a user.
To realize the object of the present invention, the present invention provides the following technical solution:
a service packaging method based on web page segmentation and search algorithm, comprising the following steps:
a service extraction stage, comprising dynamic packaging and/or static packaging; for dynamic packaging, parsing a dynamic web page, tagging forms that possibly exist in parsed dynamic form information, and tagging and defining, by a user, desired forms among the forms that possibly exist; for static packaging, parsing a static web page, blocking and tagging parsed static forms, and selecting and defining, by the user, desired blocks, and filling in a name, description information and an extraction rule of a service;
and a service calling stage: inputting, by the user, related information for calling a service; generating, by a back end system, a corresponding service according to the received related information and the extraction rule; and returning the service to a front end.
The present invention provides a service packaging method based on web page segmentation and search algorithm which can automatically analyze a page and package it into a service through a module that completes the packaging with only several clicks and a small amount of input, generate crawler rules, and return the corresponding structured data according to the user's requirements, greatly improving the efficiency of data acquisition by a user.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to explain the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the drawings used in the embodiments. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is an implementation framework of a service packaging method based on web page segmentation and search algorithm and system provided by the present invention.
FIG. 2 is a user flow chart of a service packaging method based on web page segmentation and search algorithm and system provided by the present invention.
DESCRIPTION OF THE EMBODIMENTS
For better understanding of the purpose, technical solutions and advantages of the present invention, the following is a further detailed description of the present invention in combination with the attached drawings and embodiments. It should be understood that the embodiments described herein are intended only to explain the present invention and do not limit the claimed scope of the present invention.
The present invention provides a service packaging method based on web page segmentation and search algorithm, comprising a service extraction stage and a service calling stage. Service extraction is a module by which a packager can complete the packaging of a web page into a service with several clicks and a small amount of input. Service calling refers to calling a packaged service, providing several parameters to satisfy input and screening requirements; these parameters include uniform parameters as well as specific parameters generated for different web pages.
Taking the page http://www.ceic.ac.cn/history as an example, the running rule of the service packaging method based on web page segmentation and search algorithm is explained.
As shown in FIG. 1 and FIG. 2, the present invention provides a service packaging method based on web page segmentation and search algorithm, comprising the following processes:
Stage 1: Service Extraction
Service extraction mainly contains two functions, static packaging and dynamic packaging: static packaging of web pages in which data is directly presented, and dynamic packaging of web pages that require the input of certain query content and the clicking of a button to present data.
For a user, service extraction only requires a few clicks and filling in service description information. Service extraction implicitly applies two different extraction rules depending on the web page: for static web pages, in which the data is directly presented, static web page extraction is performed; for dynamic web pages, in which the user needs to perform input and clicking to present data, dynamic web page extraction is performed.
The dynamic packaging comprises the following steps:
    • S1-1: parsing a dynamic web page, specifically comprising:
      • S1-1-1: filling in, by the user himself or herself, a URL address, the address being any web link accessible via the Internet;
      • S1-1-2: using crawler technology to crawl the source code of the web page corresponding to the URL address, wherein the crawler tool is Selenium+BeautifulSoup+Pyquery in Python 3.6;
      • S1-1-3: judging whether there is a <form> tag on the searched page, converting the source code of the web page into structured class data, searching for form tags in the class data, and tagging them, wherein the tag information is a file in json format; an example is shown below:
{
 "url": "http://www.ceic.ac.cn/history",
 "form_check": 0,
 "main_form_index": 0,
 "forms": [
  {
   "id": 0,
   "main_btn_index": 0,
   "len": 11,
   "css_selector": "html > body:nth-child(2) > div > div:nth-child(3) > div > div > div:nth-child(2) > form",
   "input_list": [
    {
     "id": 0,
     "type": "text",
     "name": "start",
     "required": false,
     "css_selector": "html > body:nth-child(2) > div > div:nth-child(3) > div > div > div:nth-child(2) > form > div > input:nth-child(2)",
     "value": "",
     "query_name": "start",
     "description": "",
     "index": "T1"
    }
   ],
   "submit_button_list": [
    {
     "id": 0,
     "type": "a",
     "css_selector": "html > body:nth-child(2) > div > div:nth-child(3) > div > div > div:nth-child(2) > form > div:nth-child(5) > a:nth-child(4)",
     "index": "b1"
    }
   ]
  }
 ]
}
The parsing of the json file is as follows:
    • *url is a web address.
    • *form_check is 1/0; if form_check is 0, the form is not used for checking; if form_check is 1, the form is used for checking.
    • *main_form_index is the index number of the forms array selected by default below, used in conjunction with form_check.
    • *forms is an array where all forms in the page are stored (normally there is only one form in the page, but all forms are retrieved here, with the first being the default). Finally, the front end modifies the value of “main_form_index” according to the user's actual selection.
Each element in the *forms is parsed as follows, such as forms[0]:
    • *id and len are auxiliary information.
    • *All the css_selector are selectors for the form.
    • *main_btn_index is the index number of the main button in submit_button_list, which can be changed according to the user's selection.
    • *input_list is the parameter list of the form and also the input parameter list of the background api, such as input_list[0]:
    • *query_name is the query name which can be modified by the user and is a name used by our API.
    • *required indicates whether the input must be entered by the user.
    • *value is a default item of a value input by the user and can be modified by the user.
      • S1-1-4: continuously printing parsing log information to a GUI display in the background.
The log information is in the format Logging + timestamp → what is being processed. If parsing succeeds, the last line is 200 + the json file address of the web page table rule + the web page screenshot address; if it fails, the last line is 503 + "Procedure failed, please retry!", for example:
    • Logging 2018-11-26 20:23:14.572181→Crawl HTML Document from http://www.ceic.ac.cn/history, please wait for about 40 s(max 60 s)
    • Logging 2018-11-26 20:23:18.772222→Finding form table on http://www.ceic.ac.cn/history
    • Logging 2018-11-26 20:23:18.974834→Finished on http://www.ceic.ac.cn/history 200http://183.129.170.180:18088/statics/form_test1/form_list.json http://183.129.170.180:18088/static/form_test1/form_seg_shot.png
      • S1-1-5: using image processing techniques (the PIL library in Python) to tag all form information that possibly exists in the page, together with the position of each input box and each possible submit button in each form.
The position of each tagged element needs to be obtained during tagging; JavaScript's getBoundingClientRect function is used here to obtain the width and height of the element and its position within the image.
    • S1-2: selecting, by the user, a form and defining input parameter information, specifically comprising:
      • S1-2-1: independently selecting, by the user, whether the user himself or herself needs to use the form, if he or she does, selecting the form number, and if he or she doesn't, skipping this step;
      • S1-2-2: independently defining, by the user, a name and an example value of each input box, and selecting a number of the submit button;
      • S1-2-3: submitting the information modified by the user to the background, and generating, by the background, a form extraction rule based on the information.
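The dynamic-parsing steps above (S1-1-2 and S1-1-3) can be sketched in a few lines of Python. For brevity this sketch uses the standard library's html.parser in place of the Selenium+BeautifulSoup+Pyquery stack named in S1-1-2; the FormTagger class, the simplified schema, and the nth-child path builder are illustrative assumptions, not the patent's actual implementation.

```python
from html.parser import HTMLParser

VOID = {"input", "img", "br", "hr", "meta", "link"}  # tags with no closing tag

class FormTagger(HTMLParser):
    """Sketch of S1-1-3: find <form> tags, collect their inputs and submit
    buttons, and record a css_selector-style path for each element."""

    def __init__(self):
        super().__init__()
        self.path = []            # stack of (tag, nth-child index)
        self.child_counts = [0]   # element children seen at each open level
        self.forms = []
        self._form = None         # form currently being parsed, if any

    def css_selector(self):
        return " > ".join(tag if nth == 1 else f"{tag}:nth-child({nth})"
                          for tag, nth in self.path)

    def handle_starttag(self, tag, attrs):
        self.child_counts[-1] += 1
        self.path.append((tag, self.child_counts[-1]))
        self.child_counts.append(0)
        a = dict(attrs)
        if tag == "form":
            self._form = {"id": len(self.forms),
                          "css_selector": self.css_selector(),
                          "input_list": [], "submit_button_list": []}
            self.forms.append(self._form)
        elif self._form is not None and tag in ("input", "button"):
            dest = (self._form["submit_button_list"]
                    if a.get("type") == "submit" or tag == "button"
                    else self._form["input_list"])
            dest.append({"id": len(dest), "type": a.get("type", tag),
                         "name": a.get("name", ""),
                         "css_selector": self.css_selector()})
        if tag in VOID:  # void elements never get an end tag; pop immediately
            self.path.pop()
            self.child_counts.pop()

    def handle_endtag(self, tag):
        if tag == "form":
            self._form = None
        if self.path and self.path[-1][0] == tag:
            self.path.pop()
            self.child_counts.pop()

tagger = FormTagger()
tagger.feed('<html><body><form action="/q">'
            '<input type="text" name="start"><input type="submit">'
            '</form></body></html>')
form = tagger.forms[0]
print(form["css_selector"])           # html > body > form
print(form["input_list"][0]["name"])  # start
```

On a real page, the HTML fed into such a parser would be the rendered source retrieved by the Selenium crawler of step S1-1-2, since dynamic pages often build their forms with JavaScript.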
The static packaging comprises the following steps:
    • S1-3: parsing a static web page, specifically comprising:
      • S1-3-1: using crawler technology to crawl a source code of a web page corresponding to the URL address.
      • S1-3-2: using a breadth-first search algorithm to search all items that possibly exist in the page.
    • Wherein, the breadth-first search algorithm is as follows: generating a DOM tree structure of the page, creating a traversal sequence list, putting the HTML node in the list, traversing the list sequentially, and appending the child nodes of each node to the end of the list until all the nodes are traversed.
      • S1-3-3: using page segmentation algorithm to merge all items with the same structure into a block.
      • Wherein, the web page segmentation algorithm is: calculating the label paths of all nodes and comparing each label path with the label paths of its sibling nodes. If two label paths are the same, the two nodes are in the same block. The algorithm merges all nodes with the same label path into the same block.
      • S1-3-4: using a weighted sorting algorithm to screen out at most 10 largest blocks.
    • Wherein, the weighted sorting algorithm is as follows: sorting the blocks by the number of list items in each block in descending order and taking the first 15; sorting the blocks by block size in descending order and taking the first 15; taking the intersection of the two lists and selecting the first 10 blocks as the largest blocks finally selected.
      • S1-3-5: using image processing technology to tag the screened blocks.
      • S1-3-6: continuously printing parsing log information to a GUI display in the background. The log information is in the format Logging + timestamp → what is being processed. If parsing succeeds, the last line is 200 + the json file address of the web page table rule + the web page screenshot address; if it fails, the last line is 503 + "Procedure failed, please retry!", for example:
    • 2018-11-26 13:31:21.723678→Crawl HTML Document from http://www.ceic.ac.cn/history, please wait for about 40 s(max 60 s)
    • 2018-11-26 13:31:25.826979→Handling the input forms on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:31.362985→Run Pruning on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:31.365060→Run Partial Tree Matching on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:31.380825→Run Backtracking on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:31.381136→Merging and generating rules and blocks on http://www.ceic.ac.cn/history,process:1%
    • 2018-11-26 13:31:31.381136→Merging and generating rules and blocks on http://www.ceic.ac.cn/history,process:50%
    • 2018-11-26 13:31:43.596304→Merging and generating rules and blocks on http://www.ceic.ac.cn/history, process:99%
    • 2018-11-26 13:31:43.672126→Selecting the main segment on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:44.780230→Generating sections on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:45.128704→Generating api_info on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:45.183179→Generating rules on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:45.183506→Output Result JSON File on http://www.ceic.ac.cn/history
    • 2018-11-26 13:31:45.194587→Finished on http://www.ceic.ac.cn/history
    • 200 http://183.129.170.180:18088/statics/15432390785389/api_info.json
    • S1-4: selecting, by the user, blocks and defining input parameter information.
      • S1-4-1: independently selecting, by the user, a number of the block desired by the user himself or herself.
      • S1-4-2: defining, by the user, a name and description for each data item in the blocks automatically analyzed by the system, and judging whether the blocks are desired.
      • S1-4-3: filling in, by the user, a name and description information of a to-be-generated service.
      • S1-4-4: submitting, by the system, the service information modified by the user and the extraction rule of each item to a service generation background in a json format.
    • Wherein the extraction rule algorithm is as follows: starting from the HTML tag and following the DOM tree structure, using > to represent the parent-child relationship and nth-child(i) to represent the i-th child node of a layer, thereby generating the extraction rule in the form of a css_selector.
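Steps S1-3-2 through S1-3-4 above can be sketched over a toy DOM of nested dicts. The node shape and the label_path, segment and screen helpers are illustrative assumptions for this sketch, not the patent's code.

```python
def bfs(root):
    """S1-3-2: put the root node in a list, traverse it sequentially, and
    append each node's children to the end until all nodes are visited."""
    order, i = [root], 0
    while i < len(order):
        order.extend(order[i].get("children", []))
        i += 1
    return order

def label_path(node):
    """Tag path from the root down to this node, e.g. 'html/body/ul/li'."""
    parts = []
    while node is not None:
        parts.append(node["tag"])
        node = node.get("parent")
    return "/".join(reversed(parts))

def segment(root):
    """S1-3-3: merge all nodes sharing the same label path into one block."""
    blocks = {}
    for node in bfs(root):
        for child in node.get("children", []):
            child["parent"] = node  # parents are visited first in BFS order
        blocks.setdefault(label_path(node), []).append(node)
    return blocks

def subtree_size(node):
    return 1 + sum(subtree_size(c) for c in node.get("children", []))

def screen(blocks, k=10, pre=15):
    """S1-3-4: intersect the top-`pre` blocks by item count with the
    top-`pre` by total size, then keep the first `k`."""
    by_count = sorted(blocks, key=lambda p: len(blocks[p]), reverse=True)[:pre]
    by_size = sorted(blocks, key=lambda p: sum(map(subtree_size, blocks[p])),
                     reverse=True)[:pre]
    return [p for p in by_count if p in by_size][:k]

ul = {"tag": "ul", "children": [{"tag": "li"} for _ in range(3)]}
root = {"tag": "html", "children": [{"tag": "body", "children": [ul]}]}
blocks = segment(root)
print(len(blocks["html/body/ul/li"]))  # 3: the repeated <li> items form one block
print(screen(blocks, k=1))             # ['html/body/ul/li']
```

The repeated list items (here the three li nodes) end up in one block because they share a label path, which is exactly the structure a data listing produces on a real result page.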
The following is an example of a service information file:
{
 "api_name": "China Earthquake Network -- historical query",
 "form_rules_link": "http://183.129.170.180:18088/statics/form_list.json",
 "api_description": "China Earthquake Network -- historical query",
 "api_url": "http://www.ceic.ac.cn/history",
 "img_link": "http://183.129.170.180:18088/static/test/seg_shot.png",
 "api_crawl_rules_link": "http://183.129.170.180:18088/statics/test/rules_list.json",
 "json_link": "http://183.129.170.180:18088/statics/test/example.json",
 "main_sec_id": 0,
 "sections": [
  "Magnitude (M) seismogenic time (UTC+8) Latitude (°) Longitude (°) depth (km) Reference position"
 ],
 "candidate": [
  [
   {
    "id": 0,
    "name": "text_0",
    "description": "text_description",
    "type": "text",
    "example": "Latitude: ",
    "select": 1,
    "parent_id": -1
   },
   {
    "id": 1,
    "name": "text_1",
    "description": "text_description",
    "type": "text",
    "example": "greater than",
    "select": 1,
    "parent_id": -1
   }
  ]
 ]
}
The format of the file is described as follows:
    • *api_name is an api name defined by the user.
    • *api_description is an api description defined by the user.
    • *img_link field is the link address of the main part of the image captured.
    • *api_crawl_rules_link field is the json file address of the crawler rule.
    • *json_link field is, by default, the json file address of the largest block.
    • *main_sec_id is the default index value of the subject block. It is usually 0, but the value thereof can be changed according to the user's actual requirements.
    • *sections is an array of strings, wherein each item is the literal content of a block. sections[main_sec_id] is directly indexed to obtain the literal content of the subject block.
    • *api_info["candidate"] is a two-dimensional array with each item containing the element information of each topic block. Element information is parsed as follows:
{
  "id": 0,                            // Element ID information
  "name": "text_0",                   // Element name
  "description": "text_description",  // Element description
  "type": "text",                     // Element type
  "example": "Latitude: ",            // Element example value
  "select": 1,                        // Whether to output this element
  "parent_id": -1                     // ID of the parent element of the element
}
Wherein "type" has three possible values: text, img and link, representing the text, picture and hyperlink types respectively. parent_id is used for background identification when the element is queried in the service calling stage.
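For illustration, the candidate array of such a service information file can be consumed as follows. This is a hedged sketch: the field names follow the file format above, but the selected_elements helper is our own name, not from the patent.

```python
import json

# Parse a (shortened) service information file and list the elements the
# user chose to output (select == 1) in the subject block.
service_info = json.loads("""
{
  "api_name": "China Earthquake Network -- historical query",
  "main_sec_id": 0,
  "candidate": [
    [
      {"id": 0, "name": "text_0", "description": "text_description",
       "type": "text", "example": "Latitude: ", "select": 1, "parent_id": -1},
      {"id": 1, "name": "text_1", "description": "text_description",
       "type": "text", "example": "greater than", "select": 0, "parent_id": -1}
    ]
  ]
}
""")

def selected_elements(info):
    """Return (name, type) for every element marked for output."""
    block = info["candidate"][info["main_sec_id"]]
    return [(e["name"], e["type"]) for e in block if e["select"] == 1]

print(selected_elements(service_info))
# [('text_0', 'text')]
```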
An example and description of a crawler rule JSON file are as follows:
[{
  "record_id": 0,
  "texts": [{                 // Rule description for all text information
    "id": 0,                  // Corresponding to name in the api parameter list, unique
    "css_selector": "",       // Position of the text tag, located together with the rank below
    "rank": 0                 // Number of the position of the text in the node
  }],
  "images": [{                // Rule description for all image information
    "id": "img_1",
    "type": "img/back_img",   // img or back_img is defined according to the image form
    "css_selector": ""        // Select src, alt, title according to this value (the latter two attributes may not exist)
  }],
  "links": [{                 // Rule description for all link information
    "id": 0,
    "css_selector": "",       // Obtain href
    "texts": [{               // Text information description in the link information
      "id": 0,
      "css_selector": "",
      "rank": 0
    }],
    "images": [{              // Image information description in the link information
      "id": 0,
      "type": "img/back_img",
      "css_selector": ""
    }]
  }]
},
{
  "record_id": 1              // Every record has a unique css_selector
}]
Wherein the text information needs to be located based on both the css_selector and the rank information. img comprises image information in the img tag and background image information in the css, which need to be extracted according to different types of extraction rules. The extraction rule is as follows: if the image is an image with the img tag, then the link address of the img tag is extracted. If the image is the link information of a background image, then it is necessary to search the background-image property in the css of the element where the image is located, and then extract the corresponding link.
The background can parse the page according to this parsing rule file.
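The image branch of this extraction rule can be sketched as follows. The element is assumed to be already located via its css_selector and is represented here as a plain attribute dict; extract_image is an illustrative name, not from the patent.

```python
import re

# Apply the image extraction rule: an <img> tag yields its src link;
# a CSS background image is pulled from the background-image property
# of the element's style.
def extract_image(element, img_type):
    if img_type == "img":
        # an image with the img tag: extract the link address from src
        return element.get("src")
    if img_type == "back_img":
        # a background image: search background-image in the css
        m = re.search(r"background-image:\s*url\(['\"]?(.*?)['\"]?\)",
                      element.get("style", ""))
        return m.group(1) if m else None
    return None

print(extract_image({"src": "/static/a.png"}, "img"))
# /static/a.png
print(extract_image(
    {"style": "background-image: url('/static/b.png');"}, "back_img"))
# /static/b.png
```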
    • S1-5: generating the service.
      • S1-5-1: parsing, by the service generation background, the service information and the extraction rule information, and checking fault tolerance.
      • S1-5-2: generating, by the service generation background, a service desired by the user and an address and a query parameter corresponding to calling the service, for waiting for calling.
Wherein, the service generation background needs to store the service information and the extraction rule defined by the user in its own database and query the service information again when the service is called.
In this example, the generated service API address is: URL://call_service/79, which indicates that the generated service ID is 79. This interface complies with the REST API specification, so the user can use it to query, call, delete, and modify the service information.
Stage 2: Service Calling
Service calling refers to calling a packaged service while providing several parameters to satisfy input and screening requirements. These parameters include uniform parameters as well as specific parameters generated from different web pages.
The user can call the corresponding service by checking the service information and writing a RESTful API call.
The specific steps are as follows:
    • S2-1: filling in, by the user, a query parameter specified by the service, and calling API.
Wherein the query parameters cover each of the form input options, so that the input query can be performed; they also cover each field of the returned results, so that the user can perform screening on the returned results by these parameters. Meanwhile, the maximum-number-of-pages parameter __max_page solves the pagination problem in the web page; the default is 5 pages of data.
For example, the url called by the user is:
 url://call_service/79?__max_page=7&weidu1=30&jingdu2=20&magnitude (M)=3.5&link_5.reference position=Gengma County, Lincang City in Yunnan Province
Wherein __max_page is the maximum number of pages. weidu1 and jingdu2 are the input query parameters of the form; magnitude (M) and link_5.reference position are the returned-result parameters of the service, where "link_5.reference position" refers to the reference position parameter under link_5.
The meaning of the above link is: capture the paginated content, up to 7 pages, with the input parameter weidu1 set to 30 and jingdu2 set to 20; when outputting, select for display only the data with magnitude (M)=3.5 and link_5.reference position=Gengma County, Lincang City in Yunnan Province.
If Chinese characters exist, they need to be encoded as UTF-8.
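As an illustration, the example call URL above can be assembled with UTF-8 percent-encoding using Python's standard library. Parameter names and values are taken from the example in the text; the exact encoding scheme used by the real service is not specified in the patent, so this is only an assumed sketch.

```python
from urllib.parse import urlencode

# Build the example call URL; urlencode percent-encodes the values as
# UTF-8, which also covers Chinese-character parameter values.
params = {
    "__max_page": 7,
    "weidu1": 30,
    "jingdu2": 20,
    "magnitude (M)": 3.5,
    "link_5.reference position": "Gengma County, Lincang City in Yunnan Province",
}
url = "url://call_service/79?" + urlencode(params)
print(url)
```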
    • S2-2: opening, by a calling background, a real URL address corresponding to the API by using crawler technology according to an address of the API called by the user.
    • S2-3: deciding, by the calling background, whether to fill in and query the form information according to the user's selection upon packaging the service.
Wherein if the user needs to fill in the form information, the form content is filled in according to the query parameter values input by the user. The supported input box types are the form element tags supported by HTML5, such as:
    • text, number, email, password, textarea, radio, checkbox, select, datalist, button, submit, reset.
    • S2-4: using, by the calling background, crawler technology to crawl a source code of the page after the form is processed.
    • S2-5: extracting, by the system, related items in the page according to the stored extraction rule information, and performing structural conversion and generating a returned result according to a name and parameters of the returned result defined by the user.
    • S2-6: performing screening, by the calling background, on the returned result according to the query parameter of the user.
    • S2-7: returning, by the system, a calling result to the front end.
Wherein an example of the returned result is as follows:
 [{"link_5": {"href": "http://news.ceic.ac.cn/CD20181127064153.html", "reference position": "Gengma county, Lincang city in Yunnan Province"}, "record_id": 0, "seismogenic time (UTC+8)": "2018-11-27 06:41:53", "depth (km)": "8", "latitude (°)": "23.49", "longitude (°)": "99.46", "magnitude (M)": "3.5"}]
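The screening of step S2-6 can be sketched as follows. This is an assumed implementation: dotted keys such as link_5.reference position address nested fields of a record, and screen and get_field are illustrative names, not from the patent.

```python
# Screen returned records against the user's returned-result parameters.
def get_field(record, key):
    """Resolve a possibly dotted key (e.g. 'link_5.reference position')."""
    obj = record
    for part in key.split("."):
        obj = obj.get(part, {}) if isinstance(obj, dict) else {}
    return obj if obj != {} else None

def screen(records, filters):
    """Keep only records whose fields match every filter value."""
    return [r for r in records
            if all(str(get_field(r, k)) == str(v) for k, v in filters.items())]

records = [
    {"record_id": 0, "magnitude (M)": "3.5",
     "link_5": {"reference position": "Gengma County"}},
    {"record_id": 1, "magnitude (M)": "4.2",
     "link_5": {"reference position": "Elsewhere"}},
]
filters = {"magnitude (M)": "3.5",
           "link_5.reference position": "Gengma County"}
print(screen(records, filters))  # keeps only the record with record_id 0
```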
The service packaging method based on web page segmentation and search algorithm has been described above. The packaging method analyzes any type of web page and automatically parses out the main information that may exist in the page; it then parses out the format of each block after segmenting the information, so that, after a simple modification by the user, the page can be converted into a service that can be called directly and returns the formatted, structured data that the user needs. Meanwhile, the present invention provides a dynamic form query function: if a dynamic form exists in the page, the form query boxes can be converted into query parameters for the user. Compared with a traditional crawler, the present invention can automatically analyze the page, generate crawler rules, and return the corresponding structured data according to the user's requirements, thereby greatly increasing the efficiency with which a user acquires data.
The foregoing is merely illustrative of the preferred embodiments of the present invention and it should be understood that the embodiments described above are only the most preferable embodiments of the present invention and are not intended to be limiting of the present invention, and various changes and modifications may be made by those skilled in the art. Any modifications, equivalent substitutions, improvements, and the like within the spirit and principles of the present invention are intended to be included within the scope of the present invention.

Claims (5)

What is claimed is:
1. A computer-implemented service packaging method based on web page segmentation and search algorithm, comprising following steps executed by a processor:
conducting a service extraction stage comprising dynamic packaging and static packaging, wherein, for dynamic packaging, parsing a dynamic web page, tagging forms that exist in parsed dynamic form information, and tagging and defining, by a user, desired forms among the forms that exist; for static packaging, parsing a static web page, blocking and tagging parsed static forms, and selecting and defining, by the user, desired blocks, and filling in a name, description information and an extraction rule of a service;
conducting a service calling stage, wherein the user inputs related information for calling a service, and generating, by a back end system, a respective service according to the received related information for calling the service and according to an extraction rule, and returning the service to a front end; and
wherein the dynamic packaging comprises following steps:
S1-1: parsing a dynamic web page, specifically comprising:
S1-1-1: filling in, by a user, a Uniform Resource Locator (URL) address, the URL address being any web link accessible by Internet;
S1-1-2: using crawler technology to crawl a source code of a web page corresponding to the URL address;
S1-1-3: judging whether there is a <form> tag in a search page, converting the source code of the web page into a structured class data, and searching the <form> tag in the class data, and tagging the <form> tag;
S1-1-4: constantly printing out parsed log information in a Graphical User Interface (GUI) display background;
S1-1-5: using image processing technique to tag all form information that exist in the dynamic web page, and location of each input box and a submit button in each form;
S1-2: selecting, by the user, a form and defining input parameter information, specifically comprising:
S1-2-1: independently selecting, by the user, whether the user needs to use the form, if the user does, selecting the form number, and if the user doesn't, skipping this step;
S1-2-2: independently defining, by the user, a name and a sample value of each input box, and selecting a number of the submit button; and
S1-2-3: submitting the information modified by the user to a background, and generating, by the background, a form extraction rule based on the information; and
wherein, the static packaging comprises:
S1-3: parsing a static web page, specifically comprising:
S1-3-1: using crawler technology to crawl a source code of a web page corresponding to a Uniform Resource Locator (URL) address;
S1-3-2: using a breadth-first search algorithm to search all items that exist in the static web page;
S1-3-3: using page segmentation algorithm to merge all items with the same structure into a block;
S1-3-4: using a weighted sorting algorithm to screen out at most 10 largest blocks;
S1-3-5: using image processing technology to tag the screened blocks;
S1-3-6: constantly printing out parsed log information in a Graphical User Interface (GUI) display background;
S1-4: selecting, by the user, the screened blocks and defining input parameter information;
S1-4-1: independently selecting, by the user, a number of the screened blocks desired by the user;
S1-4-2: defining, by the user, a name and description of data number in the screened blocks automatically analyzed by a system, and judging whether the screened blocks are desired;
S1-4-3: filling in, by the user, a name and description information of a to-be-generated service;
S1-4-4: submitting, by the system, the service information modified by the user and the extraction rule of each item to a service generation background in a JSON format;
S1-5: generating a service;
S1-5-1: parsing, by the service generation background, the service information and the extraction rule information, and checking fault tolerance;
S1-5-2: generating, by the service generation background, the service desired by the user, and an address, and a query parameter corresponding to calling the service, for waiting for calling;
wherein, a crawler tool of the crawler technology is Selenium+BeautifulSoup+Pyquery in Python™3.6.
2. The computer-implemented service packaging method according to claim 1, wherein, the breadth-first search algorithm is as follows:
generating a Document Object Model (DOM) tree structure of the web page, creating a traversal sequence list, putting HyperText Markup Language(HTML) nodes in the traversal sequence list, traversing the traversal sequence list sequentially, putting child nodes of each node at end of the list until all the nodes are traversed.
3. The computer-implemented service packaging method according to claim 1, wherein, the weighted sorting algorithm is as follows: sorting a first 15 blocks to create a first list according to numbers of list items in each block from large to small; sorting another first 15 blocks to create a second list according to a block size of each block from large to small; and selecting intersection of the first and second lists and selecting first 10 blocks as a largest block finally selected.
4. The computer-implemented service packaging method according to claim 1, wherein, a specific process of the service calling stage is:
S2-1: filling in, by the user, a query parameter specified by the service, and calling Application Programming Interface (API);
S2-2: opening, by a calling background, a real Uniform Resource Locator (URL) address corresponding to the API by using crawler technology according to an address of the API called by the user;
S2-3: deciding, by the calling background, whether to fill in and query the form information according to the user's selection upon packaging the service;
S2-4: using, by the calling background, crawler technology to crawl a source code of the web page after the form is processed;
S2-5: extracting, by the system, related items in the page according to the stored extraction rule information, and performing structural conversion and generating a returned result according to a name and parameters of the returned result defined by the user;
S2-6: performing screening, by the calling background, on the returned result according to the query parameter of the user;
S2-7: returning, by the system, a calling result to the front end.
5. A computer-implemented service packaging method based on web page segmentation and search algorithm, comprising following steps executed by a processor:
conducting a service extraction stage comprising dynamic packaging and static packaging, wherein, for dynamic packaging, parsing a dynamic web page, tagging forms that exist in parsed dynamic form information, and tagging and defining, by a user, desired forms among the forms that exist; for static packaging, parsing a static web page, blocking and tagging parsed static forms, and selecting and defining, by the user, desired blocks, and filling in a name, description information and an extraction rule of a service;
conducting a service calling stage, wherein the user inputs related information for calling a service, and generating, by a back end system, a respective service according to the received related information for calling the service and according to an extraction rule, and returning the service to a front end; and
wherein the dynamic packaging comprises following steps:
S1-1: parsing a dynamic web page, specifically comprising:
S1-1-1: filling in, by a user, a Uniform Resource Locator (URL) address, the URL address being any web link accessible by Internet;
S1-1-2: using crawler technology to crawl a source code of a web page corresponding to the URL address;
S1-1-3: judging whether there is a <form> tag in a search page, converting the source code of the web page into a structured class data, and searching the <form> tag in the class data, and tagging the <form> tag;
S1-1-4: constantly printing out parsed log information in a Graphical User Interface (GUI) display background;
S1-1-5: using image processing technique to tag all form information that exist in the dynamic web page, and location of each input box and a submit button in each form;
S1-2: selecting, by the user, a form and defining input parameter information, specifically comprising:
S1-2-1: independently selecting, by the user, whether the user needs to use the form, if the user does, selecting the form number, and if the user doesn't, skipping this step;
S1-2-2: independently defining, by the user, a name and a sample value of each input box, and selecting a number of the submit button; and
S1-2-3: submitting the information modified by the user to a background, and generating, by the background, a form extraction rule based on the information; and
wherein, the static packaging comprises:
S1-3: parsing a static web page, specifically comprising:
S1-3-1: using crawler technology to crawl a source code of a web page corresponding to a Uniform Resource Locator (URL) address;
S1-3-2: using a breadth-first search algorithm to search all items that exist in the static web page;
S1-3-3: using page segmentation algorithm to merge all items with the same structure into a block;
S1-3-4: using a weighted sorting algorithm to screen out at most 10 largest blocks;
S1-3-5: using image processing technology to tag the screened blocks;
S1-3-6: constantly printing out parsed log information in a Graphical User Interface (GUI) display background;
S1-4: selecting, by the user, the screened blocks and defining input parameter information;
S1-4-1: independently selecting, by the user, a number of the screened blocks desired by the user;
S1-4-2: defining, by the user, a name and description of data number in the screened blocks automatically analyzed by a system, and judging whether the screened blocks are desired;
S1-4-3: filling in, by the user, a name and description information of a to-be-generated service;
S1-4-4: submitting, by the system, the service information modified by the user and the extraction rule of each item to a service generation background in a JSON format;
S1-5: generating a service;
S1-5-1: parsing, by the service generation background, the service information and the extraction rule information, and checking fault tolerance;
S1-5-2: generating, by the service generation background, the service desired by the user, and an address, and a query parameter corresponding to calling the service, for waiting for calling;
wherein, a crawler tool of the crawler technology is Selenium+BeautifulSoup+Pyquery in Python™3.6;
wherein, the breadth-first search algorithm is as follows: generating a Document Object Model (DOM) tree structure of the web page, creating a traversal sequence list, putting HyperText Markup Language(HTML) nodes in the traversal sequence list, traversing the traversal sequence list sequentially, putting child nodes of each node at end of the list until all the nodes are traversed;
wherein, the weighted sorting algorithm is as follows: sorting a first 15 blocks to create a first list according to numbers of list items in each block from large to small; sorting another first 15 blocks to create a second list according to a block size of each block from large to small; and selecting intersection of the first and second lists and selecting first 10 blocks as a largest block finally selected; and
wherein, a specific process of the service calling stage is:
S2-1: filling in, by the user, a query parameter specified by the service, and calling Application Programming Interface (API);
S2-2: opening, by a calling background, a real Uniform Resource Locator (URL) address corresponding to the API by using crawler technology according to an address of the API called by the user;
S2-3: deciding, by the calling background, whether to fill in and query the form information according to the user's selection upon packaging the service;
S2-4: using, by the calling background, crawler technology to crawl a source code of the web page after the form is processed;
S2-5: extracting, by the system, related items in the page according to the stored extraction rule information, and performing structural conversion and generating a returned result according to a name and parameters of the returned result defined by the user;
S2-6: performing screening, by the calling background, on the returned result according to the query parameter of the user;
S2-7: returning, by the system, a calling result to the front end.
US17/614,978 2019-05-27 2019-11-15 Service packaging method based on web page segmentation and search algorithm Active 2040-08-18 US12050652B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201910447448.0A CN110222251B (en) 2019-05-27 2019-05-27 Service packaging method based on webpage segmentation and search algorithm
CN201910447448.0 2019-05-27
PCT/CN2019/118991 WO2020238070A1 (en) 2019-05-27 2019-11-15 Web page segmentation and search algorithm-based service packaging method

Publications (2)

Publication Number Publication Date
US20220245203A1 US20220245203A1 (en) 2022-08-04
US12050652B2 true US12050652B2 (en) 2024-07-30

Family

ID=67818447

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/614,978 Active 2040-08-18 US12050652B2 (en) 2019-05-27 2019-11-15 Service packaging method based on web page segmentation and search algorithm

Country Status (3)

Country Link
US (1) US12050652B2 (en)
CN (1) CN110222251B (en)
WO (1) WO2020238070A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222251B (en) * 2019-05-27 2022-04-01 浙江大学 Service packaging method based on webpage segmentation and search algorithm
CN112381297A (en) * 2020-11-16 2021-02-19 国家电网公司华中分部 Method for predicting medium-term and long-term electricity consumption in region based on social information calculation
CN112347332A (en) * 2020-11-17 2021-02-09 南开大学 XPath-based crawler target positioning method
CN112565437B (en) * 2020-12-07 2021-11-19 浙江大学 Service caching method for cross-border service network
CN112818200A (en) * 2021-01-28 2021-05-18 平安普惠企业管理有限公司 Data crawling and event analyzing method and system based on static website
CN119621060A (en) * 2025-02-12 2025-03-14 沈阳二一三电子科技有限公司 A packaging method, device, equipment and storage medium for conditional search components


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281714A (en) * 2014-10-29 2015-01-14 南通大学 Hospital portal website clinic specialist information extracting system
CN104899551B (en) * 2015-04-30 2018-08-14 北京大学 A kind of form image sorting technique
CN107391675B (en) * 2017-07-21 2021-03-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating structured information
CN110222251B (en) * 2019-05-27 2022-04-01 浙江大学 Service packaging method based on webpage segmentation and search algorithm

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070055656A1 (en) * 2005-08-01 2007-03-08 Semscript Ltd. Knowledge repository
CN101004760A (en) 2007-01-10 2007-07-25 苏州大学 Method for extracting page query interface based on character of vision
US9881323B1 (en) * 2007-06-22 2018-01-30 Twc Patent Trust Llt Providing hard-to-block advertisements for display on a webpage
US20110296291A1 (en) * 2007-11-15 2011-12-01 Olya Melkinov System and method for transforming documents for publishing electronically
US20090171999A1 (en) * 2007-12-27 2009-07-02 Cloudscale Inc. System and Methodology for Parallel Stream Processing
CN101515287A (en) 2009-03-24 2009-08-26 崔志明 Automatic generating method of wrapper of complex page
US20110321160A1 (en) * 2010-06-24 2011-12-29 Mcafee, Inc. Systems and methods to detect malicious media files
WO2013016139A1 (en) 2011-07-22 2013-01-31 Alibaba Group Holding Limited Configuring web crawler to extract web page information
CN103034690A (en) 2012-11-28 2013-04-10 华南理工大学 Self-customizing method of mobile terminal client application program based on web service
US10521496B1 (en) * 2014-01-03 2019-12-31 Amazon Technologies, Inc. Randomize markup to disturb scrapers
US20150193402A1 (en) * 2014-01-09 2015-07-09 International Business Machines Corporation Tracking javascript actions
US10534851B1 (en) * 2014-12-19 2020-01-14 BloomReach Inc. Dynamic landing pages
CN105516337A (en) 2015-12-28 2016-04-20 南京大学金陵学院 Web site docking analysis method based on dynamic loading mechanism
US20190034441A1 (en) * 2016-09-23 2019-01-31 Hvr Technologies Inc. Digital communications platform for webpage overlay
US20200110781A1 (en) * 2018-10-09 2020-04-09 Voluware, Inc. Interactive website automation for health care web portals with random content
US20210049234A1 (en) * 2019-08-15 2021-02-18 Kumar ANIL Web Element Rediscovery System and Method
US11205041B2 (en) * 2019-08-15 2021-12-21 Anil Kumar Web element rediscovery system and method

Also Published As

Publication number Publication date
WO2020238070A1 (en) 2020-12-03
CN110222251A (en) 2019-09-10
US20220245203A1 (en) 2022-08-04
CN110222251B (en) 2022-04-01


Legal Events

Date Code Title Description
AS Assignment

Owner name: ZHEJIANG UNIVERSITY, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WANG, NAIBO;LV, XIYA;YANG, ZITONG;AND OTHERS;REEL/FRAME:058231/0511

Effective date: 20211126

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

ZAAA Notice of allowance and fees due

Free format text: ORIGINAL CODE: NOA

ZAAB Notice of allowance mailed

Free format text: ORIGINAL CODE: MN/=.

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE