HK1173821A1 - A method and system for extracting information of a web page - Google Patents

Info

Publication number
HK1173821A1
Authority
HK
Hong Kong
Prior art keywords
web page
extracting information
extracting
information
web
Prior art date
Application number
HK13101167.0A
Other languages
English (en)
Chinese (zh)
Inventor
蔡波洋
强琦
Original Assignee
阿里巴巴集團控股有限公司
Priority date
Filing date
Publication date
Application filed by 阿里巴巴集團控股有限公司
Publication of HK1173821A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G06F40/143 Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/972 Access to data in other repository systems, e.g. legacy data or dynamic Web page generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
HK13101167.0A 2011-06-15 2013-01-28 A method and system for extracting information of a web page HK1173821A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110161113.6A CN102831121B (zh) 2011-06-15 2011-06-15 A method and system for extracting web page information

Publications (1)

Publication Number Publication Date
HK1173821A1 (en) 2013-05-24

Family

ID=47334264

Family Applications (1)

Application Number Title Priority Date Filing Date
HK13101167.0A HK1173821A1 (en) 2011-06-15 2013-01-28 A method and system for extracting information of a web page

Country Status (7)

Country Link
US (2) US9053206B2 (en)
EP (1) EP2721517B1 (en)
JP (2) JP5944985B2 (ja)
CN (1) CN102831121B (zh)
HK (1) HK1173821A1 (en)
TW (1) TW201250492A (en)
WO (1) WO2012174137A1 (en)

Families Citing this family (54)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10055718B2 (en) 2012-01-12 2018-08-21 Slice Technologies, Inc. Purchase confirmation data extraction with missing data replacement
US20140201623A1 (en) * 2013-01-17 2014-07-17 Bazaarvoice, Inc Method and system for determining and using style attributes of web content
CA2816781C (en) 2013-05-28 2022-07-05 Ibm Canada Limited - Ibm Canada Limitee Identifying client states
US9747262B1 (en) * 2013-06-03 2017-08-29 Ca, Inc. Methods, systems, and computer program products for retrieving information from a webpage and organizing the information in a table
WO2015013899A1 (en) * 2013-07-31 2015-02-05 Empire Technology Development Llc Information extraction from semantic data
CN103942309B (zh) * 2014-04-18 2017-06-30 网易乐得科技有限公司 Network data acquisition device and method, and method for implementing the acquisition process
TWI549008B (zh) * 2014-07-30 2016-09-11 Chunghwa Telecom Co Ltd A large number of data into the system and methods of screening management
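CN104462540B (zh) * 2014-12-24 2018-03-30 中国科学院声学研究所 Web page information extraction method
CN104572934B (zh) * 2014-12-29 2016-03-30 西安交通大学 DOM-based method for extracting key content of a web page
CN104794168B (zh) * 2015-03-30 2018-06-05 明博教育科技有限公司 Knowledge point association method and system
CN106547520B (zh) * 2015-09-16 2021-05-28 腾讯科技(深圳)有限公司 Code path analysis method and device
CN105095527A (zh) * 2015-09-29 2015-11-25 北京奇虎科技有限公司 Search method and device based on link address
CN105426352A (zh) * 2015-11-24 2016-03-23 国家电网公司 Method for automatically generating template documents
CN105677638B (zh) * 2016-01-05 2018-10-09 北京工业大学 Web information extraction method
KR101722161B1 (ko) * 2016-01-06 2017-04-03 (주)포그리트 Apparatus for analyzing website usability and method for analyzing website usability using the same
KR101722157B1 (ko) * 2016-01-06 2017-04-03 (주)포그리트 Information collection apparatus and method for collecting website information using the same
CN107807927B (zh) * 2016-09-08 2022-04-29 阿里巴巴(中国)有限公司 Page parsing method, device, client device, and system based on issued rules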
US10331758B2 (en) * 2016-09-23 2019-06-25 Hvr Technologies Inc. Digital communications platform for webpage overlay
GB2558870A (en) * 2016-10-25 2018-07-25 Parrotplay As Internet browsing
CN108009171B (zh) * 2016-10-27 2020-06-30 腾讯科技(北京)有限公司 Method and device for extracting content data
US11295074B2 (en) * 2016-12-21 2022-04-05 Open Text Corporation Systems and methods for conversion of web content into reusable templates and components
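CN106599280B (zh) * 2016-12-23 2019-11-22 北京奇虎科技有限公司 Method and device for determining path information of web page nodes
CN106951451B (zh) * 2017-02-22 2019-11-12 麒麟合盛网络技术股份有限公司 Web page content extraction method, device, and computing device
CN107038240B (zh) * 2017-04-20 2020-07-24 金电联行(北京)信息技术有限公司 Method for detecting web page list content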
US10447635B2 (en) 2017-05-17 2019-10-15 Slice Technologies, Inc. Filtering electronic messages
CN107463372B (zh) * 2017-07-07 2020-10-13 北京小米移动软件有限公司 Data-driven page update method and device
CN107463676B (zh) * 2017-08-04 2020-06-30 杭州安恒信息技术股份有限公司 Text data storage method and device
CN107729481B (zh) * 2017-10-16 2020-10-13 鼎富智能科技有限公司 Method and device for filtering text information extraction results using custom rules
CN107919129A (zh) * 2017-11-15 2018-04-17 百度在线网络技术(北京)有限公司 Method and device for controlling a page
US11803883B2 (en) 2018-01-29 2023-10-31 Nielsen Consumer Llc Quality assurance for labeled training data
US10922366B2 (en) * 2018-03-27 2021-02-16 International Business Machines Corporation Self-adaptive web crawling and text extraction
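CN108563729B (zh) * 2018-04-04 2022-04-01 福州大学 DOM-tree-based method for extracting winning-bid information from tendering websites
CN108694242B (zh) * 2018-05-14 2023-03-21 中国平安财产保险股份有限公司 DOM-based node lookup method, equipment, storage medium, and device
CN108920434B (zh) * 2018-06-06 2022-08-30 武汉酷犬数据科技有限公司 General-purpose method and system for extracting the main topic content of a web page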
US10846463B2 (en) * 2018-08-01 2020-11-24 Citrix Systems, Inc. Document object model (DOM) element location platform
CN109582886B (zh) * 2018-11-02 2022-05-10 北京字节跳动网络技术有限公司 Page content extraction method, template generation method and device, medium, and equipment
US11226727B2 (en) * 2018-11-12 2022-01-18 Citrix Systems, Inc. Systems and methods for live tiles for SaaS
US20200186623A1 (en) * 2018-12-11 2020-06-11 Microsoft Technology Licensing, Llc Performant retrieval and presentation of content
CN109657180B (zh) * 2018-12-11 2021-11-26 中科国力(镇江)智能技术有限公司 Intelligent system for automatic fuzzy extraction of web page content
CN110163654B (zh) * 2019-04-15 2021-09-17 上海趣蕴网络科技有限公司 Method and system for tracking advertisement delivery data
US11205041B2 (en) * 2019-08-15 2021-12-21 Anil Kumar Web element rediscovery system and method
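CN112530616B (zh) * 2019-09-18 2024-01-26 北京广利核系统工程有限公司 Judgment method and device for emergency operating procedures of a nuclear power plant
CN111090797B (zh) * 2019-11-29 2023-07-25 苏宁云计算有限公司 Data acquisition method, device, computer equipment, and storage medium
CN111241436A (zh) * 2019-12-31 2020-06-05 五八有限公司 Data request processing method, device, terminal device, and storage medium
CN111698364B (zh) * 2020-06-19 2021-09-21 深圳市小满科技有限公司 Contact information extraction method, related equipment, and computer-readable storage medium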
US11416381B2 (en) 2020-07-17 2022-08-16 Micro Focus Llc Supporting web components in a web testing environment
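CN112182319B (zh) * 2020-09-23 2024-03-26 中国建设银行股份有限公司 Web page similarity determination method, web page clustering method, device, and electronic equipment
CN112347332A (zh) * 2020-11-17 2021-02-09 南开大学 XPath-based method for locating crawler targets
CN112579957A (zh) * 2020-12-23 2021-03-30 中国电子信息产业集团有限公司第六研究所 Intelligent web page content parsing method based on image analysis
CN112765941A (zh) * 2021-01-21 2021-05-07 语联网(武汉)信息技术有限公司 Method and system for automatically extracting the main text of a web page
CN113515917A (zh) * 2021-04-19 2021-10-19 北京明略昭辉科技有限公司 File information management method, system, electronic equipment, and storage medium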
US11841909B2 (en) 2022-02-11 2023-12-12 International Business Machines Corporation Text analytics views for web site sources
CN114491325A (zh) * 2022-02-16 2022-05-13 平安科技(深圳)有限公司 Method and device for extracting web page data, computer equipment, and storage medium
US11960561B2 (en) * 2022-07-28 2024-04-16 Siteimprove A/S Client-side generation of lossless object model representations of dynamic webpages

Family Cites Families (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6654734B1 (en) * 2000-08-30 2003-11-25 International Business Machines Corporation System and method for query processing and optimization for XML repositories
US7086042B2 (en) * 2002-04-23 2006-08-01 International Business Machines Corporation Generating and utilizing robust XPath expressions
US7213200B2 (en) * 2002-04-23 2007-05-01 International Business Machines Corporation Selectable methods for generating robust XPath expressions
US7127467B2 (en) * 2002-05-10 2006-10-24 Oracle International Corporation Managing expressions in a database system
JP2005301437A (ja) * 2004-04-07 2005-10-27 Hitachi Ins Software Ltd Adaptive web page data extraction device and extraction program
US7584194B2 (en) 2004-11-22 2009-09-01 Truveo, Inc. Method and apparatus for an application crawler
GB0428365D0 (en) * 2004-12-24 2005-02-02 Ibm Methods and apparatus for generating a parser and parsing a document
US20070073592A1 (en) 2005-09-28 2007-03-29 Redcarpet, Inc. Method and system for network-based comparision shopping
US20080153467A1 (en) * 2006-03-01 2008-06-26 Eran Shmuel Wyler Methods and apparatus for enabling use of web content on various types of devices
CN101094194B (zh) 2006-06-19 2010-06-23 腾讯科技(深圳)有限公司 Method for extracting user-required web information from a web page
US9547648B2 (en) 2006-08-03 2017-01-17 Excalibur Ip, Llc Electronic document information extraction
TW200836075A (en) 2007-02-16 2008-09-01 Esobi Inc Method of converting hypertext markup language web page into pure text and system thereof
US8719291B2 (en) * 2007-04-24 2014-05-06 Lixto Software Gmbh Information extraction using spatial reasoning on the CSS2 visual box model
US20080320031A1 (en) * 2007-06-19 2008-12-25 C/O Canon Kabushiki Kaisha Method and device for analyzing an expression to evaluate
US7765236B2 (en) 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching
US8589366B1 (en) * 2007-11-01 2013-11-19 Google Inc. Data extraction using templates
US20090125529A1 (en) * 2007-11-12 2009-05-14 Vydiswaran V G Vinod Extracting information based on document structure and characteristics of attributes
US20090204889A1 (en) 2008-02-13 2009-08-13 Mehta Rupesh R Adaptive sampling of web pages for extraction
US9078095B2 (en) * 2008-03-14 2015-07-07 William J. Johnson System and method for location based inventory management
US20090248707A1 (en) * 2008-03-25 2009-10-01 Yahoo! Inc. Site-specific information-type detection methods and systems
US20100083095A1 (en) * 2008-09-29 2010-04-01 Nikovski Daniel N Method for Extracting Data from Web Pages
US20100169311A1 (en) * 2008-12-30 2010-07-01 Ashwin Tengli Approaches for the unsupervised creation of structural templates for electronic documents
CN102460432B (zh) 2009-06-30 2013-11-20 惠普开发有限公司 Selective content extraction
CN101944094B (zh) * 2009-07-06 2014-06-18 富士通株式会社 Web page information extraction method and device
US20110040770A1 (en) 2009-08-13 2011-02-17 Yahoo! Inc. Robust xpaths for web information extraction
US8667015B2 (en) 2009-11-25 2014-03-04 Hewlett-Packard Development Company, L.P. Data extraction method, computer program product and system
US9449114B2 (en) 2010-04-15 2016-09-20 Paypal, Inc. Removing non-substantive content from a web page by removing its text-sparse nodes and removing high-frequency sentences of its text-dense nodes using sentence hash value frequency across a web page collection
US8555155B2 (en) * 2010-06-04 2013-10-08 Apple Inc. Reader mode presentation of web content

Also Published As

Publication number Publication date
JP5944985B2 (ja) 2016-07-05
EP2721517A1 (en) 2014-04-23
US9053206B2 (en) 2015-06-09
EP2721517B1 (en) 2019-08-28
CN102831121B (zh) 2015-07-08
JP6141490B2 (ja) 2017-06-07
EP2721517A4 (en) 2015-04-22
US20150242527A1 (en) 2015-08-27
US9767211B2 (en) 2017-09-19
CN102831121A (zh) 2012-12-19
JP2014523016A (ja) 2014-09-08
TW201250492A (en) 2012-12-16
US20130014002A1 (en) 2013-01-10
JP2016154052A (ja) 2016-08-25
WO2012174137A1 (en) 2012-12-20

Similar Documents

Publication Publication Date Title
HK1173821A1 (en) A method and system for extracting information of a web page
EP2689353A4 (en) SYSTEM AND METHOD FOR MASKING DATA
EP2825978A4 (en) SYSTEM AND METHOD FOR PRODUCING A BINARY REPRESENTATION OF A WEB PAGE
EP2801069A4 (en) SYSTEM AND METHOD FOR LOCALLY DISPLAYING INFORMATION TO A SELECTED AREA
GB2492369B (en) Method and system for collecting traffic data
EP2745258A4 (en) SYSTEM AND METHOD FOR PROVIDING ADDITIONAL INFORMATION RELATING TO MULTIMEDIA CONTENT
EP2788885A4 (en) SYSTEM AND METHOD FOR PAGE SHARING BY A DEVICE
ZA201403001B (en) Contact information synchronization system and method
IL214360A0 (en) System and method for main page identification in web decoding
EP2684118A4 (en) METHOD AND SYSTEM FOR INFORMATION MODELING AND APPLICATIONS
EP2766870A4 (en) SYSTEM AND METHOD FOR PROVIDING INFORMATION CONCERNING CONTENT
EP2715647A4 (en) SYSTEM AND METHOD FOR MAINTAINING AND PRESERVING A PARTICULAR DATA BODY
EP2686788A4 (en) SYSTEM AND METHOD FOR SYNCHRONIZING MULTIMEDIA FILES
EP2753141A4 (en) DATA INTERACTION SYSTEM AND METHOD THEREFOR
HK1174162A1 (zh) Method and system for identifying web page visitors
EP2674867A4 (en) COMPUTER SYSTEM AND INFORMATION MANAGEMENT METHOD
HK1181580A1 (zh) Data synchronization method and system
HK1182787A1 (zh) Page monitoring method and system
GB201214748D0 (en) Method and apparatus for transferring data from a first domain to a second domain
HK1174708A1 (zh) System and method for creating a data structure
EP2724257A4 (en) SYSTEM AND METHOD FOR DOCUMENT FILTERING
EP2787452A4 (en) METHOD AND SYSTEM FOR SEARCHING INFORMATION
EP2777159A4 (fr) Method and system for encoding information
EP2743859A4 (en) INFORMATION MANAGEMENT SYSTEM AND INFORMATION MANAGEMENT PROCESS
EP2687996A4 (en) METHOD AND SYSTEM FOR REAGENCING WEB PAGE