CN104537128A - Webpage information extracting method and device - Google Patents

Webpage information extracting method and device

Info

Publication number
CN104537128A
Authority
CN
China
Prior art keywords
expression
info web
web
regular expression
arithmetic expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510049895.2A
Other languages
Chinese (zh)
Inventor
席鼎立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GLODON SOFTWARE Co Ltd
Original Assignee
GLODON SOFTWARE Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-01-30
Filing date
2015-01-30
Publication date
2015-04-22
Application filed by GLODON SOFTWARE Co Ltd
Priority to CN201510049895.2A
Publication of CN104537128A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a webpage information extraction method and device, relating to the field of Internet technology. The method comprises: acquiring the source code of a target webpage; acquiring a regular expression corresponding to an attribute of the webpage information to be extracted, together with an arithmetic expression for that regular expression; and extracting the webpage information from the source code of the target webpage according to the acquired regular expression and arithmetic expression. Extracting webpage information with this scheme can reduce redundant or erroneous information in the extracted result, yields relatively precise webpage information without secondary processing of the extracted information, and can improve the user experience.

Description

A webpage information extraction method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a webpage information extraction method and device.
Background
The Internet is a huge source for publishing and disseminating information. The number of webpages currently exceeds 80 billion and is growing rapidly every hour. These webpages may contain a large amount of information that users need. For users in the construction industry, for example, they may include lists of potential customers and their contact details, price lists for building materials, real-time construction project information, supply and demand information, and bidding and tender-award information.
In practice, to provide webpage information to users in a targeted way, information useful to the user is generally extracted from existing webpages and then supplied to the user. In the prior art, webpage information can be extracted by methods such as keyword matching. Specifically, when webpage information is extracted by keyword matching, the source code of the target webpage is searched for information that matches a preset keyword, and the matching information is extracted. This method can extract relevant information from the target webpage. However, a webpage contains a large amount of information, and several pieces of information may match the preset keyword. For example, suppose the target webpage contains both the email addresses of potential construction-industry customers and the email address of the webpage developer, and the preset keyword is one for extracting email addresses. Both kinds of email address are then extracted, although the developer's email address is clearly not information the user needs. As can be seen, information extracted with this method may contain redundant or erroneous information, which harms the user experience.
In addition, when the extracted information contains redundant information, that information must be removed by secondary processing before relatively accurate webpage information can be provided to the user, so information extraction is inefficient.
Summary of the invention
Embodiments of the invention disclose a webpage information extraction method and device, so as to provide users with relatively accurate webpage information and to improve information extraction efficiency and the user experience.
To achieve the above object, an embodiment of the invention discloses a webpage information extraction method, the method comprising:
obtaining the source code of a target webpage;
obtaining a regular expression corresponding to an attribute of the webpage information to be extracted, and an arithmetic expression for the regular expression;
extracting webpage information from the source code of the target webpage according to the obtained regular expression and arithmetic expression.
In a specific implementation of the invention, the webpage information extraction method further comprises:
classifying and summarizing the extracted webpage information, and presenting it to the user in classified form.
In a specific implementation of the invention, extracting webpage information from the source code of the target webpage according to the obtained regular expression and arithmetic expression comprises:
determining, according to the obtained regular expression and arithmetic expression, the reverse Polish expression corresponding to them;
extracting webpage information from the source code of the target webpage according to the determined reverse Polish expression.
In a specific implementation of the invention, obtaining the regular expression corresponding to the attribute of the webpage information to be extracted and the arithmetic expression for the regular expression comprises:
obtaining, according to information input by the user, the regular expression corresponding to the attribute of the webpage information to be extracted and the arithmetic expression for the regular expression; or
obtaining, according to a preset expression generation rule, the regular expression corresponding to the attribute of the webpage information to be extracted, and obtaining, according to a preset arithmetic expression generation rule, the arithmetic expression for the regular expression.
In a specific implementation of the invention, the operators used in the arithmetic expression are predefined symbols.
To achieve the above object, an embodiment of the invention discloses a webpage information extraction device, the device comprising:
a source code obtaining module, configured to obtain the source code of a target webpage;
an expression obtaining module, configured to obtain a regular expression corresponding to an attribute of the webpage information to be extracted, and an arithmetic expression for the regular expression;
a webpage information extraction module, configured to extract webpage information from the source code of the target webpage according to the obtained regular expression and arithmetic expression.
In a specific implementation of the invention, the webpage information extraction device further comprises:
a classification and summarization module, configured to classify and summarize the extracted webpage information and present it to the user in classified form.
In a specific implementation of the invention, the webpage information extraction module comprises:
a reverse Polish expression determination submodule, configured to determine, according to the obtained regular expression and arithmetic expression, the reverse Polish expression corresponding to them;
a webpage information extraction submodule, configured to extract webpage information from the source code of the target webpage according to the determined reverse Polish expression.
In a specific implementation of the invention, the expression obtaining module is specifically configured to obtain, according to information input by the user, the regular expression corresponding to the attribute of the webpage information to be extracted and the arithmetic expression for the regular expression; or
is specifically configured to obtain, according to a preset expression generation rule, the regular expression corresponding to the attribute of the webpage information to be extracted, and to obtain, according to a preset arithmetic expression generation rule, the arithmetic expression for the regular expression.
In a specific implementation of the invention, the operators used in the arithmetic expression are predefined symbols.
As can be seen from the above, in the scheme provided by the embodiments of the invention, after the source code of the target webpage is obtained, a regular expression corresponding to an attribute of the webpage information to be extracted and an arithmetic expression for that regular expression are obtained, and webpage information is extracted from the source code of the target webpage according to the obtained regular expression and arithmetic expression. Compared with the prior art, the arithmetic expression expresses the logical and arithmetical operation relations between at least two of the obtained regular expressions, so that, once the arithmetic expression has been converted, extracting the webpage information is equivalent to filtering it directly during extraction. The redundant or erroneous information contained in the extracted webpage information can therefore be reduced, relatively accurate webpage information can be obtained without secondary processing, and the user experience can be improved.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a webpage information extraction method provided by an embodiment of the invention;
Fig. 2 is a schematic flowchart of another webpage information extraction method provided by an embodiment of the invention;
Fig. 3 is a schematic structural diagram of a webpage information extraction device provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of another webpage information extraction device provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a schematic flowchart of a webpage information extraction method provided by an embodiment of the invention. The method comprises:
S101: obtain the source code of the target webpage.
The source code of a webpage is generally written in HyperText Markup Language (HTML) and consists of HTML commands (tags), which can describe text, graphics, animation, sound, tables, links and the like.
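Step S101 does not prescribe how the source code is retrieved. As a minimal sketch only, assuming the target webpage is reachable by URL and that Python's standard `urllib.request` module is an acceptable client, the step could look as follows. The function name `fetch_page_source` and the UTF-8 fallback are assumptions made for illustration and do not come from the patent.

```python
import urllib.request


def fetch_page_source(url: str, timeout: float = 10.0) -> str:
    """Return the source code of the page at `url` as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

The returned string is the input on which the regular expressions of the later steps operate.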
S102: obtain a regular expression corresponding to an attribute of the webpage information to be extracted, and an arithmetic expression for the regular expression.
A regular expression, also known as a regular representation or conventional representation, is a concept from computer science. A regular expression uses a single string to describe and match a set of strings that conform to a certain syntactic rule. Many text editors use regular expressions to search for and replace text that matches a given pattern. For example, "^\d+$" matches all non-negative integers.
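The pattern mentioned above can be tried with Python's standard `re` module. The following is a small sketch: the first part checks the non-negative integer pattern from the text, and the second part applies a regular expression to a made-up fragment of page source. The sample markup, the price pattern and all names are hypothetical and serve only as illustration.

```python
import re

# The pattern quoted in the description: one or more digits, nothing else.
NON_NEGATIVE_INTEGER = re.compile(r"^\d+$")


def is_non_negative_integer(text: str) -> bool:
    """Return True when `text` consists only of digits."""
    return NON_NEGATIVE_INTEGER.match(text) is not None


# Hypothetical fragment of webpage source code and a pattern for one attribute.
sample_source = '<td class="price">Cement: 320 yuan per tonne</td>'
price_pattern = re.compile(r"(\d+) yuan")

found = price_pattern.search(sample_source)
extracted_price = found.group(1) if found else ""
```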
The arithmetic expression expresses relations, such as logical operation relations and arithmetical operation relations, between at least two of the obtained regular expressions. The operators needed to write the arithmetic expression may be user-defined symbols or existing operators; this application does not limit this.
The operators may be arithmetic operators, logical operators, relational operators and so on.
Specifically, the regular expression corresponding to the attribute of the webpage information to be extracted and the arithmetic expression for the regular expression may be obtained according to information input by the user. Alternatively, the regular expression may be obtained according to a preset expression generation rule, and the arithmetic expression for the regular expression may be obtained according to a preset arithmetic expression generation rule.
S103: extract webpage information from the source code of the target webpage according to the obtained regular expression and arithmetic expression.
In a specific implementation, the reverse Polish expression corresponding to the obtained regular expression and arithmetic expression is determined first, and webpage information is then extracted from the source code of the target webpage according to the determined reverse Polish expression.
In an ordinary expression, a binary operator is always placed between its two operands, so this notation is called infix notation. Another notation derives from the work of the Polish logician J. Łukasiewicz, who introduced the operator-first prefix ("Polish") notation in the 1920s. In its reversed form every operator is placed after its operands; this is called postfix notation, and an expression written in postfix notation is called a reverse Polish expression. Reverse Polish expressions are very useful because they convert a complex expression into one that can be evaluated with simple operations. For example, the ordinary expression (a+b)*(c+d) is converted to the reverse Polish expression ab+cd+*.
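The conversion in the example above can be reproduced with the well-known shunting-yard algorithm. The sketch below is an illustration rather than the conversion procedure of the patent, which is not spelled out in the text. It assumes single-character operands, the four usual arithmetic operators with their conventional precedence, and left associativity.

```python
# Assumed operator precedence: higher binds tighter.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def to_reverse_polish(infix: str) -> str:
    """Convert an infix expression with single-character operands to postfix."""
    output = []  # operands and operators in postfix order
    stack = []   # pending operators and opening parentheses
    for token in infix.replace(" ", ""):
        if token.isalnum():
            output.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            # Emit everything back to the matching opening parenthesis.
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()
        elif token in PRECEDENCE:
            # Emit operators that bind at least as tightly (left associative).
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[token]):
                output.append(stack.pop())
            stack.append(token)
        else:
            raise ValueError(f"unexpected character: {token!r}")
    while stack:
        operator = stack.pop()
        if operator == "(":
            raise ValueError("unbalanced parentheses")
        output.append(operator)
    return "".join(output)


converted = to_reverse_polish("(a+b)*(c+d)")
```

Because the parentheses are resolved during conversion, the postfix form can afterwards be evaluated left to right with a single stack.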
The arithmetic expression expresses logical and arithmetical operation relations between at least two of the obtained regular expressions. The relation between the arithmetic expression and the regular expressions is illustrated below with several concrete examples.
Example one: the arithmetic expression represents a union operation on the results of several regular expressions.
Suppose the webpage information obtained from the source code of the target webpage by regular expression 1 is "5", and the webpage information obtained by regular expression 2 is "4". After the union operation specified by the arithmetic expression in this example, the result is "9".
The union operator in this example can be written as "+"; other user-defined symbols can of course also be used.
Example two: the arithmetic expression represents an OR operation on the results of several regular expressions.
Suppose the webpage information obtained from the source code of the target webpage by regular expression 3 is "abc", and the webpage information obtained by regular expression 4 is "cde". After the OR operation specified by the arithmetic expression in this example, the result is "abc" or "cde". In other words, when no webpage information can be obtained with regular expression 3, the webpage information is obtained with regular expression 4.
The OR operator in this example can be written as "|"; other user-defined symbols can of course also be used.
Example three: the arithmetic expression represents an AND operation on the results of several regular expressions.
Applying the AND operation of this example to the results of the two regular expressions in example two gives "abccde".
The AND operator in this example can be written as "&"; other user-defined symbols can of course also be used.
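Examples one to three can be sketched as three small functions on the strings returned by the regular expressions. This is only one reading of the examples: it assumes that the "+" operation adds numeric results (5 and 4 give 9), that "|" falls back to the second result when the first is empty, and that "&" concatenates the two results. The sample text, the patterns and all function names are hypothetical.

```python
import re


def first_match(pattern: str, source: str) -> str:
    """Return the first match of `pattern` in `source`, or "" when none exists."""
    found = re.search(pattern, source)
    return found.group(0) if found else ""


def add_results(left: str, right: str) -> str:
    """Example one ("+"): assumed to be numeric addition of the two results."""
    return str(int(left) + int(right))


def or_results(left: str, right: str) -> str:
    """Example two ("|"): use the right result only when the left one is empty."""
    return left if left else right


def and_results(left: str, right: str) -> str:
    """Example three ("&"): assumed to concatenate the two results."""
    return left + right


# Hypothetical page text standing in for the source code of a target webpage.
source = "floors: 5, basements: 4"
result_1 = first_match(r"(?<=floors: )\d+", source)     # plays regular expression 1
result_2 = first_match(r"(?<=basements: )\d+", source)  # plays regular expression 2
total = add_results(result_1, result_2)
```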
Example four: the arithmetic expression represents a "progressive" operation on the results of several regular expressions.
This operation executes further regular expressions on the result of one regular expression. Progressive operations include the "left progressive" operation and the "right progressive" operation.
The "left progressive" and "right progressive" operations in this example can be written as ">" and "<" respectively; other user-defined symbols can of course also be used.
It should be noted that the relations between the arithmetic expression and the regular expressions are not limited to those described above, and the various relations can be nested within one another.
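How nested relations might be evaluated through a reverse Polish expression can be sketched as follows. This is an illustration under several assumptions that the text does not fix: the tokens of the arithmetic expression are separated by spaces, the operands are names that refer to regular expressions, the operator precedence is the one given in `PRECEDENCE`, and "a > b" means that b is executed on the result of a (with "<" as its mirror image). All names and the sample page are hypothetical, and well-formed input is assumed.

```python
import re
from typing import Callable, Dict, List

Extractor = Callable[[str], str]

# Assumed precedence: progressive operators bind tightest, "|" loosest.
PRECEDENCE = {"|": 1, "+": 2, "&": 2, ">": 3, "<": 3}


def to_rpn(tokens: List[str]) -> List[str]:
    """Shunting-yard conversion of infix tokens to reverse Polish order."""
    output, stack = [], []
    for token in tokens:
        if token == "(":
            stack.append(token)
        elif token == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()
        elif token in PRECEDENCE:
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[token]):
                output.append(stack.pop())
            stack.append(token)
        else:
            output.append(token)  # operand: the name of a regular expression
    while stack:
        output.append(stack.pop())
    return output


def regex_extractor(pattern: str) -> Extractor:
    """Wrap a pattern as a function returning group 1 (or the whole match)."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(text: str) -> str:
        found = compiled.search(text)
        if not found:
            return ""
        return found.group(1) if compiled.groups else found.group(0)

    return extract


def combine(op: str, left: Extractor, right: Extractor) -> Extractor:
    """Build the extractor for `left op right` under the assumed semantics."""
    if op == "+":
        return lambda text: str(int(left(text)) + int(right(text)))
    if op == "|":
        return lambda text: left(text) or right(text)
    if op == "&":
        return lambda text: left(text) + right(text)
    if op == ">":
        return lambda text: right(left(text))  # run right on left's result
    if op == "<":
        return lambda text: left(right(text))  # run left on right's result
    raise ValueError(f"unknown operator: {op}")


def build_extractor(expression: str, patterns: Dict[str, str]) -> Extractor:
    """Evaluate the reverse Polish form with a stack of extractor functions."""
    stack: List[Extractor] = []
    for token in to_rpn(expression.split()):
        if token in PRECEDENCE:
            right = stack.pop()
            left = stack.pop()
            stack.append(combine(token, left, right))
        else:
            stack.append(regex_extractor(patterns[token]))
    if len(stack) != 1:
        raise ValueError("malformed arithmetic expression")
    return stack[0]


page = (
    '<div id="footer">Webmaster: <a href="mailto:dev@example.com">mail</a></div>'
    '<div id="customer">Sales: <a href="mailto:buyer@example.com">mail</a></div>'
)
patterns = {
    "customer": r'<div id="customer">(.*?)</div>',
    "email": r'mailto:([^"]+)',
    "phone": r'tel:([^"]+)',
}
plain = build_extractor("email", patterns)(page)
filtered = build_extractor("customer > ( phone | email )", patterns)(page)
```

In this sketch the plain email pattern returns the webmaster's address because it appears first in the page, whereas the nested expression restricts the search to the customer block before the address is extracted, which mirrors the filtering effect described for the scheme.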
As can be seen from the above, in the scheme provided by this embodiment, after the source code of the target webpage is obtained, a regular expression corresponding to an attribute of the webpage information to be extracted and an arithmetic expression for that regular expression are obtained, and webpage information is extracted from the source code of the target webpage according to the obtained regular expression and arithmetic expression. Compared with the prior art, the arithmetic expression expresses the logical and arithmetical operation relations between at least two of the obtained regular expressions, so that, once the arithmetic expression has been converted, extracting the webpage information is equivalent to filtering it directly during extraction. The redundant or erroneous information contained in the extracted webpage information can therefore be reduced, relatively accurate webpage information can be obtained without secondary processing, and the user experience can be improved.
In a specific embodiment of the invention, referring to Fig. 2, a schematic flowchart of another webpage information extraction method is provided. Compared with the previous embodiment, the webpage information extraction method in this embodiment further comprises:
S104: classify and summarize the extracted webpage information, and present it to the user in classified form.
During classification and summarization, the information of one customer can be treated as one class, for example the email address, telephone number, mailing address and building material requirements of the same customer. Information of the same kind can also be treated as one class, for example all telephone numbers or all mailing addresses. Other classification methods are of course possible and are not enumerated here.
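The two classification methods named above can be sketched with plain dictionaries. The record layout (customer, attribute, value), the function names and the sample data are assumptions made for illustration; the patent does not specify a data structure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (customer, attribute, value).
Record = Tuple[str, str, str]


def group_by_customer(records: List[Record]) -> Dict[str, Dict[str, str]]:
    """Treat all information of one customer as one class."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for customer, attribute, value in records:
        grouped[customer][attribute] = value
    return dict(grouped)


def group_by_attribute(records: List[Record]) -> Dict[str, List[str]]:
    """Treat all information of the same kind as one class."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for _customer, attribute, value in records:
        grouped[attribute].append(value)
    return dict(grouped)


records = [
    ("Customer A", "email", "a@example.com"),
    ("Customer A", "phone", "010-0000-0001"),
    ("Customer B", "email", "b@example.com"),
    ("Customer B", "phone", "010-0000-0002"),
]
by_customer = group_by_customer(records)
by_attribute = group_by_attribute(records)
```

Either dictionary can then be rendered as the classified view that is shown to the user.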
As can be seen from the above, in the scheme provided by this embodiment, the extracted webpage information is classified and summarized and then presented to the user in classified form, so that the user can directly find the required information by category. This is convenient for the user and can further improve the user experience.
Corresponding to the above webpage information extraction method, an embodiment of the invention further provides a webpage information extraction device.
Fig. 3 is a schematic structural diagram of a webpage information extraction device provided by an embodiment of the invention. The device comprises a source code obtaining module 301, an expression obtaining module 302 and a webpage information extraction module 303.
The source code obtaining module 301 is configured to obtain the source code of a target webpage;
the expression obtaining module 302 is configured to obtain a regular expression corresponding to an attribute of the webpage information to be extracted, and an arithmetic expression for the regular expression;
the webpage information extraction module 303 is configured to extract webpage information from the source code of the target webpage according to the obtained regular expression and arithmetic expression.
Specifically, the webpage information extraction module 303 may comprise a reverse Polish expression determination submodule and a webpage information extraction submodule (not shown in the figure).
The reverse Polish expression determination submodule is configured to determine, according to the obtained regular expression and arithmetic expression, the reverse Polish expression corresponding to them;
the webpage information extraction submodule is configured to extract webpage information from the source code of the target webpage according to the determined reverse Polish expression.
Optionally, the expression obtaining module 302 may be specifically configured to obtain, according to information input by the user, the regular expression corresponding to the attribute of the webpage information to be extracted and the arithmetic expression for the regular expression; or
may be specifically configured to obtain, according to a preset expression generation rule, the regular expression corresponding to the attribute of the webpage information to be extracted, and to obtain, according to a preset arithmetic expression generation rule, the arithmetic expression for the regular expression.
Optionally, the operators used in the arithmetic expression are predefined symbols.
As can be seen from the above, in the scheme provided by this embodiment, after the source code of the target webpage is obtained, a regular expression corresponding to an attribute of the webpage information to be extracted and an arithmetic expression for that regular expression are obtained, and webpage information is extracted from the source code of the target webpage according to the obtained regular expression and arithmetic expression. Compared with the prior art, the arithmetic expression expresses the logical and arithmetical operation relations between at least two of the obtained regular expressions, so that, once the arithmetic expression has been converted, extracting the webpage information is equivalent to filtering it directly during extraction. The redundant or erroneous information contained in the extracted webpage information can therefore be reduced, relatively accurate webpage information can be obtained without secondary processing, and the user experience can be improved.
In a specific embodiment of the invention, referring to Fig. 4, a schematic structural diagram of another webpage information extraction device is provided. Compared with the previous embodiment, the webpage information extraction device in this embodiment further comprises a classification and summarization module 304.
The classification and summarization module 304 is configured to classify and summarize the extracted webpage information and present it to the user in classified form.
As can be seen from the above, in the scheme provided by this embodiment, the extracted webpage information is classified and summarized and then presented to the user in classified form, so that the user can directly find the required information by category. This is convenient for the user and can further improve the user experience.
Because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relation or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware instructed by a program, and that the program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc.
The above are merely preferred embodiments of the invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention falls within the scope of protection of the invention.

Claims (10)

1. A webpage information extraction method, characterized in that the method comprises:
obtaining the source code of a target webpage;
obtaining a regular expression corresponding to an attribute of the webpage information to be extracted, and an arithmetic expression for the regular expression;
extracting webpage information from the source code of the target webpage according to the obtained regular expression and arithmetic expression.
2. The method according to claim 1, characterized in that the method further comprises:
classifying and summarizing the extracted webpage information, and presenting it to the user in classified form.
3. The method according to claim 1, characterized in that extracting webpage information from the source code of the target webpage according to the obtained regular expression and arithmetic expression comprises:
determining, according to the obtained regular expression and arithmetic expression, the reverse Polish expression corresponding to them;
extracting webpage information from the source code of the target webpage according to the determined reverse Polish expression.
4. The method according to any one of claims 1-3, characterized in that obtaining the regular expression corresponding to the attribute of the webpage information to be extracted and the arithmetic expression for the regular expression comprises:
obtaining, according to information input by the user, the regular expression corresponding to the attribute of the webpage information to be extracted and the arithmetic expression for the regular expression; or
obtaining, according to a preset expression generation rule, the regular expression corresponding to the attribute of the webpage information to be extracted, and obtaining, according to a preset arithmetic expression generation rule, the arithmetic expression for the regular expression.
5. The method according to any one of claims 1-3, characterized in that the operators used in the arithmetic expression are predefined symbols.
6. A webpage information extraction device, characterized in that the device comprises:
a source code obtaining module, configured to obtain the source code of a target webpage;
an expression obtaining module, configured to obtain a regular expression corresponding to an attribute of the webpage information to be extracted, and an arithmetic expression for the regular expression;
a webpage information extraction module, configured to extract webpage information from the source code of the target webpage according to the obtained regular expression and arithmetic expression.
7. The device according to claim 6, characterized in that the device further comprises:
a classification and summarization module, configured to classify and summarize the extracted webpage information and present it to the user in classified form.
8. The device according to claim 6, characterized in that the webpage information extraction module comprises:
a reverse Polish expression determination submodule, configured to determine, according to the obtained regular expression and arithmetic expression, the reverse Polish expression corresponding to them;
a webpage information extraction submodule, configured to extract webpage information from the source code of the target webpage according to the determined reverse Polish expression.
9. The device according to any one of claims 6-8, characterized in that the expression obtaining module is specifically configured to obtain, according to information input by the user, the regular expression corresponding to the attribute of the webpage information to be extracted and the arithmetic expression for the regular expression; or
is specifically configured to obtain, according to a preset expression generation rule, the regular expression corresponding to the attribute of the webpage information to be extracted, and to obtain, according to a preset arithmetic expression generation rule, the arithmetic expression for the regular expression.
10. The device according to any one of claims 6-8, characterized in that the operators used in the arithmetic expression are predefined symbols.
CN201510049895.2A 2015-01-30 2015-01-30 Webpage information extracting method and device Pending CN104537128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510049895.2A CN104537128A (en) 2015-01-30 2015-01-30 Webpage information extracting method and device

Publications (1)

Publication Number Publication Date
CN104537128A (en) 2015-04-22

Family

ID=52852655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510049895.2A Pending CN104537128A (en) 2015-01-30 2015-01-30 Webpage information extracting method and device

Country Status (1)

Country Link
CN (1) CN104537128A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101094135A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and system for extracting information of content in Internet
CN101582075A (en) * 2009-06-24 2009-11-18 大连海事大学 Web information extraction system
US20120110003A1 (en) * 2010-11-03 2012-05-03 Microsoft Corporation Conditional execution of regular expressions
CN102693303A (en) * 2012-05-18 2012-09-26 上海极值信息技术有限公司 Method and device for searching formulation data
CN103259793A (en) * 2013-05-02 2013-08-21 东北大学 Method for inspecting deep packets based on suffix automaton regular engine structure

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111698364A (en) * 2020-06-19 2020-09-22 深圳市小满科技有限公司 Contact person information extraction method and related equipment
CN111782907A (en) * 2020-07-01 2020-10-16 北京知因智慧科技有限公司 News classification method and device and electronic equipment
CN111782907B (en) * 2020-07-01 2024-03-01 北京知因智慧科技有限公司 News classification method and device and electronic equipment
CN111966881A (en) * 2020-10-14 2020-11-20 成都数联铭品科技有限公司 Webpage information extraction method and system and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150422
