CN111737623A - Webpage information extraction method and related equipment - Google Patents


Info

Publication number
CN111737623A
Authority
CN
China
Prior art keywords
information
target
webpage
web page
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010569085.0A
Other languages
Chinese (zh)
Inventor
刘镇熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xiaoman Technology Co ltd
Original Assignee
Shenzhen Xiaoman Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xiaoman Technology Co ltd filed Critical Shenzhen Xiaoman Technology Co ltd
Priority to CN202010569085.0A priority Critical patent/CN111737623A/en
Publication of CN111737623A publication Critical patent/CN111737623A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a webpage information extraction method, a webpage information extraction device, computer equipment and a computer-readable storage medium. The method comprises: acquiring a target webpage; converting the target webpage into a target tree; acquiring text content from the target tree; performing word segmentation on the text content to obtain a plurality of character strings; determining the representation information, position information and tree structure information of each character string and recording them as the characteristic information of that character string; and combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage. The invention improves the accuracy of webpage information extraction.

Description

Webpage information extraction method and related equipment
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for extracting webpage information, computer equipment and a computer readable storage medium.
Background
Generally, before analyzing the content of a web page, information extraction needs to be performed on the web page. At present, when information of a webpage is extracted, only text information of the webpage is extracted, and structural information of a text in the webpage is not extracted. The structural information of the text in the web page is likely to have an important role in analyzing the content of the web page. Extracting inaccurate web page features will affect the analysis of web page content.
How to improve the accuracy of webpage information extraction becomes a problem to be solved.
Disclosure of Invention
In view of the foregoing, there is a need for a webpage information extraction method, a webpage information extraction device, a computer device and a computer-readable storage medium that can improve the accuracy of webpage information extraction.
A first aspect of the present application provides a method for extracting web page information, where the method for extracting web page information includes:
acquiring a target webpage;
converting the target webpage into a target tree;
acquiring text content from the target tree;
performing word segmentation on the text content to obtain a plurality of character strings;
determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information;
and combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.
In another possible implementation manner, the acquiring the target webpage includes:
acquiring a webpage address of the target webpage;
and downloading the HTML text corresponding to the webpage address by utilizing a webpage crawling tool.
In another possible implementation manner, the converting the target webpage into the target tree includes:
and converting the target webpage into the target tree by a depth-first traversal method by utilizing a webpage analysis tool.
In another possible implementation manner, before the converting the target webpage into the target tree, the webpage information extracting method further includes:
and carrying out standardization processing on the target webpage.
In another possible implementation manner, before the obtaining the text content from the target tree, the method for extracting the web page information further includes:
and carrying out pruning pretreatment on the target tree.
In another possible implementation manner, the text content includes Chinese text, and the segmenting the text content includes:
performing word segmentation on the Chinese text by using a natural language processing algorithm; or
training a BERT-based word segmentation model, and segmenting the Chinese text with the trained word segmentation model.
In another possible implementation manner, the determining the feature information of each character string includes:
inquiring the coding information of each character string from a preset coding table, and determining the coding information of each character string as the representation information of the character string;
numbering the character strings in sequence according to the positions of the character strings in the text content, and determining the number of each character string as the position information of each character string;
and acquiring nodes of the character strings in the target tree and paths from the nodes to root nodes of the target tree, and determining the nodes and paths of the character strings as tree structure information of the character strings.
A second aspect of the present application provides a web page information extraction apparatus, including:
the first acquisition module is used for acquiring a target webpage;
the conversion module is used for converting the target webpage into a target tree;
the second acquisition module is used for acquiring text contents from the target tree;
the word segmentation module is used for segmenting words of the text content to obtain a plurality of character strings;
the determining module is used for determining characteristic information of each character string, and the characteristic information comprises representation information, position information and tree structure information;
and the combination module is used for combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.
A third aspect of the present application provides a computer device comprising a processor for implementing the web page information extraction method when executing a computer program stored in a memory.
A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the web page information extraction method.
The method includes the steps of obtaining a target webpage; converting the target webpage into a target tree; acquiring text content from the target tree; performing word segmentation on the text content to obtain a plurality of character strings; determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information; and combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage. The invention improves the accuracy of extracting the webpage information.
Drawings
Fig. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention.
Fig. 2 is a block diagram of a web page information extraction device according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The described embodiments are merely some, rather than all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Preferably, the webpage information extraction method is applied to one or more computer devices. The computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
Example one
Fig. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention. The webpage information extraction method is applied to computer equipment and used for extracting webpage information.
As shown in fig. 1, the web page information extraction method includes:
101, acquiring a target webpage.
In an embodiment, the acquiring the target webpage includes:
acquiring a webpage address of the target webpage;
and downloading the HTML text corresponding to the webpage address by utilizing a webpage crawling tool.
For example, a URL (Uniform Resource Locator) input by a user is received, and the HTML (HyperText Markup Language) text of the target webpage is downloaded from that URL by using a webpage crawling tool.
For another example, the URL address is read from the preset address table, and the HTML text of the target webpage is crawled or downloaded according to the URL address.
In another embodiment, the target webpage is stored in a preset database, and the target webpage is directly read from the preset database.
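The acquisition step can be illustrated with a minimal sketch. The source names no specific crawling tool, so Python's standard urllib is used here as a stand-in; the function name fetch_html, the timeout value and the UTF-8 fallback are assumptions made for illustration, and a data: URL replaces a real webpage address so that the sketch runs without network access.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the document at `url` and return its HTML text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Assumption: fall back to UTF-8 when the response declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL stands in for a real webpage address so the sketch runs offline.
sample_url = "data:text/html;charset=utf-8," + quote(
    "<html><body><p>hello</p></body></html>"
)
html_text = fetch_html(sample_url)
```

In practice the address would come from user input or from the preset address table described above.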
102, converting the target webpage into a target tree.
In a specific embodiment, the converting the target web page into the target tree includes:
and converting the target webpage into the target tree by a depth-first traversal method by utilizing a webpage analysis tool.
Specifically, the webpage parsing tools include the Python-based lxml tool, the Java-based jsoup tool, and the like. The target tree is a DOM (Document Object Model) tree.
In a specific embodiment, before the converting the target webpage into the target tree, the webpage information extracting method further includes:
and carrying out standardization processing on the target webpage.
Specifically, the normalization processing mainly removes invisible information such as scripts, comments and style tags. This information is not rendered by a browser and is therefore not visible to the user, so it is removed first to avoid interfering with subsequent processing.
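The normalization and conversion steps can be sketched together as follows. This is only an illustration under stated assumptions: the standard-library html.parser module is swapped in for lxml or jsoup so the example has no third-party dependency, and the names Node, TreeBuilder, INVISIBLE_TAGS, VOID_TAGS and preorder_tags are hypothetical. The parser visits elements in document order, which corresponds to a depth-first (pre-order) construction of the tree.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

INVISIBLE_TAGS = {"script", "style"}  # assumed set of invisible elements
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements without end tags


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree while dropping invisible content."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("document")
        self.current = self.root
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent  # tolerate unclosed inner elements
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.text = (self.current.text + " " + data.strip()).strip()

    # handle_comment is not overridden, so comments are silently discarded.


def preorder_tags(node: Node) -> List[str]:
    """Return tag names in depth-first (pre-order) order."""
    return [node.tag] + [t for c in node.children for t in preorder_tags(c)]


html = """<html><head><style>p {color: red}</style>
<script>var x = "<p>not text</p>";</script></head>
<body><!-- a comment --><h1>Title</h1><p>Hello <a href="#">link</a></p></body></html>"""

builder = TreeBuilder()
builder.feed(html)
builder.close()
tree = builder.root
```

Calling preorder_tags(tree) lists the remaining nodes; the script, style and comment content does not appear in the resulting tree.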
103, acquiring text content from the target tree.
In a specific embodiment, before the obtaining the text content from the target tree, the method for extracting the web page information further includes:
and carrying out pruning pretreatment on the target tree.
Specifically, the attribute label of each node in the target tree and a preset first label are obtained; whether the attribute label of each node is consistent with the preset first label is judged; and each node whose attribute label is consistent with the preset first label is determined to be an irrelevant node, and the irrelevant node is deleted.
Alternatively, the attribute label of the irrelevant node may be replaced with a preset second label, where the preset second label comprises the content to be acquired.
Pruning the target tree removes or replaces extraneous content that is not needed, such as the protocol type and the encoding type.
In a specific embodiment, the obtaining the text content from the target tree includes:
and traversing all nodes in the target tree by a recursive method, and reading the text content in each node.
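The pruning and recursive text extraction described above can be sketched as follows. The sketch assumes that the "attribute label" of a node is its tag name and that the preset first labels are given as a set of tag names; the Node class and the functions prune and collect_text are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, preset_first_labels: Set[str]) -> None:
    """Delete, in place, every descendant whose tag matches a preset first label."""
    node.children = [c for c in node.children if c.tag not in preset_first_labels]
    for child in node.children:
        prune(child, preset_first_labels)


def collect_text(node: Node) -> List[str]:
    """Recursively visit every node and read its text content in document order."""
    texts = [node.text] if node.text else []
    for child in node.children:
        texts.extend(collect_text(child))
    return texts


# Hand-built tree standing in for the target tree of a parsed webpage.
tree = Node("html", children=[
    Node("head", children=[
        Node("meta", text="charset=utf-8"),  # encoding information to be pruned
        Node("title", text="Demo"),
    ]),
    Node("body", children=[
        Node("h1", text="Title"),
        Node("p", text="Hello world"),
    ]),
])

prune(tree, {"meta"})
text_content = collect_text(tree)
```

Replacing a node's label with a preset second label instead of deleting it would follow the same traversal, with an assignment to the tag in place of the filter.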
104, performing word segmentation on the text content to obtain a plurality of character strings.
In a specific embodiment, the text content includes English text, and the segmenting the text content includes:
performing word segmentation on the English text by using whitespace as the delimiter to obtain the plurality of character strings.
In another embodiment, the text content comprises Chinese text, and the segmenting the text content comprises:
performing word segmentation on the Chinese text by using a natural language processing algorithm; or
training a BERT-based word segmentation model, and segmenting the Chinese text with the trained word segmentation model.
The word segmentation model may treat a Chinese sentence as a linear input sequence of single characters and perform sequence labeling on it, that is, assign a tag to each character. The tag set may include {B, M, E, O}, where B indicates that the character begins a word, M indicates that the character is in the middle of a word, E indicates that the character ends a word, and O indicates a word consisting of a single character.
Each character string may include one or more characters.
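The two segmentation cases can be sketched as follows. The tag sequence would normally be predicted by the trained word segmentation model; here it is hard-coded, and the function names segment_english and decode_tags are assumptions made for illustration. The sketch reads the tags as marking positions within a word (B for beginning, M for middle, E for end, O for a single-character word).

```python
from typing import List


def segment_english(text: str) -> List[str]:
    """Split English text on whitespace."""
    return text.split()


def decode_tags(chars: str, tags: List[str]) -> List[str]:
    """Turn a per-character B/M/E/O labeling into a list of character strings."""
    if len(chars) != len(tags):
        raise ValueError("each character needs exactly one tag")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "O":          # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":        # first character of a multi-character word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":        # middle character
            current += ch
        elif tag == "E":        # last character closes the word
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:                 # flush a word left open by a malformed labeling
        words.append(current)
    return words


english_strings = segment_english("web page information extraction")
# Hard-coded tags stand in for the output of a trained model.
chinese_strings = decode_tags("我爱北京天安门", ["O", "O", "B", "E", "B", "M", "E"])
```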
105, determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information.
In a specific embodiment, the determining the feature information of each character string includes:
querying the coding information of each character string from a preset coding table, and determining the coding information of the character string as the representation information of the character string;
numbering the character strings according to the positions of the character strings in the text content, and determining the numbers of the character strings as the position information of the character strings;
and acquiring nodes to which the character strings belong in the target tree and paths from each node to a root node of the target tree, and determining the nodes and paths to which the character strings belong as tree structure information of the character strings.
The position information of each character string may also include the paragraph to which the character string belongs and its place in the text structure (for example, whether the character string belongs to a primary title, a secondary title or the body).
In another embodiment, the characteristic information of the character string further includes tag information, boundary information and hyperlink information of the character string. The tag information includes the tag to which the character string belongs (such as an h tag or a p tag). The boundary information indicates whether the character string is located at the head or the tail of its paragraph: a first distance between the character string and the head of its paragraph and a second distance between the character string and the tail of its paragraph are obtained; when the first distance is smaller than a preset first distance, the character string is located at the paragraph head; and when the second distance is smaller than a preset second distance, the character string is located at the paragraph tail. The hyperlink information indicates whether the character string is a hyperlink: when the tag of the character string is an "<a>" tag, the character string is a hyperlink; otherwise, the character string is not a hyperlink.
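The determination of characteristic information can be sketched as follows. Several details are assumptions made for illustration and are not specified in the source: the coding table is a plain dictionary, unknown strings receive the code 0, the tree structure information is stored as the list of tag names from the owning node up to the root, and distances to the paragraph head and tail are counted in character strings. The names Node, path_to_root, boundary and features are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    tag: str
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)


def path_to_root(node: Optional[Node]) -> List[str]:
    """Tag names from the owning node up to the root of the target tree."""
    path = []
    while node is not None:
        path.append(node.tag)
        node = node.parent
    return path


def boundary(offset: int, paragraph_length: int,
             first_threshold: int = 1, second_threshold: int = 1) -> Tuple[bool, bool]:
    """Return (at_paragraph_head, at_paragraph_tail) for a string at `offset`."""
    first_distance = offset                          # strings before this one
    second_distance = paragraph_length - 1 - offset  # strings after this one
    return first_distance < first_threshold, second_distance < second_threshold


def features(tokens: List[Tuple[str, Node]], coding_table: Dict[str, int],
             unknown_code: int = 0) -> List[dict]:
    """Compute the characteristic information of each (string, owning node) pair."""
    result = []
    for index, (string, node) in enumerate(tokens, start=1):
        result.append({
            "representation": coding_table.get(string, unknown_code),
            "position": index,                     # sequence number in the text
            "tree_structure": path_to_root(node),  # node and its path to the root
            "tag": node.tag,
            "is_hyperlink": node.tag == "a",
        })
    return result


html = Node("html")
body = Node("body", parent=html)
p = Node("p", parent=body)
a = Node("a", parent=p)

tokens = [("Hello", p), ("world", p), ("link", a)]
coding_table = {"Hello": 101, "world": 102}  # illustrative codes only
feature_info = features(tokens, coding_table)
```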
In another embodiment, the method for extracting web page information further includes:
and storing the characteristic information of each character string in a preset data structure.
Specifically, the preset data structure is an htmlToken structure. Each htmlToken structure corresponds to one character string and includes the representation information, position information and tree structure information of that character string. An htmlToken has the form "index-token-element", where index is the position information of the character string, token is the representation information of the character string, and element is the tree structure information of the character string, that is, the DOM tree structure in which the character string is located.
For example, the plurality of character strings are [token_1, token_2, …, token_n], where n is the number of character strings. The characteristic information of the character strings is htmlToken_1, htmlToken_2, …, htmlToken_n.
106, combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.
The feature information htmlToken_1, htmlToken_2, …, htmlToken_n of the plurality of character strings is combined to obtain the feature representation htmlTokens of the target webpage, where htmlTokens = [htmlToken_1, htmlToken_2, …, htmlToken_n]. The feature representation of the target webpage is an extension of the plurality of character strings, and each element in the feature representation stores the representation information, position information and tree structure information of one character string.
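The htmlToken structure and the combination step can be sketched as follows. The field names follow the "index-token-element" form given above; the concrete field types, the function name combine and the sample codes and paths are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HtmlToken:
    index: int                # position information
    token: int                # representation information (code from the coding table)
    element: Tuple[str, ...]  # tree structure information: owning node, then path to root


def combine(codes: List[int], paths: List[Tuple[str, ...]]) -> List[HtmlToken]:
    """Combine per-string feature information into the page-level representation."""
    if len(codes) != len(paths):
        raise ValueError("codes and paths must describe the same character strings")
    return [
        HtmlToken(index=i, token=code, element=path)
        for i, (code, path) in enumerate(zip(codes, paths), start=1)
    ]


# Illustrative values only: two character strings with assumed codes and paths.
html_tokens = combine(
    [101, 102],
    [("h1", "body", "html"), ("p", "body", "html")],
)
```

The resulting list plays the role of htmlTokens: it has one element per character string, in the order of the strings in the text content.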
In the first embodiment, the webpage features of the target webpage are extracted to obtain the feature representation of the target webpage. The feature representation of the target web page includes representation information of a character string of the target web page, position information, and tree structure information. The webpage information extraction method of the first embodiment improves the accuracy of webpage information extraction.
In another embodiment, the method for extracting web page information further includes:
acquiring a plurality of webpages to be classified;
extracting webpage features of the multiple webpages to be classified;
and classifying the multiple webpages to be classified according to the webpage characteristics of the multiple webpages to be classified.
Example two
Fig. 2 is a structural diagram of a web page information extraction device according to a second embodiment of the present invention. The web page information extraction device 20 is applied to a computer device. The web page information extraction device 20 is used for extracting web page information.
As shown in fig. 2, the web page information extraction apparatus 20 may include a first obtaining module 201, a conversion module 202, a second obtaining module 203, a word segmentation module 204, a determination module 205, and a combination module 206.
The first obtaining module 201 is configured to obtain a target webpage.
In an embodiment, the acquiring the target webpage includes:
acquiring a webpage address of the target webpage;
and downloading the HTML text corresponding to the webpage address by utilizing a webpage crawling tool.
For example, a URL (Uniform Resource Locator) input by a user is received, and the HTML (HyperText Markup Language) text of the target webpage is downloaded from that URL by using a webpage crawling tool.
For another example, the URL address is read from the preset address table, and the HTML text of the target webpage is crawled or downloaded according to the URL address.
In another embodiment, the target webpage is stored in a preset database, and the target webpage is directly read from the preset database.
A conversion module 202, configured to convert the target web page into a target tree.
In a specific embodiment, the converting the target web page into the target tree includes:
and converting the target webpage into the target tree by a depth-first traversal method by utilizing a webpage analysis tool.
Specifically, the webpage parsing tools include the Python-based lxml tool, the Java-based jsoup tool, and the like. The target tree is a DOM (Document Object Model) tree.
In a specific embodiment, the web page information extraction apparatus further includes a preprocessing module, configured to perform normalization processing on the target web page before the target web page is converted into the target tree.
Specifically, the normalization processing mainly removes invisible information such as scripts, comments and style tags. This information is not rendered by a browser and is therefore not visible to the user, so it is removed first to avoid interfering with subsequent processing.
A second obtaining module 203, configured to obtain text content from the target tree.
In a specific embodiment, the preprocessing module is further configured to perform pruning preprocessing on the target tree before the text content is obtained from the target tree.
Specifically, the attribute label of each node in the target tree and a preset first label are obtained; whether the attribute label of each node is consistent with the preset first label is judged; and each node whose attribute label is consistent with the preset first label is determined to be an irrelevant node, and the irrelevant node is deleted.
Alternatively, the attribute label of the irrelevant node may be replaced with a preset second label, where the preset second label comprises the content to be acquired.
Pruning the target tree removes or replaces extraneous content that is not needed, such as the protocol type and the encoding type.
In a specific embodiment, the obtaining the text content from the target tree includes:
and traversing all nodes in the target tree by a recursive method, and reading the text content in each node.
And the word segmentation module 204 is configured to perform word segmentation on the text content to obtain a plurality of character strings.
In a specific embodiment, the text content includes English text, and the segmenting the text content includes:
performing word segmentation on the English text by using whitespace as the delimiter to obtain the plurality of character strings.
In another embodiment, the text content comprises Chinese text, and the segmenting the text content comprises:
performing word segmentation on the Chinese text by using a natural language processing algorithm; or
training a BERT-based word segmentation model, and segmenting the Chinese text with the trained word segmentation model.
The word segmentation model may treat a Chinese sentence as a linear input sequence of single characters and perform sequence labeling on it, that is, assign a tag to each character. The tag set may include {B, M, E, O}, where B indicates that the character begins a word, M indicates that the character is in the middle of a word, E indicates that the character ends a word, and O indicates a word consisting of a single character.
Each character string may include one or more characters.
A determining module 205, configured to determine feature information of each character string, where the feature information includes representation information, location information, and tree structure information.
In a specific embodiment, the determining the feature information of each character string includes:
querying the coding information of each character string from a preset coding table, and determining the coding information of the character string as the representation information of the character string;
numbering the character strings according to the positions of the character strings in the text content, and determining the numbers of the character strings as the position information of the character strings;
and acquiring nodes to which the character strings belong in the target tree and paths from each node to a root node of the target tree, and determining the nodes and paths to which the character strings belong as tree structure information of the character strings.
The position information of each character string may also include the paragraph to which the character string belongs and its place in the text structure (for example, whether the character string belongs to a primary title, a secondary title or the body).
In another embodiment, the characteristic information of the character string further includes tag information, boundary information and hyperlink information of the character string. The tag information includes the tag to which the character string belongs (such as an h tag or a p tag). The boundary information indicates whether the character string is located at the head or the tail of its paragraph: a first distance between the character string and the head of its paragraph and a second distance between the character string and the tail of its paragraph are obtained; when the first distance is smaller than a preset first distance, the character string is located at the paragraph head; and when the second distance is smaller than a preset second distance, the character string is located at the paragraph tail. The hyperlink information indicates whether the character string is a hyperlink: when the tag of the character string is an "<a>" tag, the character string is a hyperlink; otherwise, the character string is not a hyperlink.
In another embodiment, the web page information extraction apparatus further includes a storage module, configured to store feature information of each character string in a preset data structure.
Specifically, the preset data structure is an htmlToken structure. Each htmlToken structure corresponds to one character string and includes the representation information, position information and tree structure information of that character string. An htmlToken has the form "index-token-element", where index is the position information of the character string, token is the representation information of the character string, and element is the tree structure information of the character string, that is, the DOM tree structure in which the character string is located.
For example, the plurality of character strings are [token_1, token_2, …, token_n], where n is the number of character strings. The characteristic information of the character strings is htmlToken_1, htmlToken_2, …, htmlToken_n.
And the combining module 206 is configured to combine the feature information of the plurality of character strings to obtain a feature representation of the target webpage.
The feature information htmlToken_1, htmlToken_2, …, htmlToken_n of the character strings is combined to obtain the feature representation htmlTokens of the target webpage, where htmlTokens = [htmlToken_1, htmlToken_2, …, htmlToken_n]. The feature representation of the target webpage is an extension of the plurality of character strings, and each element in the feature representation stores the representation information, position information and tree structure information of one character string.
The web page information extraction device 20 according to the second embodiment extracts the web page features of the target web page to obtain the feature representation of the target web page. The feature representation of the target web page includes representation information of a character string of the target web page, position information, and tree structure information. The second embodiment improves the accuracy of extracting the webpage information.
In another embodiment, the web page information extraction device further comprises a classification module, configured to obtain a plurality of web pages to be classified; extracting webpage features of the multiple webpages to be classified; and classifying the multiple webpages to be classified according to the webpage characteristics of the multiple webpages to be classified.
Example three
The present embodiment provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the steps of the above webpage information extraction method, for example, steps 101-106 shown in fig. 1:
101, acquiring a target webpage;
102, converting the target webpage into a target tree;
103, acquiring text content from the target tree;
104, performing word segmentation on the text content to obtain a plurality of character strings;
105, determining feature information of each character string, wherein the feature information comprises representation information, position information and tree structure information;
106, combining the feature information of the plurality of character strings to obtain the feature representation of the target webpage.
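Steps 102 to 106 can be sketched end to end with the Python standard library. This is a minimal illustration, not the patented implementation: the HTML text is assumed to be already downloaded (step 101), whitespace splitting stands in for the word segmentation of step 104, the coding table is a plain dictionary with 0 for unknown strings, and the names `Node`, `TreeBuilder`, `text_nodes` and `extract_features` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, Iterator, List, Optional, Tuple

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One node of the target tree; text is kept in '#text' leaf nodes."""
    def __init__(self, tag: str, parent: Optional["Node"] = None, value: str = ""):
        self.tag = tag
        self.parent = parent
        self.value = value
        self.children: List["Node"] = []

    def path_to_root(self) -> Tuple[str, ...]:
        node, path = self, []
        while node is not None:
            path.append(node.tag)
            node = node.parent
        return tuple(path)


class TreeBuilder(HTMLParser):
    """Step 102: convert HTML text into a tree of Node objects."""
    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:          # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current               # find the nearest open element
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data.strip()))


def text_nodes(node: Node) -> Iterator[Node]:
    """Step 103: depth-first walk yielding text leaves in document order."""
    if node.tag == "#text":
        yield node
    for child in node.children:
        yield from text_nodes(child)


def extract_features(html: str, coding_table: Dict[str, int]) -> List[dict]:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    features: List[dict] = []
    for leaf in text_nodes(builder.root):
        for token in leaf.value.split():              # step 104 (stand-in)
            features.append({                          # steps 105 and 106
                "token": token,
                "representation": coding_table.get(token, 0),
                "position": len(features) + 1,
                "node": leaf.parent.tag,
                "path": leaf.parent.path_to_root(),
            })
    return features


features = extract_features(
    "<html><body><h1>Hello world</h1><p>Hi<br>there</p></body></html>",
    {"Hello": 1, "world": 2},
)
```

For Chinese text the whitespace split would have to be replaced by a real segmenter, since Chinese words are not separated by spaces.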
Alternatively, when executed by the processor, the computer program implements the functions of the modules in the above device embodiment, for example modules 201-206 in Fig. 2:
a first obtaining module 201, configured to obtain a target webpage;
a conversion module 202, configured to convert the target webpage into a target tree;
a second obtaining module 203, configured to obtain text content from the target tree;
a word segmentation module 204, configured to perform word segmentation on the text content to obtain a plurality of character strings;
a determining module 205, configured to determine feature information of each character string, where the feature information includes representation information, location information, and tree structure information;
a combining module 206, configured to combine the feature information of the plurality of character strings to obtain a feature representation of the target webpage.
Example four
Fig. 3 is a schematic diagram of a computer device according to the fourth embodiment of the present invention. The computer device 30 comprises a memory 301, a processor 302, and a computer program 303, such as a webpage information extraction program, stored in the memory 301 and executable on the processor 302. When executing the computer program 303, the processor 302 implements the steps of the above webpage information extraction method embodiment, for example steps 101-106 shown in Fig. 1:
101, acquiring a target webpage;
102, converting the target webpage into a target tree;
103, acquiring text content from the target tree;
104, performing word segmentation on the text content to obtain a plurality of character strings;
105, determining feature information of each character string, wherein the feature information comprises representation information, position information and tree structure information;
106, combining the feature information of the plurality of character strings to obtain the feature representation of the target webpage.
Alternatively, when executed by the processor, the computer program implements the functions of the modules in the above device embodiment, for example modules 201-206 in Fig. 2:
a first obtaining module 201, configured to obtain a target webpage;
a conversion module 202, configured to convert the target webpage into a target tree;
a second obtaining module 203, configured to obtain text content from the target tree;
a word segmentation module 204, configured to perform word segmentation on the text content to obtain a plurality of character strings;
a determining module 205, configured to determine feature information of each character string, where the feature information includes representation information, location information, and tree structure information;
a combining module 206, configured to combine the feature information of the plurality of character strings to obtain a feature representation of the target webpage.
Illustratively, the computer program 303 may be partitioned into one or more modules, which are stored in the memory 301 and executed by the processor 302 to carry out the method. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program 303 in the computer device 30. For example, the computer program 303 may be divided into the first obtaining module 201, the conversion module 202, the second obtaining module 203, the word segmentation module 204, the determining module 205, and the combining module 206 shown in Fig. 2; the specific functions of each module are described in the second embodiment.
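The partition into modules can be pictured as six functions executed in sequence, each consuming the output of the previous one. The sketch below is deliberately a toy and rests on several assumptions: the page comes from a hypothetical in-memory dictionary instead of a download, the "tree" is a flat list of (tag, text) pairs obtained with a regular expression that ignores nested markup, and the token itself stands in for its representation information. Only the module names and numbers come from the text.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for a downloaded page.
PAGES: Dict[str, str] = {"http://example.test/": "<h1>Title here</h1><p>Body text</p>"}


def first_obtaining_module(url: str) -> str:                      # module 201
    return PAGES[url]


def conversion_module(html: str) -> List[Tuple[str, str]]:        # module 202
    # Toy "tree": flat (tag, text) pairs; nested markup is not handled.
    return re.findall(r"<(\w+)>([^<]*)</\1>", html)


def second_obtaining_module(tree: List[Tuple[str, str]]) -> List[Tuple[str, str]]:  # 203
    return [(tag, text.strip()) for tag, text in tree if text.strip()]


def word_segmentation_module(texts: List[Tuple[str, str]]) -> List[Tuple[str, str]]:  # 204
    return [(tag, token) for tag, text in texts for token in text.split()]


def determining_module(tokens: List[Tuple[str, str]]) -> List[dict]:  # module 205
    return [{"representation": token, "position": i, "tree": tag}
            for i, (tag, token) in enumerate(tokens, start=1)]


def combining_module(features: List[dict]) -> List[dict]:         # module 206
    return list(features)


MODULES: List[Callable[[Any], Any]] = [
    first_obtaining_module,
    conversion_module,
    second_obtaining_module,
    word_segmentation_module,
    determining_module,
    combining_module,
]


def run(url: str) -> List[dict]:
    """Execute the modules in order, feeding each one the previous result."""
    value: Any = url
    for module in MODULES:
        value = module(value)
    return value


result = run("http://example.test/")
```

Because each module only depends on the shape of its input, any one of them could be replaced, merged with a neighbour, or split further without changing the others, which is the kind of flexibility in module division that the description allows for.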
Those skilled in the art will appreciate that Fig. 3 is merely an example of the computer device 30 and does not limit it. The computer device 30 may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input and output devices, network access devices, buses, etc.
The processor 302 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor 302 is the control center of the computer device 30 and connects the various parts of the computer device 30 through various interfaces and lines.
The memory 301 may be used to store the computer program 303, and the processor 302 may implement various functions of the computer device 30 by running or executing the computer program or module stored in the memory 301 and calling data stored in the memory 301. The memory 301 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the computer device 30, and the like. Further, the memory 301 may include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
If the modules integrated in the computer device 30 are implemented as software functional modules and sold or used as separate products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
It should be understood that the system, apparatus and method disclosed in the embodiments of the present invention may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only one logical division of functions, and other divisions are possible in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in hardware, or in hardware combined with software functional modules.
An integrated module implemented as a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the webpage information extraction method according to the embodiments of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. Furthermore, it is to be understood that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. A plurality of modules or means recited in the system claims may also be implemented by one module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A webpage information extraction method is characterized by comprising the following steps:
acquiring a target webpage;
converting the target webpage into a target tree;
acquiring text content from the target tree;
performing word segmentation on the text content to obtain a plurality of character strings;
determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information;
and combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.
2. The web page information extraction method according to claim 1, wherein the acquiring the target web page includes:
acquiring a webpage address of the target webpage;
and downloading the HTML text corresponding to the webpage address by utilizing a webpage crawling tool.
3. The web page information extraction method of claim 1, wherein the converting the target web page into a target tree comprises:
and converting the target webpage into the target tree by a depth-first traversal method by utilizing a webpage analysis tool.
4. The web page information extraction method according to claim 1, wherein, before said converting the target web page into the target tree, the web page information extraction method further comprises:
and carrying out standardization processing on the target webpage.
5. The web page information extraction method according to claim 1, wherein, before the acquiring of the text content from the target tree, the web page information extraction method further comprises:
performing pruning preprocessing on the target tree.
6. The web page information extraction method according to claim 1, wherein the text content includes Chinese text, and the performing word segmentation on the text content includes:
performing word segmentation on the Chinese text by using a natural language processing algorithm; or
training a word segmentation model based on BERT, and performing word segmentation on the Chinese text with the trained word segmentation model.
7. The web page information extraction method according to claim 1, wherein the determining the feature information of each character string includes:
looking up the coding information of each character string in a preset coding table, and determining the coding information of each character string as the representation information of that character string;
numbering the plurality of character strings sequentially according to their positions in the text content, and determining the number of each character string as the position information of that character string;
and acquiring the node of each character string in the target tree and the path from that node to the root node of the target tree, and determining the node and the path of each character string as the tree structure information of that character string.
8. A web page information extraction device, characterized by comprising:
a first obtaining module, configured to obtain a target webpage;
a conversion module, configured to convert the target webpage into a target tree;
a second obtaining module, configured to obtain text content from the target tree;
a word segmentation module, configured to perform word segmentation on the text content to obtain a plurality of character strings;
a determining module, configured to determine feature information of each character string, wherein the feature information comprises representation information, position information and tree structure information;
and a combining module, configured to combine the feature information of the plurality of character strings to obtain a feature representation of the target webpage.
9. A computer device comprising a processor for executing a computer program stored in a memory to implement the web page information extraction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the web page information extraction method according to any one of claims 1 to 7.
CN202010569085.0A 2020-06-19 2020-06-19 Webpage information extraction method and related equipment Pending CN111737623A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010569085.0A CN111737623A (en) 2020-06-19 2020-06-19 Webpage information extraction method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010569085.0A CN111737623A (en) 2020-06-19 2020-06-19 Webpage information extraction method and related equipment

Publications (1)

Publication Number Publication Date
CN111737623A true CN111737623A (en) 2020-10-02

Family

ID=72651915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010569085.0A Pending CN111737623A (en) 2020-06-19 2020-06-19 Webpage information extraction method and related equipment

Country Status (1)

Country Link
CN (1) CN111737623A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218515A (en) * 2021-12-21 2022-03-22 北京大学 Web digital object extraction method and system based on content segmentation
CN114610985A (en) * 2022-05-10 2022-06-10 北京百炼智能科技有限公司 Information extraction method and device, electronic equipment and storage medium
CN114817804A (en) * 2022-03-31 2022-07-29 北京达佳互联信息技术有限公司 Webpage generation method and device, electronic equipment and storage medium
WO2022179128A1 (en) * 2021-02-25 2022-09-01 深圳壹账通智能科技有限公司 Crawler-based data crawling method and apparatus, computer device, and storage medium
CN115146592A (en) * 2022-07-14 2022-10-04 中国银行股份有限公司 Document marking method and device, electronic equipment and storage medium
CN116049597A (en) * 2023-01-10 2023-05-02 北京百度网讯科技有限公司 Pre-training method and device for multi-task model of webpage and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101491A (en) * 2018-07-24 2018-12-28 湖南星汉数智科技有限公司 A kind of author information abstracting method, device, computer installation and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邵振凯 (Shao Zhenkai): "Web page information extraction technology" (网页信息提取技术), 《计算机技术与发展》 (Computer Technology and Development), no. 09, pages 36-38 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022179128A1 (en) * 2021-02-25 2022-09-01 深圳壹账通智能科技有限公司 Crawler-based data crawling method and apparatus, computer device, and storage medium
CN114218515A (en) * 2021-12-21 2022-03-22 北京大学 Web digital object extraction method and system based on content segmentation
CN114218515B (en) * 2021-12-21 2022-09-06 北京大学 Web digital object extraction method and system based on content segmentation
CN114817804A (en) * 2022-03-31 2022-07-29 北京达佳互联信息技术有限公司 Webpage generation method and device, electronic equipment and storage medium
CN114610985A (en) * 2022-05-10 2022-06-10 北京百炼智能科技有限公司 Information extraction method and device, electronic equipment and storage medium
CN114610985B (en) * 2022-05-10 2022-08-19 北京百炼智能科技有限公司 Information extraction method and device, electronic equipment and storage medium
CN115146592A (en) * 2022-07-14 2022-10-04 中国银行股份有限公司 Document marking method and device, electronic equipment and storage medium
CN116049597A (en) * 2023-01-10 2023-05-02 北京百度网讯科技有限公司 Pre-training method and device for multi-task model of webpage and electronic equipment
CN116049597B (en) * 2023-01-10 2024-04-19 北京百度网讯科技有限公司 Pre-training method and device for multi-task model of webpage and electronic equipment

Similar Documents

Publication Publication Date Title
CN111737623A (en) Webpage information extraction method and related equipment
US7958444B2 (en) Visualizing document annotations in the context of the source document
CN109933796B (en) Method and device for extracting key information of bulletin text
WO2017080090A1 (en) Extraction and comparison method for text of webpage
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
JP2010086517A (en) Computer-implemented method for extracting data from web page
CN112084381A (en) Event extraction method, system, storage medium and equipment
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN112084746A (en) Entity identification method, system, storage medium and equipment
CN106960058A (en) A kind of structure of web page alteration detection method and system
CN109165373B (en) Data processing method and device
CN115438162A (en) Knowledge graph-based disease question-answering method, system, equipment and storage medium
Kim Analysis of standard vocabulary use of the open government data: the case of the public data portal of Korea
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN112035723A (en) Resource library determination method and device, storage medium and electronic device
Zanibbi et al. Math search for the masses: Multimodal search interfaces and appearance-based retrieval
CN106372232B (en) Information mining method and device based on artificial intelligence
KR102457962B1 (en) Method and apparatus for extracting metadata of thesis
CN111339457A (en) Method and apparatus for extracting information from web page and storage medium
Jou Schema extraction for deep web query interfaces using heuristics rules
CN108694192B (en) Webpage type judging method and device
Furche et al. Turn the page: automated traversal of paginated websites
CN108959248A (en) A kind of entity mask method and device, computer readable storage medium
CN114637505A (en) Page content extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination