CN111737623A

CN111737623A - Webpage information extraction method and related equipment

Info

Publication number: CN111737623A
Application number: CN202010569085.0A
Authority: CN
Inventors: 刘镇熙
Original assignee: Shenzhen Xiaoman Technology Co ltd
Current assignee: Shenzhen Xiaoman Technology Co ltd
Priority date: 2020-06-19
Filing date: 2020-06-19
Publication date: 2020-10-02

Abstract

The invention provides a webpage information extraction method, a webpage information extraction device, computer equipment and a computer readable storage medium. The webpage information extraction method comprises: acquiring a target webpage; converting the target webpage into a target tree; acquiring text content from the target tree; performing word segmentation on the text content to obtain a plurality of character strings; determining the representation information, the position information and the tree structure information of each character string, and recording them as the characteristic information of the character string; and combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage. The invention improves the accuracy of extracting the webpage information.

Description

Webpage information extraction method and related equipment

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a device for extracting webpage information, computer equipment and a computer readable storage medium.

Background

Generally, before analyzing the content of a web page, information extraction needs to be performed on the web page. At present, when information of a webpage is extracted, only text information of the webpage is extracted, and structural information of a text in the webpage is not extracted. The structural information of the text in the web page is likely to have an important role in analyzing the content of the web page. Extracting inaccurate web page features will affect the analysis of web page content.

How to improve the accuracy of webpage information extraction becomes a problem to be solved.

Disclosure of Invention

In view of the foregoing, there is a need for a method, an apparatus, a computer device and a computer readable storage medium for extracting webpage information, which can improve the accuracy of webpage information extraction.

A first aspect of the present application provides a method for extracting web page information, where the method for extracting web page information includes:

acquiring a target webpage;

converting the target webpage into a target tree;

acquiring text content from the target tree;

performing word segmentation on the text content to obtain a plurality of character strings;

determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information;

and combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.

In another possible implementation manner, the acquiring the target webpage includes:

acquiring a webpage address of the target webpage;

and downloading the HTML text corresponding to the webpage address by utilizing a webpage crawling tool.

In another possible implementation manner, the converting the target webpage into the target tree includes:

and converting the target webpage into the target tree by a depth-first traversal method by utilizing a webpage analysis tool.

In another possible implementation manner, before the converting the target webpage into the target tree, the webpage information extracting method further includes:

and carrying out standardization processing on the target webpage.

In another possible implementation manner, before the obtaining the text content from the target tree, the method for extracting the web page information further includes:

and carrying out pruning pretreatment on the target tree.

In another possible implementation manner, the text content includes a Chinese text, and the segmenting the text content includes:

performing word segmentation on the Chinese text by combining a natural language processing algorithm; or

training a word segmentation model based on BERT, and segmenting the Chinese text by the trained word segmentation model.

In another possible implementation manner, the determining the feature information of each character string includes:

inquiring the coding information of each character string from a preset coding table, and determining the coding information of each character string as the representation information of the character string;

numbering the character strings in sequence according to the positions of the character strings in the text content, and determining the number of each character string as the position information of each character string;

and acquiring nodes of the character strings in the target tree and paths from the nodes to root nodes of the target tree, and determining the nodes and paths of the character strings as tree structure information of the character strings.

A second aspect of the present application provides a web page information extraction apparatus, including:

the first acquisition module is used for acquiring a target webpage;

the conversion module is used for converting the target webpage into a target tree;

the second acquisition module is used for acquiring text contents from the target tree;

the word segmentation module is used for segmenting words of the text content to obtain a plurality of character strings;

the determining module is used for determining characteristic information of each character string, and the characteristic information comprises representation information, position information and tree structure information;

and the combination module is used for combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.

A third aspect of the present application provides a computer device comprising a processor for implementing the web page information extraction method when executing a computer program stored in a memory.

A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the web page information extraction method.

The method includes the steps of obtaining a target webpage; converting the target webpage into a target tree; acquiring text content from the target tree; performing word segmentation on the text content to obtain a plurality of character strings; determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information; and combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage. The invention improves the accuracy of extracting the webpage information.

Drawings

Fig. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention.

Fig. 2 is a block diagram of a web page information extraction device according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention, and the described embodiments are merely a subset of the embodiments of the present invention, rather than a complete embodiment. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

Preferably, the webpage information extraction method is applied to one or more computer devices. The computer device is a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and the hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.

Example one

Fig. 1 is a flowchart of a method for extracting web page information according to an embodiment of the present invention. The webpage information extraction method is applied to computer equipment and used for extracting webpage information.

As shown in fig. 1, the web page information extraction method includes:

101, acquiring a target webpage.

In an embodiment, the acquiring the target webpage includes:

acquiring a webpage address of the target webpage;

and downloading the HTML text corresponding to the webpage address by utilizing a webpage crawling tool.

For example, a URL (Uniform Resource Locator) address input by a user is received, and the HTML (HyperText Markup Language) text of the target web page is downloaded according to the URL address by using a web page crawling tool.

For another example, the URL address is read from the preset address table, and the HTML text of the target webpage is crawled or downloaded according to the URL address.

In another embodiment, the target webpage is stored in a preset database, and the target webpage is directly read from the preset database.
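A minimal sketch of step 101, assuming Python's standard `urllib` stands in for the unspecified webpage crawling tool. The names `fetch_html` and `fetch_from_address_table`, the timeout value, and the demo `data:` URL are illustrative assumptions and do not come from the disclosure.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10.0):
    """Download the document behind `url` and return it as text."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Fall back to UTF-8 when no charset is declared (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def fetch_from_address_table(address_table):
    """Fetch every URL listed in a preset address table (hypothetical helper)."""
    return {url: fetch_html(url) for url in address_table}


# A data: URL keeps the example runnable without network access.
demo_url = "data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E"
pages = fetch_from_address_table([demo_url])
print(pages[demo_url])
```

In practice the address table would hold ordinary http or https addresses; a production crawler would likely also need retries and rate limiting, which this sketch omits.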

102, converting the target webpage into a target tree.

In a specific embodiment, the converting the target web page into the target tree includes:

converting the target webpage into the target tree by a depth-first traversal method by utilizing a webpage parsing tool.

Specifically, the web page parsing tools include the Python-based lxml tool, the Java-based jsoup tool, and the like. The target tree is a DOM (Document Object Model) tree.
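The conversion in step 102 can be sketched as follows. This is an illustration only: it uses Python's standard `html.parser` in place of lxml or jsoup so that it runs without extra packages, and the `Node`, `TreeBuilder`, `to_tree` and `depth_first_tags` names, as well as the `VOID_TAGS` set, are assumptions made for the example. The parser reports tags in document order, which matches a depth-first, pre-order visit of the resulting tree.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, assumed for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the target tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then leave it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def to_tree(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def depth_first_tags(node):
    """Yield tag names in depth-first (pre-order) sequence."""
    yield node.tag
    for child in node.children:
        yield from depth_first_tags(child)


tree = to_tree("<html><body><h1>Title</h1><p>Hi <a href='#'>link</a></p></body></html>")
print(list(depth_first_tags(tree)))
```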

In a specific embodiment, before the converting the target webpage into the target tree, the webpage information extracting method further includes:

and carrying out standardization processing on the target webpage.

Specifically, the normalization processing mainly includes removing invisible information such as scripts, comments and style labels, the invisible information is invisible on a page and cannot be presented to a user through a browser, and the invisible information needs to be removed first in order to avoid interference with subsequent processing.
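As a rough illustration of the normalization described above, the sketch below re-emits the HTML while dropping scripts, style blocks and comments. It relies on Python's standard `html.parser`; the `Normalizer` class, the `normalize` function and the `INVISIBLE_TAGS` set are hypothetical names, and the disclosure does not say which tool performs this step.

```python
from html import escape
from html.parser import HTMLParser

INVISIBLE_TAGS = {"script", "style"}  # assumed set of invisible elements


class Normalizer(HTMLParser):
    """Copies the input HTML, skipping invisible elements and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if self.skip_depth == 0 and tag not in INVISIBLE_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text arrives unescaped, so escape it again before re-emitting.
            self.parts.append(escape(data, quote=False))

    # handle_comment is not overridden, so comments are silently dropped.


def normalize(html_text):
    parser = Normalizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


sample = "<p>Hi<!-- note --><script>var x = 1;</script><style>p {color: red}</style> there</p>"
normalized = normalize(sample)
print(normalized)
```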

103, acquiring text content from the target tree.

In a specific embodiment, before the obtaining the text content from the target tree, the method for extracting the web page information further includes:

and carrying out pruning pretreatment on the target tree.

Specifically, an attribute label of each node in the target tree and a preset first label are obtained; whether the attribute label of each node is consistent with the preset first label is judged; and each node whose attribute label is consistent with the preset first label is determined to be an irrelevant node and is deleted.

The attribute label of the irrelevant node can alternatively be replaced by a preset second label. The preset second label comprises the content to be acquired.

Pruning preprocessing of the target tree removes or replaces extraneous content that is not needed, such as the protocol type, the encoding type, etc.
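The pruning step might look like the sketch below. It makes several assumptions: the "attribute label" of a node is taken to be its tag name, the tree is held in Python's `xml.etree.ElementTree` (which requires well-formed markup), and `prune`, `remove_keep_tail`, `first_labels` and `second_labels` are names invented for the example.

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(parent, child):
    """Remove `child` but keep the text that follows it inside `parent`."""
    index = list(parent).index(child)
    tail = child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def prune(root, first_labels, second_labels=None):
    """Delete nodes whose label is in `first_labels`.

    `second_labels` maps a label to its replacement; nodes listed there
    are relabeled instead of deleted.
    """
    second_labels = second_labels or {}
    for parent in list(root.iter()):      # snapshot, since the tree is mutated
        for child in list(parent):
            if child.tag in second_labels:
                child.tag = second_labels[child.tag]
            elif child.tag in first_labels:
                remove_keep_tail(parent, child)
    return root


root = ET.fromstring(
    "<body><nav>menu</nav><p>Hello <b>big</b> world</p><footer>c</footer></body>"
)
prune(root, first_labels={"nav", "footer"}, second_labels={"b": "strong"})
pruned = ET.tostring(root, encoding="unicode")
print(pruned)
```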

In a specific embodiment, the obtaining the text content from the target tree includes:

and traversing all nodes in the target tree by a recursive method, and reading the text content in each node.
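A short sketch of the recursive traversal, assuming the target tree is an `xml.etree.ElementTree` element. The function name `collect_text` is hypothetical. Each piece of text is returned together with the tag of the node that directly contains it, which later steps can use for tree structure information.

```python
import xml.etree.ElementTree as ET


def collect_text(node, out=None):
    """Recursively visit every node and collect (owner tag, text) pairs."""
    if out is None:
        out = []
    if node.text and node.text.strip():
        out.append((node.tag, node.text.strip()))
    for child in node:
        collect_text(child, out)
        # Text after a child element still belongs to the current node.
        if child.tail and child.tail.strip():
            out.append((node.tag, child.tail.strip()))
    return out


root = ET.fromstring("<body><h1>Title</h1><p>Hello <a>link</a> world</p></body>")
texts = collect_text(root)
print(texts)
```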

104, performing word segmentation on the text content to obtain a plurality of character strings.

In a specific embodiment, the text content includes an English text, and the segmenting the text content includes:

performing word segmentation on the English text by using whitespace as the word segmentation mark to obtain the plurality of character strings.

In another embodiment, the text content comprises Chinese text, and the segmenting the text content comprises:

performing word segmentation on the Chinese text by combining a natural language processing algorithm; or

training a word segmentation model based on BERT, and segmenting the Chinese text by the trained word segmentation model.

The word segmentation model can take the Chinese sentence as a linear input sequence consisting of single characters and perform sequence labeling on the linear input sequence (namely, each character is marked with a label). The set of labels may include { B, M, E, O }, where B marks the beginning character of a word, M marks a middle character of a word, E marks the ending character of a word, and O marks a word consisting of a single character.

Each string may include one or more words.
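To make the { B, M, E, O } labeling concrete, the sketch below turns a labeled character sequence into character strings. The labels are hard-coded here; in the described embodiment they would be predicted by the trained word segmentation model. The function name `decode_bmeo` and the sample sentence are illustrative assumptions.

```python
def decode_bmeo(chars, tags):
    """Group characters into words according to their B/M/E/O labels."""
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "O":            # single-character word
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # start of a multi-character word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # continuation of the current word
            current += ch
        elif tag == "E":          # end of the current word
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown label: {tag!r}")
    if current:                   # tolerate a sequence that ends mid-word
        words.append(current)
    return words


sentence = "我爱北京天安门"
labels = ["O", "O", "B", "E", "B", "M", "E"]
segments = decode_bmeo(sentence, labels)
print(segments)
```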

105, determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information.

In a specific embodiment, the determining the feature information of each character string includes:

inquiring coding information of each character string from a preset coding table, and determining the coding information of the character string as representing information of the character string;

numbering the character strings according to the positions of the character strings in the text content, and determining the numbers of the character strings as the position information of the character strings;

and acquiring nodes to which the character strings belong in the target tree and paths from each node to a root node of the target tree, and determining the nodes and paths to which the character strings belong as tree structure information of the character strings.

The position information of each character string may also include the paragraph to which the character string belongs and its position in the text structure (e.g., whether the character string belongs to a primary title, a secondary title or the body).
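The three kinds of characteristic information can be sketched together as follows. The example assumes an `xml.etree.ElementTree` tree, whitespace segmentation, and a small dictionary as the preset coding table; `extract_features`, `path_to_root` and `unknown_code` are hypothetical names, and the code 0 for strings missing from the table is an assumption.

```python
import xml.etree.ElementTree as ET


def path_to_root(node, parent_of):
    """Return the tags on the path from `node` up to the root."""
    path = [node.tag]
    while node in parent_of:
        node = parent_of[node]
        path.append(node.tag)
    return path


def extract_features(root, coding_table, unknown_code=0):
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    features = []

    def emit(text, owner):
        for word in (text or "").split():
            features.append({
                "representation": coding_table.get(word, unknown_code),
                "position": len(features),          # sequential number
                "node": owner.tag,
                "path": path_to_root(owner, parent_of),
            })

    def visit(node):
        emit(node.text, node)
        for child in node:
            visit(child)
            emit(child.tail, node)  # tail text belongs to the parent element

    visit(root)
    return features


root = ET.fromstring(
    "<html><body><h1>Hello world</h1><p>Hello <a>there</a></p></body></html>"
)
features = extract_features(root, {"Hello": 1, "world": 2})
for item in features:
    print(item)
```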

In another embodiment, the characteristic information of the character string further includes label information, boundary information and hyperlink information of the character string. The label information includes the label (such as an h label or a p label) to which the character string belongs. The boundary information indicates whether the character string is located at the paragraph head or the paragraph tail: a first distance between the character string and the head of its paragraph and a second distance between the character string and the tail of its paragraph are obtained; when the first distance is smaller than a preset first distance, the character string is located at the paragraph head; and when the second distance is smaller than a preset second distance, the character string is located at the paragraph tail. The hyperlink information indicates whether the character string is a hyperlink: when the label of the character string is an "<a>" label, the character string is a hyperlink; otherwise, the character string is not a hyperlink.
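The boundary and hyperlink checks reduce to a few comparisons, sketched below. Distances are measured in character strings from the paragraph head and tail, which is an assumption since the disclosure does not fix the unit; the thresholds of 2 and the names `boundary_info` and `is_hyperlink` are likewise illustrative.

```python
def boundary_info(index, paragraph_length, preset_first=2, preset_second=2):
    """Decide whether the string at `index` sits at the paragraph head or tail."""
    first_distance = index                          # distance to paragraph head
    second_distance = paragraph_length - 1 - index  # distance to paragraph tail
    return {
        "at_head": first_distance < preset_first,
        "at_tail": second_distance < preset_second,
    }


def is_hyperlink(label):
    """A string is a hyperlink when its enclosing label is <a>."""
    return label.lower() == "a"


print(boundary_info(0, 10))
print(boundary_info(5, 10))
print(is_hyperlink("a"), is_hyperlink("p"))
```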

In another embodiment, the method for extracting web page information further includes:

and storing the characteristic information of each character string in a preset data structure.

Specifically, the preset data structure is an htmlToken structure. Each htmlToken structure corresponds to one character string and includes the representation information, position information and tree structure information of that character string. An htmlToken has the form "index-token-element", where index is the position information of the character string, token is the representation information of the character string, and element is the tree structure information of the character string, that is, the DOM tree structure in which the character string is located.

For example, the plurality of character strings are [token_1, token_2, …, token_n], where n is the number of the plurality of character strings. The characteristic information of the character strings is htmlToken_1, htmlToken_2, …, htmlToken_n.

106, combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.

The feature information htmlToken_1, htmlToken_2, …, htmlToken_n of the plurality of character strings is combined to obtain the feature representation htmlTokens of the target webpage, where htmlTokens = [htmlToken_1, htmlToken_2, …, htmlToken_n]. The feature representation of the target webpage is an extension of the plurality of character strings, and each element in the feature representation stores representation information, position information and tree structure information of one character string.
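One possible shape for the htmlToken structure and the combination in step 106 is sketched below. The `HtmlToken` dataclass, the `combine` function, the use of integer codes for `token`, and the slash-joined rendering of `element` are assumptions chosen for the example; the disclosure only specifies the "index-token-element" form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HtmlToken:
    index: int      # position information
    token: int      # representation information (code from the coding table)
    element: tuple  # tree structure information: tags from node to root

    def __str__(self):
        # Render in the "index-token-element" form.
        return f"{self.index}-{self.token}-{'/'.join(self.element)}"


def combine(codes, paths):
    """Build htmlTokens = [htmlToken_1, ..., htmlToken_n] from parallel lists."""
    return [
        HtmlToken(index, code, tuple(path))
        for index, (code, path) in enumerate(zip(codes, paths))
    ]


html_tokens = combine(
    codes=[17, 42],
    paths=[["h1", "body", "html"], ["p", "body", "html"]],
)
for item in html_tokens:
    print(item)
```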

In the first embodiment, the webpage features of the target webpage are extracted to obtain the feature representation of the target webpage. The feature representation of the target web page includes representation information of a character string of the target web page, position information, and tree structure information. The webpage information extraction method of the first embodiment improves the accuracy of webpage information extraction.

In another embodiment, the webpage information extraction method further includes:

acquiring a plurality of webpages to be classified;

extracting webpage features of the multiple webpages to be classified;

and classifying the multiple webpages to be classified according to the webpage characteristics of the multiple webpages to be classified.

Example two

Fig. 2 is a structural diagram of a web page information extraction device according to a second embodiment of the present invention. The web page information extraction device 20 is applied to a computer device. The web page information extraction device 20 is used for extracting web page information.

As shown in fig. 2, the web page information extraction apparatus 20 may include a first obtaining module 201, a conversion module 202, a second obtaining module 203, a word segmentation module 204, a determination module 205, and a combination module 206.

The first obtaining module 201 is configured to obtain a target webpage.

In an embodiment, the acquiring the target webpage includes:

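acquiring a webpage address of the target webpage;

and downloading the HTML text corresponding to the webpage address by utilizing a webpage crawling tool.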

A conversion module 202, configured to convert the target web page into a target tree.

In a specific embodiment, the web page information extraction apparatus further includes a preprocessing module, configured to perform normalization processing on the target web page before the target web page is converted into the target tree.

A second obtaining module 203, configured to obtain text content from the target tree.

In a specific embodiment, the preprocessing module is further configured to perform pruning preprocessing on the target tree before the text content is obtained from the target tree.

And the word segmentation module 204 is configured to perform word segmentation on the text content to obtain a plurality of character strings.

Each string may include one or more words.

A determining module 205, configured to determine feature information of each character string, where the feature information includes representation information, location information, and tree structure information.

In another embodiment, the web page information extraction apparatus further includes a storage module, configured to store feature information of each character string in a preset data structure.

For example, the plurality of character strings are [token_1, token_2, …, token_n], where n is the number of the plurality of character strings. The characteristic information of the character strings is htmlToken_1, htmlToken_2, …, htmlToken_n.

And the combining module 206 is configured to combine the feature information of the plurality of character strings to obtain a feature representation of the target webpage.

The feature information htmlToken_1, htmlToken_2, …, htmlToken_n of the character strings is combined to obtain the feature representation htmlTokens of the target webpage, where htmlTokens = [htmlToken_1, htmlToken_2, …, htmlToken_n]. The feature representation of the target webpage is an extension of the plurality of character strings, and each element in the feature representation stores representation information, position information and tree structure information of one character string.

The web page information extraction device 20 according to the second embodiment extracts the web page features of the target web page to obtain the feature representation of the target web page. The feature representation of the target web page includes representation information of a character string of the target web page, position information, and tree structure information. The second embodiment improves the accuracy of extracting the webpage information.

In another embodiment, the web page information extraction device further comprises a classification module, configured to obtain a plurality of web pages to be classified; extracting webpage features of the multiple webpages to be classified; and classifying the multiple webpages to be classified according to the webpage characteristics of the multiple webpages to be classified.

Example three

The present embodiment provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method for extracting webpage information, for example, steps 101-106 shown in fig. 1:

101, acquiring a target webpage;

102, converting the target webpage into a target tree;

103, acquiring text content from the target tree;

104, performing word segmentation on the text content to obtain a plurality of character strings;

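105, determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information;

106, combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.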

Alternatively, the computer program, when executed by the processor, implements the functions of the modules in the above device embodiments, such as modules 201-206 in fig. 2:

a first obtaining module 201, configured to obtain a target webpage;

a conversion module 202, configured to convert the target webpage into a target tree;

a second obtaining module 203, configured to obtain text content from the target tree;

a word segmentation module 204, configured to perform word segmentation on the text content to obtain a plurality of character strings;

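a determining module 205, configured to determine feature information of each character string, where the feature information includes representation information, location information, and tree structure information;

a combining module 206, configured to combine the feature information of the plurality of character strings to obtain a feature representation of the target webpage.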

Example four

Fig. 3 is a schematic diagram of a computer device according to a fourth embodiment of the present invention. The computer device 30 comprises a memory 301, a processor 302 and a computer program 303, such as a web page information extraction program, stored in the memory 301 and executable on the processor 302. The processor 302, when executing the computer program 303, implements the steps in the above-mentioned webpage information extracting method embodiment, such as steps 101-106 shown in fig. 1:

101, acquiring a target webpage;

102, converting the target webpage into a target tree;

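103, acquiring text content from the target tree;

104, performing word segmentation on the text content to obtain a plurality of character strings;

105, determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information;

106, combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.

Alternatively, the processor 302, when executing the computer program 303, implements the functions of the modules in the above device embodiment, such as modules 201-206 in fig. 2:

a first obtaining module 201, configured to obtain a target webpage;

a conversion module 202, configured to convert the target webpage into a target tree;

a second obtaining module 203, configured to obtain text content from the target tree;

a word segmentation module 204, configured to perform word segmentation on the text content to obtain a plurality of character strings;

a determining module 205, configured to determine feature information of each character string, where the feature information includes representation information, location information, and tree structure information;

a combining module 206, configured to combine the feature information of the plurality of character strings to obtain a feature representation of the target webpage.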

Illustratively, the computer program 303 may be partitioned into one or more modules that are stored in the memory 301 and executed by the processor 302 to perform the present method. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 303 in the computer device 30. For example, the computer program 303 may be divided into a first obtaining module 201, a converting module 202, a second obtaining module 203, a word segmentation module 204, a determining module 205, and a combination module 206 in fig. 2, and specific functions of each module are described in embodiment two.

Those skilled in the art will appreciate that fig. 3 is merely an example of the computer device 30 and does not constitute a limitation of the computer device 30, which may include more or fewer components than those shown, combine certain components, or use different components; for example, the computer device 30 may also include input and output devices, network access devices, buses, etc.

The Processor 302 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor 302 may be any conventional processor or the like, the processor 302 being the control center for the computer device 30 and connecting the various parts of the overall computer device 30 using various interfaces and lines.

The memory 301 may be used to store the computer program 303, and the processor 302 may implement various functions of the computer device 30 by running or executing the computer program or module stored in the memory 301 and calling data stored in the memory 301. The memory 301 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the computer device 30, and the like. Further, the memory 301 may include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.

The modules integrated by the computer device 30 may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).

In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware form, and can also be realized in a form of hardware and a software functional module.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the method for extracting web page information according to various embodiments of the present invention.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. Furthermore, it is to be understood that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. A plurality of modules or means recited in the system claims may also be implemented by one module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A webpage information extraction method is characterized by comprising the following steps:

acquiring a target webpage;

converting the target webpage into a target tree;

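acquiring text content from the target tree;

performing word segmentation on the text content to obtain a plurality of character strings;

determining characteristic information of each character string, wherein the characteristic information comprises representation information, position information and tree structure information;

and combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.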

2. The web page information extraction method according to claim 1, wherein the acquiring the target web page includes:

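acquiring a webpage address of the target webpage;

and downloading the HTML text corresponding to the webpage address by utilizing a webpage crawling tool.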

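3. The web page information extraction method of claim 1, wherein the converting the target web page into a target tree comprises:

converting the target webpage into the target tree by a depth-first traversal method by utilizing a webpage parsing tool.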

4. The web page information extraction method according to claim 1, wherein, before said converting the target web page into the target tree, the web page information extraction method further comprises:

and carrying out standardization processing on the target webpage.

5. The web page information extraction method according to claim 1, wherein, before the acquiring of the text content from the target tree, the web page information extraction method further comprises:

and carrying out pruning pretreatment on the target tree.

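6. The method for extracting web page information according to claim 1, wherein the text content includes a Chinese text, and the segmenting the text content includes:

performing word segmentation on the Chinese text by combining a natural language processing algorithm; or

training a word segmentation model based on BERT, and segmenting the Chinese text by the trained word segmentation model.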

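7. The web page information extraction method according to claim 1, wherein the determining the feature information of each character string includes:

inquiring the coding information of each character string from a preset coding table, and determining the coding information of each character string as the representation information of the character string;

numbering the character strings in sequence according to the positions of the character strings in the text content, and determining the number of each character string as the position information of each character string;

and acquiring nodes of the character strings in the target tree and paths from the nodes to root nodes of the target tree, and determining the nodes and paths of the character strings as tree structure information of the character strings.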

8. A web page information extraction device, characterized by comprising:

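the first acquisition module is used for acquiring a target webpage;

the conversion module is used for converting the target webpage into a target tree;

the second acquisition module is used for acquiring text contents from the target tree;

the word segmentation module is used for segmenting words of the text content to obtain a plurality of character strings;

the determining module is used for determining characteristic information of each character string, and the characteristic information comprises representation information, position information and tree structure information;

and the combination module is used for combining the characteristic information of the character strings to obtain the characteristic representation of the target webpage.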

9. A computer device comprising a processor for executing a computer program stored in a memory to implement the web page information extraction method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the web page information extraction method according to any one of claims 1 to 7.