CN109033282B - Webpage text extraction method and device based on extraction template - Google Patents

Publication number
CN109033282B
Authority
CN
China
Prior art keywords
webpage
neural network
text
extraction
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810760576.6A
Other languages
Chinese (zh)
Other versions
CN109033282A (en
Inventor
董瑞朝
董新建
李贞�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Peony Information Technology Co ltd
Original Assignee
Shandong Peony Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Peony Information Technology Co ltd
Priority to CN201810760576.6A
Publication of CN109033282A
Application granted
Publication of CN109033282B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention provides a webpage text extraction method and device based on an extraction template. The method comprises the following steps: acquiring webpage information of a webpage from which text information is to be extracted, together with an IP address and webpage content of the webpage; if the extraction mode is determined to be template extraction, acquiring a target extraction template corresponding to the webpage information, wherein the target extraction template comprises at least one piece of segment start information and at least one piece of segment end information; segmenting the webpage according to the segment start information and the segment end information to obtain one or more webpage segments; extracting fields from each webpage segment in turn to obtain a plurality of fields corresponding to each webpage segment; and performing dictionary mapping on the fields by using a dictionary in a database to obtain the dictionary fields corresponding to the fields, and storing the fields in the data tables corresponding to the dictionary fields, thereby extracting the text in the webpage. The device is used for executing the method. The invention can conveniently and quickly acquire the text information in the webpage.

Description

Webpage text extraction method and device based on extraction template
Technical Field
The invention relates to the technical field of computers, in particular to a webpage text extraction method and device based on an extraction template.
Background
With the development of the era, the world wide web has become an important source of information for people. Users usually view web pages directly by using browsers, and in addition, many information processing tasks based on the internet (such as information search, data mining, machine translation, etc.) are also performed by using information contents of web pages as basic data. However, text information of web pages on the internet is often surrounded by "web page noise" such as advertisement links, navigation bars, copyright information, and the like. How to accurately and efficiently extract the text information of the webpage becomes an important subject of the current network information extraction and application, and has high application value and practical significance.
At present, methods for extracting the text of a web page can be mainly classified into methods based on statistics, DOM structures, blocks and the like.
The statistical-based webpage text extraction method extracts the webpage text by searching the node containing the maximum number of Chinese characters. The Web extraction technology based on DOM is to extract some meaningful specific tags in a webpage, express an HTML document into a DOM tree structure, and extract effective node data in the tree according to the specific tags. The method based on webpage blocking is to divide the Web page presented to the user into several semantic blocks, and analyze the importance degree of each block on the page to find out the text content of the webpage.
With the rise of the Internet, HTML-based content is growing exponentially. Although the above extraction methods meet the browsing needs of everyday users, they have difficulty extracting webpage text quickly and efficiently.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a method and an apparatus for extracting a web page text based on an extraction template, so as to solve the above technical problem.
In a first aspect, an embodiment of the present invention provides a method for extracting a web page text based on an extraction template, including:
acquiring webpage information of a webpage of which text information is to be extracted, and an IP address and webpage content of the webpage;
if the extraction mode is judged to be template extraction, acquiring a target extraction template corresponding to the webpage information, wherein the target extraction template comprises at least one piece of segment start information and at least one piece of segment end information;
performing segmentation processing on the webpage according to the segment start information and the segment end information to obtain one or more webpage segments;
sequentially extracting fields of each webpage segment to obtain a plurality of fields corresponding to each webpage segment;
and performing dictionary mapping on the fields by using a dictionary in a database to obtain dictionary fields corresponding to the fields in the dictionary, and storing the fields into a data table corresponding to the dictionary fields to realize extraction of texts in the webpage.
Further, the method further comprises:
a plurality of extraction templates are stored in a database in advance, and each type of webpage corresponds to one extraction template.
Further, the method further comprises:
performing emotion recognition on the text in the webpage by using an emotion classification model to obtain an emotion type; the emotion categories include positive, neutral, and negative.
Further, after obtaining the target extraction template corresponding to the web page information, the method further includes:
and analyzing the target extraction template to obtain a corresponding XML file.
Further, the method further comprises:
and if the extraction mode is judged to be automatic text extraction, matching the IP address with the IP address in the blacklist in the database, and if the matching fails, extracting the text of the webpage according to a link ratio method and/or a density-based method.
Further, the link ratio method includes:
carrying out code identification and conversion, tag case consistency, de-noising tag and tag arrangement on the webpage;
if it is determined that nothing lies between a <TR> and its </TR>, deleting the <TR> and the </TR>;
and calculating a content link ratio corresponding to the webpage by taking a table label as a separator, and if the content link ratio is greater than a preset threshold, taking the content with the content link ratio greater than the preset threshold as a text.
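The link ratio steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source does not define the content link ratio, so the sketch assumes it is the visible text length divided by the number of links plus one, and the names `content_link_ratio`, `extract_by_link_ratio` and the default threshold are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")
LINK = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
EMPTY_TR = re.compile(r"<tr\b[^>]*>\s*</tr>", re.IGNORECASE)
TABLE_SPLIT = re.compile(r"</?table\b[^>]*>", re.IGNORECASE)


def content_link_ratio(block):
    """Assumed definition: visible text length / (number of links + 1)."""
    links = len(LINK.findall(block))
    text = TAG.sub("", block).strip()
    return len(text) / (links + 1)


def extract_by_link_ratio(html, threshold=20.0):
    """Return the text of every table-delimited block whose ratio exceeds the threshold."""
    html = EMPTY_TR.sub("", html)      # delete empty <TR></TR> pairs
    blocks = TABLE_SPLIT.split(html)   # table tags act as separators
    body = []
    for block in blocks:
        if content_link_ratio(block) > threshold:
            text = re.sub(r"\s+", " ", TAG.sub("", block)).strip()
            body.append(text)
    return body
```

Blocks that consist mostly of links, such as navigation bars, score low and are discarded, while long runs of plain text score high. The encoding conversion and tag normalization steps are omitted here.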
Further, the density-based method includes:
performing label removal processing on the webpage content;
dividing the webpage content according to rows, and counting the row length of the webpage content;
taking the line number of the first line whose length exceeds a second preset threshold as the sudden rising point, and taking a line number after the sudden rising point as the sudden falling point;
and taking the content between the sudden rising point and the sudden falling point as the text in the webpage.
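A minimal sketch of the density-based steps above, under stated assumptions: the source only says the sudden falling point is a line number after the sudden rising point, so this sketch assumes it is the first later line whose length falls back to or below the threshold. The function name and the default threshold are hypothetical.

```python
import re


def extract_by_density(html, threshold=40):
    """Return the lines between the sudden rising point and the sudden falling point."""
    # Tag removal: drop script/style blocks first, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"<[^>]+>", "", text)

    # Split by line and count each line's length.
    lines = [line.strip() for line in text.splitlines()]
    lengths = [len(line) for line in lines]

    # Sudden rising point: first line longer than the threshold.
    start = next((i for i, n in enumerate(lengths) if n > threshold), None)
    if start is None:
        return ""

    # Sudden falling point (assumed rule): first later line at or below the threshold.
    end = next(
        (i for i in range(start + 1, len(lines)) if lengths[i] <= threshold),
        len(lines),
    )
    return "\n".join(lines[start:end])
```

With this rule a short line inside the article ends the body early, so a practical version would probably smooth the line lengths over a small window before looking for the two points.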
Further, the method further comprises:
and constructing an emotion classification model in advance, and training the emotion classification model.
Further, the pre-constructing an emotion classification model and training the emotion classification model includes:
respectively constructing a first neural network, a second neural network and a third neural network, wherein the first neural network is used for identifying positive emotion, the second neural network is used for identifying neutral emotion, and the third neural network is used for identifying negative emotion;
training the first neural network, the second neural network and the third neural network respectively by utilizing a training data set to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network and a third output result corresponding to the third neural network;
calculating a first loss function between a first output result and a tag result, a second loss function between the second output result and the tag result, and a third loss function between the third output result and the tag result;
optimizing parameters in the first neural network by using the first loss function, optimizing parameters in the second neural network by using the second loss function, and optimizing parameters in the third neural network by using the third loss function;
and the optimized first neural network, the optimized second neural network and the optimized third neural network form the emotion classification model.
Further, the calculating a first loss function between the first output result and the tag result, a second loss function between the second output result and the tag result, and a third loss function between the third output result and the tag result includes:
acquiring a first output matrix corresponding to the first output result, a second output matrix corresponding to the second output result, a third output matrix corresponding to the third output result and a label matrix corresponding to the label result;
calculating a first Euclidean distance between the first output matrix and the label matrix according to a Euclidean distance calculation formula, and obtaining the first loss function according to the first Euclidean distance;
calculating a second Euclidean distance between the second output matrix and the label matrix according to the Euclidean distance calculation formula, and obtaining a second loss function according to the second Euclidean distance;
and calculating a third Euclidean distance between the third output matrix and the label matrix according to the Euclidean distance calculation formula, and obtaining a third loss function according to the third Euclidean distance.
Further, the euclidean distance calculation formula includes:
according to
Figure BDA0001727227360000041
calculating the Euclidean distance between two row vectors in the first matrix to obtain a first intermediate matrix
Figure BDA0001727227360000042
wherein da_kj is the Euclidean distance between the k-th row vector and the j-th row vector in the first matrix, a_k is the element values of the k-th row in the first matrix, and a_j is the element values of the j-th row in the first matrix;
according to
Figure BDA0001727227360000043
calculating the Euclidean distance between two row vectors in the second matrix to obtain a second intermediate matrix
Figure BDA0001727227360000044
wherein db_kj is the Euclidean distance between the k-th row vector and the j-th row vector in the second matrix, b_k is the element values of the k-th row in the second matrix, and b_j is the element values of the j-th row in the second matrix;
according to
Figure BDA0001727227360000045
calculating the intermediate Euclidean distance between the second intermediate matrix and the first intermediate matrix
Figure BDA0001727227360000051
wherein the first matrix is the first output matrix, the second output matrix, or the third output matrix, and the second matrix is the label matrix.
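The formula images are not reproduced in this text. One plausible reading that is consistent with the surrounding definitions is given below. It is an assumption, not the published formulas: the index i over the elements of a row, the matrix symbols D_A and D_B, and the symbol d for the intermediate Euclidean distance are introduced here for illustration only.

```latex
% Assumed reconstruction from the textual definitions; i, D_A, D_B and d are illustrative symbols.
\begin{align*}
da_{kj} &= \lVert a_k - a_j \rVert_2 = \sqrt{\sum_{i} (a_{ki} - a_{ji})^2}, & D_A &= [\, da_{kj} \,] \\
db_{kj} &= \lVert b_k - b_j \rVert_2 = \sqrt{\sum_{i} (b_{ki} - b_{ji})^2}, & D_B &= [\, db_{kj} \,] \\
d(D_B, D_A) &= \sqrt{\sum_{k} \sum_{j} (db_{kj} - da_{kj})^2}
\end{align*}
```

Under this reading the loss compares the pairwise-distance structure of an output matrix with that of the label matrix, rather than comparing the two matrices element by element.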
Further, the method further comprises:
acquiring a training data set in advance, wherein the training data set comprises a plurality of words; performing emotion labeling on the words; and constructing a label matrix according to the labeled words.
Further, before storing the field in the data table corresponding to the dictionary field, the method further includes:
and acquiring the submission time, the author and the source corresponding to the webpage.
Further, the method further comprises:
and configuring the IP address, the port number, the table name and the user name of the database.
Further, before dictionary mapping the field with a dictionary in a database, the method further includes:
and carrying out formatting processing on the field.
In a second aspect, an embodiment of the present invention provides an extraction apparatus for web page text based on an extraction template, including:
the acquisition module is used for acquiring webpage information of a webpage of which text information is to be extracted, and an IP address and webpage content of the webpage;
the template acquisition module is used for acquiring a target extraction template corresponding to the webpage information if the extraction mode is judged to be template extraction, wherein the target extraction template comprises at least one piece of segment start information and at least one piece of segment end information;
the webpage segmentation module is used for segmenting the webpage according to the segment start information and the segment end information to obtain one or more webpage segments;
the field extraction module is used for sequentially extracting fields of the webpage segments to obtain a plurality of fields corresponding to each webpage segment;
and the dictionary mapping module is used for performing dictionary mapping on the fields by utilizing a dictionary in a database to obtain dictionary fields corresponding to the fields in the dictionary and storing the fields into a data table corresponding to the dictionary fields so as to realize extraction of texts in the webpage.
Further, the apparatus further comprises:
the template storage module is used for storing a plurality of extraction templates in a database in advance, and each type of webpage corresponds to one extraction template.
Further, the apparatus further comprises:
the identification module is used for carrying out emotion identification on the text in the webpage by using the emotion classification model to obtain an emotion category; the emotion categories include positive, neutral, and negative.
Further, the apparatus further comprises:
and the analysis module is used for analyzing the target extraction template to obtain a corresponding XML file.
Further, the apparatus further comprises:
and the text extraction module is used for matching the IP address with the IP address in a blacklist in the database if the extraction mode is judged to be automatic text extraction, and extracting the text of the webpage according to a link ratio method and/or a density-based method if the matching fails.
Further, the link ratio method includes:
carrying out code identification and conversion, tag case consistency, de-noising tag and tag arrangement on the webpage;
if it is determined that nothing lies between a <TR> and its </TR>, deleting the <TR> and the </TR>;
and calculating a content link ratio corresponding to the webpage by taking a table label as a separator, and if the content link ratio is greater than a preset threshold, taking the content with the content link ratio greater than the preset threshold as a text.
Further, the density-based method includes:
performing label removal processing on the webpage content;
dividing the webpage content according to rows, and counting the row length of the webpage content;
taking the line number of the first line whose length exceeds a second preset threshold as the sudden rising point, and taking a line number after the sudden rising point as the sudden falling point;
and taking the content between the sudden rising point and the sudden falling point as the text in the webpage.
Further, the apparatus further comprises:
and the model training module is used for constructing an emotion classification model in advance and training the emotion classification model.
Further, the model training module is specifically configured to:
respectively constructing a first neural network, a second neural network and a third neural network, wherein the first neural network is used for identifying positive emotion, the second neural network is used for identifying neutral emotion, and the third neural network is used for identifying negative emotion;
training the first neural network, the second neural network and the third neural network respectively by utilizing a training data set to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network and a third output result corresponding to the third neural network;
calculating a first loss function between a first output result and a tag result, a second loss function between the second output result and the tag result, and a third loss function between the third output result and the tag result;
optimizing parameters in the first neural network by using the first loss function, optimizing parameters in the second neural network by using the second loss function, and optimizing parameters in the third neural network by using the third loss function;
and the optimized first neural network, the optimized second neural network and the optimized third neural network form the emotion classification model.
Further, the model training module is specifically configured to:
acquiring a first output matrix corresponding to the first output result, a second output matrix corresponding to the second output result, a third output matrix corresponding to the third output result and a label matrix corresponding to the label result;
calculating a first Euclidean distance between the first output matrix and the label matrix according to a Euclidean distance calculation formula, and obtaining the first loss function according to the first Euclidean distance;
calculating a second Euclidean distance between the second output matrix and the label matrix according to the Euclidean distance calculation formula, and obtaining a second loss function according to the second Euclidean distance;
and calculating a third Euclidean distance between the third output matrix and the label matrix according to the Euclidean distance calculation formula, and obtaining a third loss function according to the third Euclidean distance.
Further, the euclidean distance calculation formula includes:
according to
Figure BDA0001727227360000071
calculating the Euclidean distance between two row vectors in the first matrix to obtain a first intermediate matrix
Figure BDA0001727227360000072
wherein da_kj is the Euclidean distance between the k-th row vector and the j-th row vector in the first matrix, a_k is the element values of the k-th row in the first matrix, and a_j is the element values of the j-th row in the first matrix;
according to
Figure BDA0001727227360000081
calculating the Euclidean distance between two row vectors in the second matrix to obtain a second intermediate matrix
Figure BDA0001727227360000082
wherein db_kj is the Euclidean distance between the k-th row vector and the j-th row vector in the second matrix, b_k is the element values of the k-th row in the second matrix, and b_j is the element values of the j-th row in the second matrix;
according to
Figure BDA0001727227360000083
calculating the intermediate Euclidean distance between the second intermediate matrix and the first intermediate matrix
Figure BDA0001727227360000084
wherein the first matrix is the first output matrix, the second output matrix, or the third output matrix, and the second matrix is the label matrix.
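The computation described above can be sketched in plain Python. This is an illustration under assumptions: the formula images are not reproduced in this text, so the final step is assumed to be the square root of the summed squared differences between the two intermediate matrices, and the function names are hypothetical. Both input matrices are assumed to have the same number of rows.

```python
import math


def pairwise_distances(matrix):
    """Intermediate matrix: entry [k][j] is the Euclidean distance between rows k and j."""
    return [[math.dist(row_k, row_j) for row_j in matrix] for row_k in matrix]


def distance_loss(output_matrix, label_matrix):
    """Assumed loss: Euclidean distance between the two intermediate matrices."""
    da = pairwise_distances(output_matrix)  # first intermediate matrix
    db = pairwise_distances(label_matrix)   # second intermediate matrix
    squared = sum(
        (x - y) ** 2
        for row_a, row_b in zip(da, db)
        for x, y in zip(row_a, row_b)
    )
    return math.sqrt(squared)
```

The same function would be called once per network, with the first, second or third output matrix as `output_matrix` and the label matrix as `label_matrix`.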
Further, the apparatus further comprises:
the training data set acquisition module is used for acquiring a training data set in advance, wherein the training data set comprises a plurality of words, performing emotion marking on the words, and constructing a label matrix according to the marked words.
Further, the apparatus further comprises:
and the information acquisition module is used for acquiring the submission time, the author and the source corresponding to the webpage.
Further, the apparatus further comprises:
and the system configuration module is used for configuring the IP address, the port number, the table name and the user name of the database.
Further, the apparatus further comprises:
and the formatting module is used for carrying out formatting processing on the field.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method steps of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, including:
the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform the method steps of the first aspect.
According to the embodiment of the invention, the corresponding target extraction template is obtained according to the webpage information, webpage segmentation and field extraction are performed according to the target extraction template, and the extracted fields are mapped by using the dictionary and then stored in the corresponding database, so that the text information in the webpage can be obtained conveniently and quickly through the corresponding template.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flow chart of a webpage text extraction method based on an extraction template according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of a method for extracting a web page text according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a web page text extraction device based on an extraction template according to an embodiment of the present invention;
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Fig. 1 is a schematic flow chart of a webpage text extraction method based on an extraction template according to an embodiment of the present invention, as shown in fig. 1, the method includes:
Step 101: acquiring webpage information of a webpage of which text information is to be extracted, and an IP address and webpage content of the webpage;
In a specific implementation process, when text information in a webpage needs to be extracted, the webpage information corresponding to the webpage is acquired. The webpage information comprises an IP address, and the webpage content corresponding to the IP address can be acquired by resolving the IP address. It should be noted that the webpage content includes many tags, and not all of the tagged content belongs to the body text of the webpage.
Step 102: if the extraction mode is judged to be template extraction, acquiring a target extraction template corresponding to the webpage information, wherein the target extraction template comprises at least one piece of segment start information and at least one piece of segment end information;
In a specific implementation process, a user may select an extraction mode. If the device receives template extraction as the extraction mode selected by the user, the device acquires the corresponding target extraction template according to the webpage information, where the target extraction template includes at least one piece of segment start information and at least one piece of segment end information. It should be noted that each piece of segment start information corresponds to one piece of segment end information, and the webpage can be segmented by means of these pairs. In addition, different webpages may have different page styles or different page source information, so different extraction templates need to be configured for different pages. If text needs to be extracted from a new webpage, the corresponding parameters are configured directly within the extraction template framework, without rewriting code.
Part of the code of the extraction template is as follows.
Extraction template definition, with the four-layer structure of the wrapper:
wrappers
|__global
|  |__type [type: list, detail, image]
|  |__charset [auto, gbk, utf-8]
|  |__URL [regular URL]
|__records [type: news, list, bidding, blog, bbs]
   |__segment
   |  |__prefix [segment start]
   |  |__suffix [end of segment]
   |  |__token [grouping of segments of list]
   |  |__field [datatype: int datatime string]
   |  |  |__prefix [field start]
   |  |  |__suffix [end of field]
   |  |  |__format [field extraction formatting, default irregular]
   |  |  |__format [formatting order after field extraction]
   |  |  |__fact [dictionary mapping]
   |  |__field
   |  |  |__prefix
   |  |  |__suffix
   |  |  |__format
   |  |  |__format
   |  |__...
   |__segment
Explanation:
Usage scenarios for combining the type of the wrapper with the segments:
If the type of global is list, only the first segment is read, and that segment must contain a token to perform the partition. This is used to extract the entries of a list page.
If the type of global is detail, subsequent segments may contain a token, and multiple tokens are supported for a segment that has a token.
vector<map<string,string>> intern_fields.
Step 103: performing segmentation processing on the webpage according to the segment start information and the segment end information to obtain one or more webpage segments;
In a specific implementation process, the webpage is segmented according to the segment start information and the segment end information of the target extraction template. Each pair of segment start information and segment end information yields one webpage segment, so one or more webpage segments are obtained.
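The segmentation step can be sketched as a plain string search. This is a minimal illustration assuming the segment start and end information are literal marker strings and that each pair yields the first matching region; the function name `split_segments` and the sample markers are hypothetical.

```python
def split_segments(html, markers):
    """Return one webpage segment per (start, end) marker pair found in the page."""
    segments = []
    for prefix, suffix in markers:
        start = html.find(prefix)
        if start == -1:
            continue                      # start marker not present on this page
        start += len(prefix)
        end = html.find(suffix, start)
        if end == -1:
            continue                      # no matching end marker after the start
        segments.append(html[start:end])
    return segments
```

For example, a template with one pair for the headline block and one pair for the article block would return two segments for a page that contains both.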
Step 104: sequentially extracting fields of each webpage segment to obtain a plurality of fields corresponding to each webpage segment;
In a specific implementation process, after the webpage is segmented into webpage segments, each webpage segment comprises a plurality of fields, and field extraction is performed on each webpage segment in turn, so that a plurality of fields are extracted from each webpage segment. Since the webpage segments are all in XML format, the fields are also in XML format, and the fields therefore need to be formatted to obtain the fields of the irregular expression.
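Field extraction and formatting can be sketched in the same style. The sketch assumes each field is delimited by a literal start and end string, as in the template structure above, and that formatting means removing leftover markup and collapsing whitespace; both the formatting rule and the names used are assumptions.

```python
import re


def extract_fields(segment, field_specs):
    """Extract each (name, start, end) field from a segment and format its value."""
    fields = {}
    for name, prefix, suffix in field_specs:
        start = segment.find(prefix)
        if start == -1:
            continue
        start += len(prefix)
        end = segment.find(suffix, start)
        if end == -1:
            continue
        raw = segment[start:end]
        # Assumed formatting: drop remaining tags, then normalize whitespace.
        plain = re.sub(r"<[^>]+>", "", raw)
        fields[name] = re.sub(r"\s+", " ", plain).strip()
    return fields
```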
Step 105: and performing dictionary mapping on the fields by using a dictionary in a database to obtain dictionary fields corresponding to the fields in the dictionary, and storing the fields into a data table corresponding to the dictionary fields to realize extraction of texts in the webpage.
In a specific implementation process, a dictionary is established in a database in advance. The dictionary comprises a plurality of tables, and each table comprises a plurality of pieces of field information. The dictionary in the database is mapped against each field extracted from the webpage; if a field from the webpage is successfully matched with a field in the dictionary, the field from the webpage is stored under the matched dictionary field, and the next field from the webpage is then processed in the same way until all fields in all webpage segments have been processed, which completes the extraction of the text in the webpage.
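A minimal sketch of the dictionary mapping step, using in-memory dictionaries in place of the database. The dictionary layout (extracted field name mapped to a table and column), the table and column names, and the function name are all hypothetical.

```python
def map_and_store(fields, dictionary, tables):
    """Store each extracted field under its matched dictionary field; return unmatched names."""
    unmatched = []
    for name, value in fields.items():
        target = dictionary.get(name)
        if target is None:
            unmatched.append(name)        # no dictionary field matches this field
            continue
        table, column = target
        tables.setdefault(table, {})[column] = value
    return unmatched


# Hypothetical dictionary: extracted field name -> (table, column).
DICTIONARY = {
    "title": ("news", "title"),
    "time": ("news", "publish_time"),
}
```

In a real deployment the `tables` argument would be replaced by inserts into the configured database tables.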
According to the embodiment of the invention, the corresponding target extraction template is obtained according to the webpage information, webpage segmentation and field extraction are performed according to the target extraction template, and the extracted fields are mapped by using the dictionary and then stored in the corresponding database, so that the text information in the webpage can be obtained conveniently and quickly through the corresponding template.
On the basis of the above embodiment, the method further includes:
performing emotion recognition on the text in the webpage by using an emotion classification model to obtain an emotion type; the emotion categories include positive, neutral, and negative.
In a specific implementation process, after a field of a text in a webpage is acquired, emotion recognition is performed on the field in the webpage, the extracted field of the webpage is input into an emotion classification model, and the emotion classification model outputs emotion expressed by the text in the webpage. Wherein the emotion categories include positive, neutral, and negative.
According to the method and the device, the emotion recognition is carried out on the text in the webpage, whether the text in the webpage is positive, neutral or negative is judged, a preliminary judgment is provided for a user, and the user experience is improved.
On the basis of the above embodiment, after the target extraction template corresponding to the web page information is obtained, the method further includes:
parsing the target extraction template to obtain a corresponding XML file.
In a specific implementation process, after the target extraction template is obtained, it is parsed as XML to obtain the XML file corresponding to the target extraction template, and text extraction is then performed on the corresponding webpage using the target extraction template in XML format.
On the basis of the above embodiment, the method further includes:
if the extraction mode is judged to be automatic text extraction, matching the IP address with the IP addresses in the blacklist in the database, and if the matching fails, extracting the text of the webpage according to a link ratio method and/or a density-based method.
In a specific implementation process, if the user selects the automatic text extraction mode, the device, upon receiving this selection, matches the IP address of the webpage against the IP addresses in the blacklist in the database to judge whether the IP address of the webpage is in the blacklist. If it is, no text extraction is performed on the webpage; if it is not, the text of the webpage is extracted by the link ratio method and/or the density-based method. It should be noted that the user may be prompted to select the extraction mode through a pop-up box or in other forms, which is not specifically limited in the embodiment of the present invention. It should also be noted that operations such as denoising and common attribute extraction need to be performed on the webpage before its text is extracted by the link ratio method and/or the density-based method.
The embodiment of the invention can extract a webpage not only by means of an extraction template but also by automatic text extraction, which gives the user more choices. When a new webpage appears for which no extraction template exists and the user has not created one, the text can still be extracted in the automatic text extraction mode.
On the basis of the above embodiment, the link ratio method includes:
performing encoding identification and conversion, tag case normalization, noise tag removal and tag tidying on the webpage;
if it is determined that there is nothing between <TR> and </TR>, deleting the <TR> and the </TR>;
calculating the content link ratio of the webpage with table tags as separators, and taking the content whose content link ratio is greater than a preset threshold as the text.
In a specific implementation process, the link ratio extraction method mainly comprises three stages: preprocessing, deletion of useless links, and text recognition based on the link ratio. The preprocessing comprises encoding identification and conversion of the webpage, tag case normalization, noise tag removal and tag tidying. Noise tags include, for example, script, style, xml, form, select and css; tags to be tidied include, for example, <A …, <P … and <TABLE …. Text recognition based on the link ratio works as follows: the link ratio of the webpage content is calculated with table tags as separators, and content whose link ratio is greater than 20 is identified as text.
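A minimal Python sketch of the link-ratio stage follows. The embodiment does not define how the link ratio is computed, so the sketch assumes it is the number of visible characters per link within a block; that definition, the regular expressions and the function names are assumptions made only for illustration. The threshold of 20 is the value mentioned above.

```python
import re

TAG = re.compile(r"<[^>]+>")
TABLE_SPLIT = re.compile(r"</?table[^>]*>", re.IGNORECASE)
LINK = re.compile(r"<a\b", re.IGNORECASE)
EMPTY_ROW = re.compile(r"<tr[^>]*>\s*</tr>", re.IGNORECASE)

def link_ratio(block):
    """Assumed definition: visible characters per link in the block."""
    text = TAG.sub("", block).strip()
    links = len(LINK.findall(block))
    return len(text) / max(links, 1)

def extract_by_link_ratio(html, threshold=20):
    html = EMPTY_ROW.sub("", html)      # delete empty <TR></TR> pairs
    blocks = TABLE_SPLIT.split(html)    # table tags act as separators
    kept = [TAG.sub("", b).strip() for b in blocks if link_ratio(b) > threshold]
    return [text for text in kept if text]

page = (
    '<table><tr><td><a href="/a">Home</a> <a href="/b">News</a></td></tr>'
    "<tr></tr></table>"
    "<table><tr><td>This paragraph is the body of the article and is fairly long."
    "</td></tr></table>"
)
body = extract_by_link_ratio(page)
```

In this example the navigation block has few characters per link and is discarded, while the article block is kept.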
On the basis of the above embodiment, the text area of most webpages is the area in which text information is most densely distributed, although it is not necessarily the largest area; for example, a webpage may carry a large amount of comment information while its text is short. Using the line block length handles such cases effectively. Therefore, the text can be extracted by a density-based method, which comprises the following specific steps:
the original HTML document of the webpage contains a large number of tags, so tag removal is performed on the webpage content;
dividing the webpage content line by line, and counting the length of each line of the webpage content;
taking the line number of the first line whose length exceeds a second preset threshold as a sudden rising point, and taking a line number after the sudden rising point as a sudden falling point;
and taking the content between the sudden rising point and the sudden falling point as the text in the webpage.
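The density-based steps above can be sketched in Python as follows. The embodiment does not state how the sudden falling point is chosen, so the sketch assumes it is the first empty line after the sudden rising point; that rule, the threshold value of 30 and the function name are assumptions for illustration only. Tags that span several lines are not handled.

```python
import re

TAG = re.compile(r"<[^>]+>")

def extract_by_density(html, threshold=30):
    # Remove tags line by line so that line lengths can still be counted.
    lines = [TAG.sub("", line).strip() for line in html.splitlines()]
    lengths = [len(line) for line in lines]
    # Sudden rising point: first line whose length exceeds the threshold.
    start = next((i for i, n in enumerate(lengths) if n > threshold), None)
    if start is None:
        return ""
    # Assumed sudden falling point: first empty line after the rising point.
    end = next(
        (i for i in range(start + 1, len(lines)) if lengths[i] == 0),
        len(lines),
    )
    return "\n".join(lines[start:end])

page = "\n".join([
    "<div><a href='/'>Home</a></div>",
    "<p>The first sentence of the article body is long enough to pass.</p>",
    "<p>Short follow-up line.</p>",
    "<div></div>",
    "<span>Footer</span>",
])
text = extract_by_density(page)
```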
On the basis of the above embodiment, the method further includes:
constructing an emotion classification model in advance, and training the emotion classification model.
The specific training process is as follows:
First, a first neural network, a second neural network and a third neural network are constructed, wherein the first neural network is used for identifying positive emotion, the second neural network is used for identifying neutral emotion and the third neural network is used for identifying negative emotion. A preset number of words from the training data set are input into each of the first neural network, the second neural network and the third neural network to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network and a third output result corresponding to the third neural network. Matrix transformation is performed on the first output result, the second output result and the third output result to obtain a corresponding first output matrix, second output matrix and third output matrix. The emotion category of the words input into the three neural networks is then labeled and the corresponding label matrix is constructed, after which a first Euclidean distance between the first output matrix and the label matrix, a second Euclidean distance between the second output matrix and the label matrix and a third Euclidean distance between the third output matrix and the label matrix are calculated.
A first loss function is constructed from the first Euclidean distance and used to optimize the parameters of the first neural network; a second loss function is constructed from the second Euclidean distance and used to optimize the parameters of the second neural network; and a third loss function is constructed from the third Euclidean distance and used to optimize the parameters of the third neural network. After the first neural network, the second neural network and the third neural network have been trained, they are combined into the emotion classification model. When the emotion classification model is used, each neural network outputs an emotion classification score for a field in the webpage text, and the emotion category of the whole webpage text is obtained by weighted calculation.
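The final weighted calculation can be illustrated with a short Python sketch. The embodiment does not specify the weights or how the scores are combined, so the sketch assumes one weight per field, a weighted sum per category and selection of the highest total; the weights, scores and the `classify_page` name are hypothetical.

```python
CATEGORIES = ("positive", "neutral", "negative")

def classify_page(field_scores, field_weights):
    """field_scores: one (positive, neutral, negative) score triple per field,
    as output by the three networks. field_weights: one hypothetical
    non-negative weight per field."""
    totals = [0.0, 0.0, 0.0]
    for scores, weight in zip(field_scores, field_weights):
        for i, score in enumerate(scores):
            totals[i] += weight * score
    best = max(range(3), key=lambda i: totals[i])
    return CATEGORIES[best], totals

label, totals = classify_page(
    [(0.7, 0.2, 0.1), (0.1, 0.3, 0.6)],  # e.g. scores for a title and a body
    [2.0, 1.0],                           # assumed weights: title counts double
)
```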
It should be noted that the Euclidean distance is calculated as follows:
according to
Figure BDA0001727227360000151
Calculating Euclidean distance between two row vectors in the first matrix to obtain a first intermediate matrix
Figure BDA0001727227360000152
where $da_{kj}$ is the Euclidean distance between the k-th row vector and the j-th row vector in the first matrix, $a_k$ is the k-th row element value in the first matrix, and $a_j$ is the j-th row element value in the first matrix;
according to
Figure BDA0001727227360000153
Calculating the Euclidean distance between two row vectors in the second matrix to obtain a second intermediate matrix
Figure BDA0001727227360000154
where $db_{kj}$ is the Euclidean distance between the k-th row vector and the j-th row vector in the second matrix, $b_k$ is the k-th row element value in the second matrix, and $b_j$ is the j-th row element value in the second matrix;
according to
Figure BDA0001727227360000155
Calculating the intermediate Euclidean distance between the second intermediate matrix and the first intermediate matrix
Figure BDA0001727227360000161
Wherein the first matrix comprises the first output matrix, the second output matrix, or the third output matrix; the second matrix is the label matrix.
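The calculation described above can be sketched in plain Python. The pairwise row distances follow directly from the definitions of the intermediate matrices; the final step is an assumption, since the formula for the intermediate Euclidean distance is not reproduced in the text here, and the sketch takes it to be the Euclidean (Frobenius) distance between the two intermediate matrices. The function names are hypothetical, and both matrices are assumed to have the same number of rows.

```python
import math

def pairwise_row_distances(matrix):
    """Intermediate matrix whose (k, j) entry is the Euclidean distance
    between row k and row j of the input matrix."""
    return [[math.dist(row_k, row_j) for row_j in matrix] for row_k in matrix]

def intermediate_distance(output_matrix, label_matrix):
    """Assumed final step: Euclidean distance between the two intermediate
    matrices, taken entry by entry."""
    da = pairwise_row_distances(output_matrix)  # first intermediate matrix
    db = pairwise_row_distances(label_matrix)   # second intermediate matrix
    return math.sqrt(sum(
        (x - y) ** 2
        for row_a, row_b in zip(da, db)
        for x, y in zip(row_a, row_b)
    ))

loss = intermediate_distance([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]])
```

A distance of zero means that the rows of the output matrix are spaced exactly like the rows of the label matrix.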
On the basis of the above embodiment, before storing the field into the data table corresponding to the dictionary field, the method further includes:
and acquiring the submission time, the author and the source corresponding to the webpage.
On the basis of the above embodiment, the method further includes:
and configuring the IP address, the port number, the table name and the user name of the database.
In a specific implementation process, in order to obtain the dictionary from the database and store fields in the database, the IP address, the port number, the table name and the user name of the database need to be configured so that a connection to the database can be established.
For example, the system-level configuration is loaded from a configuration file named config.extract (note: the Windows version uses config.extract.win32):
####extract configure file
[global]
## Whether the first line is a description field; true: yes, false: no
extract_mode=default
[database]
db_host=127.0.0.1
db_user=root
db_pwd=zhihai
db_name=yt
db_port=3306
db_table=yunteng_extract_template
[wrapper]
#path=/usr/local/extract/branch/wrapper
path=./wrapper
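A configuration file of this shape can be read with Python's standard `configparser` module, as in the following sketch. The embodiment does not say which language or parser is used, so this is only one possible way to load the settings; the `load_database_settings` helper and the keys of the returned dict are hypothetical, and the password entry is deliberately not returned.

```python
import configparser

CONFIG_TEXT = """\
####extract configure file
[global]
extract_mode=default
[database]
db_host=127.0.0.1
db_user=root
db_pwd=zhihai
db_name=yt
db_port=3306
db_table=yunteng_extract_template
[wrapper]
#path=/usr/local/extract/branch/wrapper
path=./wrapper
"""

def load_database_settings(text):
    parser = configparser.ConfigParser()
    parser.read_string(text)  # lines starting with '#' are treated as comments
    db = parser["database"]
    return {
        "host": db["db_host"],
        "port": db.getint("db_port"),
        "user": db["db_user"],
        "database": db["db_name"],
        "table": db["db_table"],
        "wrapper_path": parser["wrapper"]["path"],
    }

settings = load_database_settings(CONFIG_TEXT)
```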
Fig. 2 is a schematic flow chart of a method for extracting a web page text according to an embodiment of the present invention, as shown in fig. 2, including:
step 201: initializing system configuration parameters; in order to obtain the dictionary from the database and store the fields in the database, the IP address, the port number, the table name and the user name of the database are configured so that a connection to the database can be established, and then step 202 is executed;
step 202: loading a template; acquiring an extraction template table from the database according to the webpage to be extracted, acquiring the target extraction template corresponding to the webpage to be extracted from the extraction template table, and parsing the target extraction template to obtain a target extraction template in XML format;
step 203: loading a URL blacklist; acquiring a URL blacklist table from the database, the purpose being to match the URL of the webpage to be extracted against the URLs in the URL blacklist table and to judge whether the URL of the webpage to be extracted is in the blacklist;
step 204: inputting a single webpage; inputting a single webpage corresponding to a webpage to be extracted;
step 205: identifying a coding format; identifying the coding format of the single webpage;
step 206: an extraction mode; acquiring a mode for extracting the webpage text, wherein the extraction mode can be automatic text extraction or template extraction, if the extraction mode is the template extraction, executing a step 207, and if the extraction mode is the automatic text extraction, executing a step 214;
step 207: segmenting a webpage; segmenting the webpage according to the segment start information and the segment end information to obtain a plurality of webpage segments;
step 208: judging whether all segments have been read; the webpage segments are processed in sequence, so it is judged whether all webpage segments have been read; if so, step 221 is executed, otherwise step 209 is executed;
step 209: judging whether field extraction is completed; before the fields of a webpage segment are extracted, it is judged whether all fields of that webpage segment have been extracted; if so, step 213 is executed, otherwise step 210 is executed;
step 210: extracting a single field; the fields in the webpage segment are extracted one by one;
step 211: custom denoising; the extracted field is denoised;
step 212: fetching the next field; the next field is fetched;
step 213: fetching the next segment; if it is judged that not all webpage segments in the webpage have been extracted, the next webpage segment is extracted;
step 214: URL blacklist; judging whether the URL of the webpage to be extracted matches a URL in the URL blacklist; if the matching succeeds, the text extraction operation is ended, otherwise step 215 is executed;
step 215: judging the URL; judging whether the URL of the webpage to be extracted is a legal URL or not, if not, ending the text extraction operation, otherwise, executing a step 216;
step 216: denoising the webpage; performing denoising processing on the webpage to be extracted;
step 217: extracting public attributes; extracting public attribute information in the webpage;
step 218: density-based text extraction; judging whether the density-based extraction mode is adopted; if so, extraction is performed and step 220 is executed after the extraction is finished, otherwise step 219 is executed;
step 219: link-ratio-based extraction; if density-based extraction is not selected, extraction is performed in the link-ratio-based mode, which is consistent with the above embodiment and is not described again here; step 220 is executed after the extraction is completed;
step 220: selecting the best title; after the text extraction is completed, the title corresponding to the webpage is acquired;
step 221: automatically acquiring the submission time; the publication time of the text in the webpage is acquired;
step 222: automatically acquiring the author; the author corresponding to the text in the webpage is acquired;
step 223: automatically acquiring the source; the source of the text in the webpage is acquired.
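The branch at step 206 and the segmentation of step 207 can be sketched in Python as follows. This is a simplified illustration, not the flow of fig. 2 in full: the template is assumed to be a dict holding one start marker and one end marker, the automatic branch only performs the blacklist check of step 214 and returns the page as a placeholder for steps 215 to 220, and all names are hypothetical.

```python
def split_segments(page, start_marker, end_marker):
    """Step 207: return the text between each start/end marker pair."""
    segments, pos = [], 0
    while True:
        start = page.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        end = page.find(end_marker, start)
        if end == -1:
            break
        segments.append(page[start:end])
        pos = end + len(end_marker)
    return segments

def extract(page, url, mode, template=None, blacklist=frozenset()):
    """Step 206: dispatch on the extraction mode."""
    if mode == "template":
        return split_segments(page, template["start"], template["end"])
    if url in blacklist:  # step 214: blacklisted pages are skipped
        return []
    return [page]         # placeholder for steps 215 to 220

page = "<ul><li>first</li><li>second</li></ul>"
segments = extract(page, "http://news.example/1", "template",
                   template={"start": "<li>", "end": "</li>"})
skipped = extract(page, "http://blocked.example/1", "automatic",
                  blacklist={"http://blocked.example/1"})
```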
Fig. 3 is a schematic structural diagram of a web page text extraction device based on an extraction template according to an embodiment of the present invention, as shown in fig. 3, the device includes: an acquisition module 301, a template acquisition module 302, a web page segmentation module 303, a field extraction module 304, and a dictionary mapping module 305, wherein,
the acquisition module is used for acquiring webpage information of a webpage from which text information is to be extracted, together with the IP address and webpage content of the webpage; the template acquisition module is used for acquiring a target extraction template corresponding to the webpage information if the extraction mode is judged to be template extraction, wherein the target extraction template comprises at least one piece of segment start information and at least one piece of segment end information; the webpage segmentation module is used for segmenting the webpage according to the segment start information and the segment end information to obtain one or more webpage segments; the field extraction module is used for sequentially extracting fields from the webpage segments to obtain a plurality of fields corresponding to each webpage segment; and the dictionary mapping module is used for performing dictionary mapping on the fields by using a dictionary in a database to obtain the dictionary fields corresponding to the fields, and storing the fields into the data table corresponding to the dictionary fields, thereby extracting the text in the webpage.
On the basis of the above embodiment, the apparatus further includes:
the template storage module is used for storing a plurality of extraction templates in a database in advance, and each type of webpage corresponds to one extraction template.
On the basis of the above embodiment, the apparatus further includes:
the identification module is used for carrying out emotion identification on the text in the webpage by using the emotion classification model to obtain an emotion category; the emotion categories include positive, neutral, and negative.
On the basis of the above embodiment, the apparatus further includes:
and the analysis module is used for analyzing the target extraction template to obtain a corresponding XML file.
On the basis of the above embodiment, the apparatus further includes:
and the text extraction module is used for matching the IP address with the IP address in a blacklist in the database if the extraction mode is judged to be automatic text extraction, and extracting the text of the webpage according to a link ratio method and/or a density-based method if the matching fails.
On the basis of the above embodiment, the link ratio method includes:
performing encoding identification and conversion, tag case normalization, noise tag removal and tag tidying on the webpage;
if it is determined that there is nothing between <TR> and </TR>, deleting the <TR> and the </TR>;
calculating the content link ratio of the webpage with table tags as separators, and taking the content whose content link ratio is greater than a preset threshold as the text.
On the basis of the above embodiment, the density-based method includes:
performing label removal processing on the webpage content;
dividing the webpage content according to rows, and counting the row length of the webpage content;
taking the line number of the first line whose length exceeds a second preset threshold as a sudden rising point, and taking a line number after the sudden rising point as a sudden falling point;
and taking the content between the sudden rising point and the sudden falling point as the text in the webpage.
On the basis of the above embodiment, the apparatus further includes:
and the model training module is used for constructing an emotion classification model in advance and training the emotion classification model.
On the basis of the above embodiment, the model training module is specifically configured to:
respectively constructing a first neural network, a second neural network and a third neural network, wherein the first neural network is used for identifying positive emotion, the second neural network is used for identifying neutral emotion, and the third neural network is used for identifying negative emotion;
training the first neural network, the second neural network and the third neural network respectively by utilizing a training data set to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network and a third output result corresponding to the third neural network;
calculating a first loss function between a first output result and a tag result, a second loss function between the second output result and the tag result, and a third loss function between the third output result and the tag result;
optimizing parameters in the first neural network by using the first loss function, optimizing parameters in the second neural network by using the second loss function, and optimizing parameters in the third neural network by using the third loss function;
and the optimized first neural network, the optimized second neural network and the optimized third neural network form the emotion classification model.
On the basis of the above embodiment, the model training module is specifically configured to:
acquiring a first output matrix corresponding to the first output result, a second output matrix corresponding to the second output result, a third output matrix corresponding to the third output result and a label matrix corresponding to the label result;
calculating a first Euclidean distance between the first output matrix and the label matrix according to a Euclidean distance calculation formula, and obtaining the first loss function according to the first Euclidean distance;
calculating a second Euclidean distance between the second output matrix and the label matrix according to the Euclidean distance calculation formula, and obtaining a second loss function according to the second Euclidean distance;
and calculating a third Euclidean distance between the third output matrix and the label matrix according to the Euclidean distance calculation formula, and obtaining a third loss function according to the third Euclidean distance.
On the basis of the above embodiment, the euclidean distance calculation formula includes:
according to
Figure BDA0001727227360000201
Calculating Euclidean distance between two row vectors in the first matrix to obtain a first intermediate matrix
Figure BDA0001727227360000202
where $da_{kj}$ is the Euclidean distance between the k-th row vector and the j-th row vector in the first matrix, $a_k$ is the k-th row element value in the first matrix, and $a_j$ is the j-th row element value in the first matrix;
according to
Figure BDA0001727227360000203
Calculating the Euclidean distance between two row vectors in the second matrix to obtain a second intermediate matrix
Figure BDA0001727227360000204
where $db_{kj}$ is the Euclidean distance between the k-th row vector and the j-th row vector in the second matrix, $b_k$ is the k-th row element value in the second matrix, and $b_j$ is the j-th row element value in the second matrix;
according to
Figure BDA0001727227360000211
Calculating the intermediate Euclidean distance between the second intermediate matrix and the first intermediate matrix
Figure BDA0001727227360000212
Wherein the first matrix comprises the first output matrix, the second output matrix, or the third output matrix; the second matrix is the label matrix.
On the basis of the above embodiment, the apparatus further includes:
the training data set acquisition module is used for acquiring a training data set in advance, wherein the training data set comprises a plurality of words, performing emotion marking on the words, and constructing a label matrix according to the marked words.
On the basis of the above embodiment, the apparatus further includes:
and the information acquisition module is used for acquiring the submission time, the author and the source corresponding to the webpage.
On the basis of the above embodiment, the apparatus further includes:
and the system configuration module is used for configuring the IP address, the port number, the table name and the user name of the database.
On the basis of the above embodiment, the apparatus further includes:
and the formatting module is used for carrying out formatting processing on the field.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
In summary, in the embodiments of the present invention, the corresponding target extraction template is obtained according to the webpage information, the webpage is segmented and its fields are extracted according to the target extraction template, and the extracted fields are mapped with the dictionary and then stored in the corresponding database. The text information in the webpage is thereby obtained conveniently and quickly through the corresponding template.
Referring to fig. 4, fig. 4 is a block diagram of an electronic device according to an embodiment of the present invention. The electronic device may comprise a web page text extraction means 401, a memory 402, a memory controller 403, a processor 404, a peripheral interface 405, an input output unit 406, an audio unit 407, a display unit 408.
The memory 402, the memory controller 403, the processor 404, the peripheral interface 405, the input/output unit 406, the audio unit 407, and the display unit 408 are electrically connected to each other directly or indirectly, so as to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The webpage text extraction device 401 includes at least one software functional module which can be stored in the memory 402 in the form of software or firmware or solidified in an Operating System (OS) of the webpage text extraction device 401. The processor 404 is configured to execute an executable module stored in the memory 402, such as a software functional module or a computer program included in the webpage text extraction device 401.
The memory 402 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 402 is used for storing a program, and the processor 404 executes the program after receiving an execution instruction; the method executed by the server defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 404, or implemented by the processor 404.
The processor 404 may be an integrated circuit chip having signal processing capabilities. The processor 404 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. It may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor 404 may be any conventional processor or the like.
The peripheral interface 405 couples various input/output devices to the processor 404 and to the memory 402. In some embodiments, the peripheral interface 405, the processor 404, and the memory controller 403 may be implemented in a single chip. In other examples, they may each be implemented by a separate chip.
The input/output unit 406 is used for the user to provide input data so as to realize interaction between the user and the server (or the local terminal). The input/output unit 406 may be, but is not limited to, a mouse, a keyboard, and the like.
Audio unit 407 provides an audio interface to the user, which may include one or more microphones, one or more speakers, and audio circuitry.
The display unit 408 provides an interactive interface (e.g., a user interface) between the electronic device and the user, or displays image data for the user's reference. In this embodiment, the display unit 408 may be a liquid crystal display or a touch display. In the case of a touch display, the display can be a capacitive touch screen or a resistive touch screen that supports single-point and multi-point touch operations. Supporting single-point and multi-point touch operations means that the touch display can sense touch operations from one or more locations on the touch display at the same time, and the sensed touch operations are sent to the processor 404 for calculation and processing.
It will be appreciated that the configuration shown in fig. 4 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 4 or may have a different configuration than shown in fig. 4. The components shown in fig. 4 may be implemented in hardware, software, or a combination thereof.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (6)

1. A webpage text extraction method based on an extraction template is characterized by comprising the following steps:
acquiring webpage information of a webpage of which text information is to be extracted, and an IP address and webpage content of the webpage;
if the extraction mode is judged to be template extraction, acquiring a target extraction template corresponding to the webpage information, wherein the target extraction template comprises at least one piece of segment start information and at least one piece of segment end information;
performing segmentation processing on the webpage according to the segment start information and the segment end information to obtain one or more webpage segments;
sequentially extracting fields of each webpage segment to obtain a plurality of fields corresponding to each webpage segment;
performing dictionary mapping on the fields by using a dictionary in a database to obtain dictionary fields corresponding to the fields in the dictionary, and storing the fields into a data table corresponding to the dictionary fields to realize extraction of texts in the webpage;
the method further comprises the following steps:
if the extraction mode is judged to be automatic text extraction, matching the IP address against the IP addresses in a blacklist in a database, and if the matching fails, extracting the text of the webpage according to a link ratio method or a density-based method; judging whether the density-based extraction mode is adopted, and if so, extracting the text of the webpage by using the density-based method, and if not, extracting the text of the webpage by using the link ratio method;
the density-based method includes:
performing tag removal processing on the webpage content;
dividing the webpage content into lines, and counting the line length of each line of the webpage content;
taking the line number of the first line whose line length exceeds a second preset threshold as a sudden rising point, and taking a line number after the sudden rising point as a sudden falling point;
taking the content between the sudden rising point and the sudden falling point as the text in the webpage;
the link ratio method comprises the following steps:
performing encoding identification and conversion, tag case normalization, noise tag removal and tag tidying on the webpage;
if it is determined that there is nothing between a <TR> and the corresponding </TR>, deleting the <TR> and the </TR>;
calculating a content link ratio corresponding to the webpage by taking a table tag as a separator, and taking the content whose content link ratio is greater than a preset threshold as the text;
the method further comprises the following steps:
performing emotion recognition on the text in the webpage by using an emotion classification model to obtain emotion categories, wherein the emotion categories comprise positive, neutral and negative;
the method further comprises the following steps: constructing the emotion classification model in advance and training the emotion classification model;
the pre-constructing of the emotion classification model and the training of the emotion classification model comprise:
respectively constructing a first neural network, a second neural network and a third neural network, wherein the first neural network is used for identifying positive emotions, the second neural network is used for identifying neutral emotions, and the third neural network is used for identifying negative emotions;
training the first neural network, the second neural network and the third neural network respectively by utilizing a training data set to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network and a third output result corresponding to the third neural network;
calculating a first loss function between the first output result and a label result, a second loss function between the second output result and the label result, and a third loss function between the third output result and the label result;
optimizing parameters in the first neural network by using the first loss function, optimizing parameters in the second neural network by using the second loss function, and optimizing parameters in the third neural network by using the third loss function; the optimized first neural network, the optimized second neural network and the optimized third neural network form the emotion classification model;
the performing emotion recognition on the text in the webpage by using the emotion classification model to obtain the emotion categories comprises the following steps:
respectively outputting an emotion category score for a field in the text of the webpage by using the first neural network, the second neural network and the third neural network in the emotion classification model;
and calculating according to the emotion category scores and the corresponding weights to obtain the emotion categories corresponding to the whole webpage text.
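As an illustration only, and not as part of the claim, the weighted aggregation step can be sketched in Python as follows. The three callables stand in for the trained first, second and third neural networks; the keyword-based stand-ins, the per-field weights and the names `classify_page` and `EXAMPLE_SCORERS` are all assumptions made for this sketch, since the claim does not specify how the weights are chosen or how the scores are combined beyond "calculating according to the emotion category scores and the corresponding weights".

```python
from typing import Callable, Dict, Sequence

# A scorer maps one text field to a score for a single emotion category.
Scorer = Callable[[str], float]


def classify_page(fields: Sequence[str],
                  weights: Sequence[float],
                  scorers: Dict[str, Scorer]) -> str:
    """Return the category with the highest weighted score over all fields."""
    if len(fields) != len(weights):
        raise ValueError("each field needs exactly one weight")
    totals = {label: 0.0 for label in scorers}
    for field, weight in zip(fields, weights):
        for label, scorer in scorers.items():
            # Assumption: scores are combined as a weighted sum per category.
            totals[label] += weight * scorer(field)
    return max(totals, key=totals.get)


# Hypothetical stand-ins for the three trained networks (keyword rules only).
EXAMPLE_SCORERS: Dict[str, Scorer] = {
    "positive": lambda text: 1.0 if "good" in text else 0.0,
    "neutral": lambda text: 0.5,
    "negative": lambda text: 1.0 if "bad" in text else 0.0,
}
```

With fields `["good start", "bad ending", "good overall"]` and weights `[0.5, 0.2, 0.3]`, the weighted totals are 0.8 (positive), 0.5 (neutral) and 0.2 (negative), so the sketch reports the page as positive.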
2. The method of claim 1, further comprising:
storing a plurality of extraction templates in a database in advance, wherein each type of webpage corresponds to one extraction template.
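For illustration only, the template lookup and segmentation steps might be sketched as below. The in-memory `TEMPLATES` dictionary is a hypothetical stand-in for the template table in the database, and the page types, marker strings and the names `ExtractionTemplate` and `segment_page` are assumptions for this sketch rather than anything specified in the claims.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ExtractionTemplate:
    segment_start: str  # segment start information
    segment_end: str    # segment end information


# Hypothetical stand-in for the database: one template per webpage type.
TEMPLATES: Dict[str, ExtractionTemplate] = {
    "news": ExtractionTemplate("<article>", "</article>"),
    "forum": ExtractionTemplate('<div class="post">', "</div>"),
}


def segment_page(html: str, template: ExtractionTemplate) -> List[str]:
    """Cut the page into segments delimited by the template's markers."""
    segments: List[str] = []
    pos = 0
    while True:
        start = html.find(template.segment_start, pos)
        if start == -1:
            break
        start += len(template.segment_start)
        end = html.find(template.segment_end, start)
        if end == -1:
            break  # unmatched start marker: stop rather than guess
        segments.append(html[start:end])
        pos = end + len(template.segment_end)
    return segments
```

Each returned segment would then be passed to field extraction and dictionary mapping, which this sketch does not cover.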
3. The method according to claim 1, wherein after obtaining the target extraction template corresponding to the web page information, the method further comprises:
parsing the target extraction template to obtain a corresponding XML file.
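The patent does not disclose the layout of the XML file, so the following is only a sketch under an assumed schema: a `<template>` root with `<segment>` children holding `<start>` and `<end>` markers, and `<field>` elements mapping an extracted field name to a dictionary field. The element names, attributes and the `load_template` helper are hypothetical; only `xml.etree.ElementTree` from the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical template file; markers are XML-escaped HTML tags.
TEMPLATE_XML = """
<template type="news">
  <segment>
    <start>&lt;article&gt;</start>
    <end>&lt;/article&gt;</end>
  </segment>
  <field name="title" dict="headline"/>
</template>
"""


def load_template(xml_text: str) -> dict:
    """Parse the assumed template schema into plain Python structures."""
    root = ET.fromstring(xml_text.strip())
    # Each <segment> yields one (start marker, end marker) pair.
    segments = [(seg.findtext("start"), seg.findtext("end"))
                for seg in root.iter("segment")]
    # Each <field> maps an extracted field name to a dictionary field.
    fields = {f.get("name"): f.get("dict") for f in root.iter("field")}
    return {"type": root.get("type"), "segments": segments, "fields": fields}
```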
4. A web page text extracting device based on an extracting template is characterized by comprising:
the acquisition module is used for acquiring webpage information of a webpage of which text information is to be extracted, and an IP address and webpage content of the webpage;
the template acquisition module is used for acquiring a target extraction template corresponding to the webpage information if the extraction mode is judged to be template extraction, wherein the target extraction template comprises at least one piece of segment start information and at least one piece of segment end information;
the webpage segmentation module is used for segmenting the webpage according to the segment start information and the segment end information to obtain one or more webpage segments;
the field extraction module is used for sequentially extracting fields of the webpage segments to obtain a plurality of fields corresponding to each webpage segment;
the dictionary mapping module is used for performing dictionary mapping on the fields by utilizing a dictionary in a database to obtain dictionary fields corresponding to the fields in the dictionary and storing the fields into a data table corresponding to the dictionary fields so as to realize extraction of texts in the webpage;
the device further comprises:
the text extraction module is used for matching the IP address against the IP addresses in a blacklist in a database if the extraction mode is judged to be automatic text extraction, and extracting the text of the webpage according to a link ratio method or a density-based method if the matching fails; judging whether the density-based extraction mode is adopted, and if so, extracting the text of the webpage by using the density-based method, and if not, extracting the text of the webpage by using the link ratio method;
the density-based method includes:
performing tag removal processing on the webpage content;
dividing the webpage content into lines, and counting the line length of each line of the webpage content;
taking the line number of the first line whose line length exceeds a second preset threshold as a sudden rising point, and taking a line number after the sudden rising point as a sudden falling point;
taking the content between the sudden rising point and the sudden falling point as the text in the webpage;
the link ratio method comprises the following steps:
performing encoding identification and conversion, tag case normalization, noise tag removal and tag tidying on the webpage;
if it is determined that there is nothing between a <TR> and the corresponding </TR>, deleting the <TR> and the </TR>;
calculating a content link ratio corresponding to the webpage by taking a table tag as a separator, and taking the content whose content link ratio is greater than a preset threshold as the text;
the device further comprises: the identification module, which is used for carrying out emotion recognition on the text in the webpage by using an emotion classification model to obtain emotion categories, wherein the emotion categories comprise positive, neutral and negative;
the device further comprises: the model training module, which is used for constructing the emotion classification model in advance and training the emotion classification model;
the model training module is specifically configured to:
respectively constructing a first neural network, a second neural network and a third neural network, wherein the first neural network is used for identifying positive emotions, the second neural network is used for identifying neutral emotions, and the third neural network is used for identifying negative emotions;
training the first neural network, the second neural network and the third neural network respectively by utilizing a training data set to obtain a first output result corresponding to the first neural network, a second output result corresponding to the second neural network and a third output result corresponding to the third neural network;
calculating a first loss function between the first output result and a label result, a second loss function between the second output result and the label result, and a third loss function between the third output result and the label result;
optimizing parameters in the first neural network by using the first loss function, optimizing parameters in the second neural network by using the second loss function, and optimizing parameters in the third neural network by using the third loss function; the optimized first neural network, the optimized second neural network and the optimized third neural network form the emotion classification model;
the identification module carrying out emotion recognition on the text in the webpage by using the emotion classification model to obtain the emotion categories comprises:
respectively outputting an emotion category score for a field in the text of the webpage by using the first neural network, the second neural network and the third neural network in the emotion classification model;
and calculating according to the emotion category scores and the corresponding weights to obtain the emotion categories corresponding to the whole webpage text.
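For illustration only, the density-based method recited in claims 1 and 4 can be sketched as follows. The claims leave the sudden falling point loosely defined ("a line number after the sudden rising point"), so this sketch assumes it is the first later line whose length drops to or below the same threshold. The regex-based tag stripping, the default threshold of 30 characters and the name `extract_by_density` are likewise assumptions.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def extract_by_density(html: str, threshold: int = 30) -> str:
    """Return the run of long lines that starts at the sudden rising point."""
    # Step 1: remove tags (crude; scripts and styles are not handled here).
    text = TAG_RE.sub("", html)
    # Step 2: split into lines and measure each line's length.
    lines = [line.strip() for line in text.splitlines()]
    lengths = [len(line) for line in lines]
    # Step 3: rising point = first line longer than the threshold.
    rise = next((i for i, n in enumerate(lengths) if n > threshold), None)
    if rise is None:
        return ""
    # Assumption: falling point = first later line at or below the threshold.
    fall = next((i for i in range(rise + 1, len(lines))
                 if lengths[i] <= threshold), len(lines))
    # Step 4: the content between the two points is taken as the text.
    return "\n".join(lines[rise:fall])
```

A single fixed threshold is the simplest reading of the claim; pages whose body text contains short lines (headings, list items) would need a more tolerant rule for the falling point.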
5. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-3.
6. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1-3.
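For illustration only, the link ratio method recited in claims 1 and 4 can be sketched as below. The claims do not define how the content link ratio is computed, so this sketch assumes it is the length of all visible text in a block divided by the length of the text inside links (plus one, to avoid division by zero). The encoding conversion and noise tag removal steps are omitted, and the regular expressions, the default threshold and the function names are assumptions.

```python
import re

EMPTY_TR_RE = re.compile(r"<tr[^>]*>\s*</tr>", re.IGNORECASE)
TABLE_SPLIT_RE = re.compile(r"</?table[^>]*>", re.IGNORECASE)
LINK_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def content_link_ratio(block: str) -> float:
    """Assumed definition: visible text length / (link text length + 1)."""
    link_text = "".join(TAG_RE.sub("", m) for m in LINK_RE.findall(block))
    all_text = TAG_RE.sub("", block)
    return len(all_text.strip()) / (len(link_text.strip()) + 1)


def extract_by_link_ratio(html: str, threshold: float = 5.0) -> str:
    # Delete empty <TR></TR> pairs (matched case-insensitively).
    html = EMPTY_TR_RE.sub("", html)
    # Use table tags as separators between candidate blocks.
    blocks = TABLE_SPLIT_RE.split(html)
    # Keep only blocks whose ratio exceeds the preset threshold.
    kept = [TAG_RE.sub("", block).strip() for block in blocks
            if block.strip() and content_link_ratio(block) > threshold]
    return "\n".join(kept)
```

In this reading, navigation tables made up almost entirely of links score close to 1 and are dropped, while tables dominated by running text score well above the threshold and are kept.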
CN201810760576.6A 2018-07-11 2018-07-11 Webpage text extraction method and device based on extraction template Active CN109033282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810760576.6A CN109033282B (en) 2018-07-11 2018-07-11 Webpage text extraction method and device based on extraction template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810760576.6A CN109033282B (en) 2018-07-11 2018-07-11 Webpage text extraction method and device based on extraction template

Publications (2)

Publication Number Publication Date
CN109033282A CN109033282A (en) 2018-12-18
CN109033282B CN109033282B (en) 2021-07-23

Family

ID=64641837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810760576.6A Active CN109033282B (en) 2018-07-11 2018-07-11 Webpage text extraction method and device based on extraction template

Country Status (1)

Country Link
CN (1) CN109033282B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858963B (en) * 2020-07-28 2024-02-23 中国银行股份有限公司 Webpage customer service knowledge extraction method and device
CN112668316A (en) * 2020-11-17 2021-04-16 国家计算机网络与信息安全管理中心 word document key information extraction method
CN113378088B (en) * 2021-06-24 2024-01-19 中国电子信息产业集团有限公司第六研究所 Webpage text extraction method, device, equipment and storage medium
CN113407599A (en) * 2021-06-30 2021-09-17 上海万物新生环保科技集团有限公司 Text data based standardized processing method and equipment
CN113434748A (en) * 2021-07-19 2021-09-24 湖南四方天箭信息科技有限公司 Template annotation based distributed crawler method and device, computer device and computer readable storage medium
CN114120302B (en) * 2021-11-23 2023-04-21 无锡医迈德科技有限公司 Method for extracting structured information from form image

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254014A (en) * 2011-07-21 2011-11-23 华中科技大学 Adaptive information extraction method for webpage characteristics
CN106802899A (en) * 2015-11-26 2017-06-06 北京搜狗科技发展有限公司 web page text extracting method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102810097B (en) * 2011-06-02 2016-03-02 高德软件有限公司 Webpage text content extracting method and device
CN103491116A (en) * 2012-06-12 2014-01-01 深圳市世纪光速信息技术有限公司 Method and device for processing text-related structural data
CN106446072B (en) * 2016-09-07 2019-10-18 百度在线网络技术(北京)有限公司 The treating method and apparatus of web page contents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
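Research on statistics-based main-text extraction from Chinese web pages (基于统计的中文网页正文抽取的研究); Zhao Wen (赵文) et al.; Computer Knowledge and Technology (《电脑知识与技术》); 2008-05-14 (No. 01); pp. 120-123 *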

Also Published As

Publication number Publication date
CN109033282A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN109033282B (en) Webpage text extraction method and device based on extraction template
WO2019200806A1 (en) Device for generating text classification model, method, and computer readable storage medium
US20150067476A1 (en) Title and body extraction from web page
WO2014153457A1 (en) Merging web page style addresses
CN111079043A (en) Key content positioning method
CN109492177B (en) web page blocking method based on web page semantic structure
CN108536868B (en) Data processing method and device for short text data on social network
CN109165373B (en) Data processing method and device
CN111737623A (en) Webpage information extraction method and related equipment
CN106547895B (en) Webpage information extraction method and device
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN107436931B (en) Webpage text extraction method and device
CN114818680A (en) Method and device for identifying webpage text and related equipment
CN111339457B (en) Method and apparatus for extracting information from web page and storage medium
CN111046627A (en) Chinese character display method and system
CN113419721A (en) Web-based expression editing method, device, equipment and storage medium
CN111061975B (en) Method and device for processing irrelevant content in page
EP2916238A1 (en) Corpus generating device, corpus generating method, and corpus generating program
CN107908792B (en) Information pushing method and device
CN112579937A (en) Character highlight display method and device
CN113806667B (en) Method and system for supporting webpage classification
CN107451215B (en) Feature text extraction method and device
CN111144122A (en) Evaluation processing method, evaluation processing device, computer system, and medium
CN115546815A (en) Table identification method, device, equipment and storage medium
CN115186240A (en) Social network user alignment method, device and medium based on relevance information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant