CN114564638A - News collection and automatic extraction method based on a deep graph neural network

Info

Publication number: CN114564638A
Application number: CN202210109381.1A
Authority: CN
Inventors: 何宇轩; 牟昊; 李旭日; 徐亚波
Original assignee: Guangdong Hengqin Shushushuo Story Information Technology Co ltd
Current assignee: Guangdong Hengqin Shushushuo Story Information Technology Co ltd
Priority date: 2022-01-28
Filing date: 2022-01-28
Publication date: 2022-05-31

Abstract

The invention relates to the technical field of deep learning and discloses a news collection and automatic extraction method based on a deep graph neural network, comprising the following steps: S1, collecting news sites for training; S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model; S3, labeling the news content in the pages behind the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model; S4, labeling the text of the collected news content, and constructing a text classification model; S5, collecting the home page HTML source code of the news site to be analyzed and inputting it into the link extraction model to obtain news links; inputting the news links into the news content extraction model to obtain news content; and finally, inputting the news content into the text classification model and extracting the news information. The invention addresses the problems that the prior art is computationally complex and cannot extract important information from news conveniently and quickly.

Description

News collection and automatic extraction method based on a deep graph neural network

Technical Field

The invention relates to the technical field of deep learning, and in particular to a news collection and automatic extraction method based on a deep graph neural network.

Background

In the big data era, hot news is produced endlessly and without any particular focus. Analyzing and processing a large number of news events involves two technical problems: first, finding links to a large number of news articles; second, performing structured analysis on each article to extract information such as title, content, author, and publication time. Most current techniques focus on extracting the news text, and little research addresses how to obtain a large number of news links. Most news text extraction techniques rely on regular expressions or web page templates. These techniques can do the job, but they have the following disadvantages: constructing the web page templates or regular expressions consumes a great deal of manpower, and whenever a website is redesigned, the templates or regular expressions must be revised again, which costs time and labor. Other techniques extract the news text by computing features such as text density, and have the following disadvantages: first, only the body text can be extracted; second, extraction errors are likely on pages whose news text is short or that carry too much unrelated content.

To address these shortcomings, the prior art discloses a news web page text extraction system and method based on multi-modal machine learning. The method comprises the following steps: extracting different types of features; performing multimodal fusion to obtain a joint representation of the features; and training a web page text classification model. However, this prior art is computationally complex, cannot extract important information from news conveniently and quickly, and does not solve the difficulty of obtaining news links. An automatic news extraction method that can conveniently and quickly acquire a large number of news links from a website and extract the news information is therefore urgently needed in this technical field.

Disclosure of Invention

To solve the problems that the prior art is computationally complex and cannot extract important information from news conveniently and quickly, the invention provides a news collection and automatic extraction method based on a deep graph neural network, which is computationally simple, efficient, and convenient.

To achieve this purpose, the technical scheme is as follows:

A news collection and automatic extraction method based on a deep graph neural network comprises the following specific steps:

S1, collecting news sites for training;

S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model;

S3, labeling the news content in the pages behind the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model;

S4, labeling the text of the collected news content, and constructing a text classification model;

S5, collecting the home page HTML source code of the news site to be analyzed and inputting it into the link extraction model to obtain news links; then inputting the obtained news links into the news content extraction model to obtain news content; and finally, inputting the obtained news content into the text classification model and extracting the news information.
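As a rough illustration of how the three models are chained in step S5, the following minimal Python sketch treats each model as a plain callable. The function name `extract_news`, its parameters, and the `"content"` field name are hypothetical and chosen only for this example; the patent does not prescribe any interface.

```python
def extract_news(home_html, link_model, fetch_html, content_model, text_classifier):
    """Chain the three models of step S5.

    All arguments after home_html are hypothetical callables:
      link_model(html)      -> list of news link URLs
      fetch_html(url)       -> HTML source code of that page
      content_model(html)   -> dict of extracted fields, including "content"
      text_classifier(text) -> "news" or "noise"
    """
    news_items = []
    for link in link_model(home_html):                    # link extraction model
        fields = content_model(fetch_html(link))          # news content extraction model
        if text_classifier(fields["content"]) == "news":  # text classification model
            news_items.append({"url": link, **fields})
    return news_items
```

Passing the models in as callables keeps the sketch independent of how each model is actually trained or served.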

Preferably, step S2 comprises the following specific steps:

S201, taking the HTML tag as the unit, labeling the "section" tags in the site's HTML page, and labeling the "news link" tags within each "section";

S202, constructing a first heterogeneous-graph node classification model from the labeled "news link" tags, and taking the first node classification model as the link extraction model;

S203, training the link extraction model to obtain the trained link extraction model.

Further, step S202 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking the first node classification model as the link extraction model. A graph convolutional network (GCN) may be chosen as the deep graph neural network.
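The graph construction described in step S202 can be sketched with Python's standard `html.parser` module. This is only an illustrative sketch under stated assumptions: the class name `HtmlGraphBuilder`, the function `html_to_graph`, the edge type names, and the choice to link each tag only to its immediately preceding sibling are all assumptions of this example, not details given in the patent.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlGraphBuilder(HTMLParser):
    """Turns HTML tags into nodes and tag relationships into typed edges."""

    def __init__(self):
        super().__init__()
        self.nodes = []         # one dict per tag: {"tag", "attrs", "text"}
        self.edges = []         # (source_id, target_id, edge_type)
        self._stack = []        # ids of the currently open tags
        self._last_child = {}   # parent id (None at root level) -> latest child id

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes.append({"tag": tag, "attrs": dict(attrs), "text": ""})
        parent = self._stack[-1] if self._stack else None
        if parent is not None:
            self.edges.append((parent, node_id, "parent_child"))
        previous = self._last_child.get(parent)
        if previous is not None:
            self.edges.append((previous, node_id, "sibling"))
        self._last_child[parent] = node_id
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attach visible text to the innermost open tag as a node feature.
        if self._stack and data.strip():
            self.nodes[self._stack[-1]]["text"] += data.strip()


def html_to_graph(html):
    builder = HtmlGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges
```

Keeping the two relationships as separate edge types is what makes the resulting graph heterogeneous; a real pipeline would still need to encode the node attributes and text numerically before training.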

Further, step S203 specifically comprises: treating the news link extraction task as a node classification task of the deep graph neural network, dividing the data labeled in S201 into a training set, a validation set, and a test set, and training the link extraction model.

Further, step S3 comprises the following specific steps:

S301, for the pages behind the "news links" collected in step S2, taking the HTML tag of the news article page as the unit, labeling the "title", "release time", "author", "content", and "source" tags in the news article page;

S302, constructing a second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model;

S303, training the news information extraction model.

Further, step S302 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model.

Further, step S303 specifically comprises: treating the news information extraction task as a node classification task of the deep graph neural network, dividing the data labeled in S301 into a training set, a validation set, and a test set, and training the news information extraction model.

Further, step S4 comprises the following specific steps:

S401, labeling the content obtained in step S3, each piece of news content being labeled either "news" or "noise";

S402, building and training a text classification model from the collected "news" and "noise" labels.

Further, in step S401, the "noise" label covers recruitment information, advertisements, and news website introductions.

Further, in step S402, the text classification model is built and trained with a natural language processing algorithm from the collected "news" and "noise" labels. Optionally, the text classification model may be built by fine-tuning a pre-trained model.
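To make the "news" versus "noise" classification of step S402 concrete, the sketch below uses a small multinomial naive Bayes classifier written in plain Python. This is a deliberately simple stand-in and not the fine-tuned pre-trained model the text mentions; the class name, the whitespace tokenization (which would not suit Chinese text), and the sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing (stand-in for step S402)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.label_counts = Counter(labels)       # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data with the two labels from step S401.
classifier = NaiveBayesTextClassifier().fit(
    ["city council approves new budget",
     "minister announces election results",
     "we are hiring apply now",
     "buy now limited discount offer"],
    ["news", "news", "noise", "noise"],
)
```

Calling `classifier.predict(...)` on an extracted article body then yields one of the two labels, which is all the downstream filtering step needs.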

The invention has the following beneficial effects:

By collecting and labeling news sites for training, the invention obtains the link extraction model, the news content extraction model, and the text classification model, thereby realizing automatic extraction of news content. It solves the problems that the prior art is computationally complex and cannot extract important information from news conveniently and quickly, and it is computationally simple, efficient, and convenient.

Drawings

FIG. 1 is a flow chart of the news collection and automatic extraction method based on a deep graph neural network.

FIG. 2 shows the labeling result for HTML source code in the news collection and automatic extraction method based on a deep graph neural network.

FIG. 3 is an example of a deep graph neural network in the news collection and automatic extraction method based on a deep graph neural network.

FIG. 4 is a flow chart of applying the news collection and automatic extraction method based on a deep graph neural network to collect and automatically extract news from a specific portal site.

Detailed Description

The invention is described in detail below with reference to the drawings and the detailed description.

Example 1

As shown in FIG. 1, a news collection and automatic extraction method based on a deep graph neural network includes the following steps:

S1, collecting news sites for training; in this embodiment, the home page HTML source code of news websites is collected by prior-art means and used as the annotation data, and 1000 news home pages are collected;

Example 2

S1, collecting news sites for training;

As shown in FIG. 2, in a specific embodiment, step S2 comprises the following specific steps:

S201, taking the HTML tag as the unit, the annotator judges from experience which tags lead, when clicked, to a page with a large number of news links; for example, the annotator labels the tags "news", "video", "sports", and "science and technology" as "section" tags; within each "section", the annotator labels the news-related tags, for example "xx ceremony held in xx city yesterday" and "World Health Organization: new coronavirus pneumonia cases exceed xxx", as "news text" tags; in this embodiment, the annotators label in batches using CSS selector expressions;

S203, training the link extraction model to obtain the trained link extraction model.
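The batch labeling with CSS selector expressions mentioned in S201 might look roughly like the sketch below. Python's standard library has no CSS selector engine, so this example supports only a deliberately reduced `tag.class` selector form; the class `SelectorLabeler`, the function `label_page`, the label names, and the sample selectors are all hypothetical.

```python
from html.parser import HTMLParser


class SelectorLabeler(HTMLParser):
    """Assigns a label to each start tag using simplified 'tag.class' selectors."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules    # list of (selector, label), e.g. ("a.headline", "news_text")
        self.labels = []      # (tag, attrs, label) per start tag, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        label = "other"       # default for tags no rule matches
        for selector, rule_label in self.rules:
            sel_tag, _, sel_class = selector.partition(".")
            if sel_tag == tag and (not sel_class or sel_class in classes):
                label = rule_label
                break         # first matching rule wins
        self.labels.append((tag, attrs, label))


def label_page(html, rules):
    labeler = SelectorLabeler(rules)
    labeler.feed(html)
    labeler.close()
    return labeler.labels
```

One rule per site layout can label every matching tag on a page at once, which is the point of annotating in batches rather than tag by tag.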

As shown in FIG. 3, in a specific embodiment, step S202 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking the first node classification model as the link extraction model. In testing, the accuracy and recall of the links extracted by the model can both reach about 95%.
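For reference, the kind of per-node evaluation behind such figures can be sketched as follows, assuming that "accuracy" is meant here in the sense of precision, which the text does not state explicitly. The function name and the label strings are hypothetical.

```python
def precision_recall(predicted, actual, positive="news_link"):
    """Precision and recall of one class over per-node predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision measures how many of the predicted links are real news links, while recall measures how many of the real news links were found.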

In a specific embodiment, step S203 specifically comprises: treating the news link extraction task as a node classification task of the deep graph neural network, dividing the data labeled in S201 into a training set, a validation set, and a test set, and training the link extraction model.

Example 3

S1, collecting news sites for training;

As shown in FIG. 2, in a specific embodiment, step S2 comprises the following specific steps:

S203, training the link extraction model to obtain the trained link extraction model.

As shown in FIG. 3, in a specific embodiment, step S202 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking the first node classification model as the link extraction model. In this example, 1000 graphs for the deep graph neural network are obtained.

In this embodiment, the node attributes of the deep graph neural network include the HTML tag's tag type, id attribute, class attribute, href attribute, position, number of child tags, child tag type, and the like.

In this embodiment, the position of a tag may be encoded as its rank in order of appearance and then used as a feature.

In this embodiment, the number of child tags is the number of tags that have the HTML tag as their parent node; the child tag type is characterized by whether the child tags contain text or links.
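The node attributes listed above can be gathered into a feature record along the following lines. This is a minimal sketch: the function `node_features`, its argument layout, the key names, and the use of an `a` child tag to detect links are assumptions of this example, and the numeric encoding a real model would need is left out.

```python
def node_features(tag, attrs, position, children):
    """Collect the per-node attributes described for the graph nodes.

    tag:      tag type, e.g. "li"
    attrs:    dict of HTML attributes of the tag
    position: rank of the tag in order of appearance
    children: list of (child_tag, child_has_text) tuples
    """
    return {
        "tag": tag,
        "id": attrs.get("id", ""),
        "class": attrs.get("class", ""),
        "href": attrs.get("href", ""),
        "position": position,
        "num_children": len(children),
        # Child tag type: whether any child carries text or is a link.
        "child_has_text": any(has_text for _, has_text in children),
        "child_has_link": any(child_tag == "a" for child_tag, _ in children),
    }


features = node_features("li", {"class": "news-item"}, 3, [("a", True), ("span", False)])
```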

In this embodiment, 800 of the graphs are randomly assigned to the training set, 100 to the validation set, and the remaining 100 to the test set; a deep graph neural network is then built and trained to obtain the link extraction model.
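The random 800/100/100 split can be sketched in a few lines of Python. The function name `split_graphs` and the fixed seed are assumptions added here for reproducibility; the embodiment only states that the assignment is random.

```python
import random


def split_graphs(graph_ids, n_train=800, n_val=100, seed=0):
    """Shuffle the graphs and cut them into training, validation, and test sets."""
    ids = list(graph_ids)
    random.Random(seed).shuffle(ids)   # seeded shuffle; the seed is an assumption
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]       # whatever remains becomes the test set
    return train, val, test


train_ids, val_ids, test_ids = split_graphs(range(1000))
```

Splitting by whole graph, rather than by node, keeps all tags of one page in the same set.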

In this embodiment, a GCNN deep neural network is adopted as the model architecture of the deep graph neural network.
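Assuming that GCNN refers to a graph convolutional network of the GCN kind mentioned earlier, a single propagation step can be sketched in plain Python as ReLU(D^-1/2 (A + I) D^-1/2 H W), the commonly used GCN layer rule. This sketch ignores the distinction between parent-child and sibling edges and uses nested lists instead of a tensor library; the function names are hypothetical and this is not the patent's actual implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adjacency, features, weights):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Symmetric normalization by the degrees of the self-looped graph.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]
    aggregated = matmul(norm, features)        # mix each node with its neighbours
    transformed = matmul(aggregated, weights)  # learned linear map
    return [[max(0.0, value) for value in row] for row in transformed]
```

For node classification, a final layer would map each node's hidden vector to one score per label, such as "section", "news link", or neither.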

In a specific embodiment, step S3 comprises the following specific steps:

S301, for the pages behind the "news links" collected in step S2, taking the HTML tag of the news article page as the unit, labeling the "title", "release time", "author", "content", and "source" tags in the news article page; in this embodiment, 1000 news links from S003 are extracted for annotation; in this embodiment, the news body is generally carried by <p> tags, usually several consecutive ones that together occupy the largest area of the page; the title tag is generally above the news body tags; the author and source tags are generally located between the title tag and the news body tags, or immediately after the news body tags; the publication time tag is generally located between the title tag and the news body tags; in this embodiment, CSS selector expressions are used for batch labeling;

S303, training the news information extraction model.

In a specific embodiment, step S302 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model. In testing, the accuracy and recall of the model's structured news extraction can both exceed 90%.

In a specific embodiment, step S303 specifically comprises: treating the news information extraction task as a node classification task of the deep graph neural network, dividing the data labeled in S301 into a training set, a validation set, and a test set, and training the news information extraction model.

In this embodiment, 3000 pieces of data are labeled, ensuring that neither the number of items labeled "news" nor the number labeled "noise" is less than 1000.

Example 4

S1, collecting news sites for training;

As shown in FIG. 2, in a specific embodiment, step S2 comprises the following specific steps:

S203, training the link extraction model to obtain the trained link extraction model.

As shown in FIG. 3, in a specific embodiment, step S202 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking the first node classification model as the link extraction model.

In one embodiment, step S3 comprises the following specific steps:

S303, training the news information extraction model.

In one embodiment, step S302 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model.

In one embodiment, step S303 specifically comprises: treating the news information extraction task as a node classification task of the deep graph neural network, dividing the data labeled in S301 into a training set, a validation set, and a test set, and training the news information extraction model.

In one embodiment, step S4 comprises the following specific steps:

In one embodiment, in step S401, the "noise" label covers recruitment information, advertisements, and news website introductions.

In one embodiment, in step S402, a text classification model is built and trained with a natural language processing algorithm from the collected "news" and "noise" labels; in testing, the precision and recall of the classification can both exceed 95%.

In this embodiment, as shown in the flow chart of FIG. 4, data collection from a large news portal is taken as an example. The HTML source code of the portal's home page is collected every week, converted into a graph for the deep graph neural network, and input into the link extraction model, which outputs the section links and news text links found on the home page. The section links obtained are recorded in the collection record of the home page; if a section link has not previously appeared in the collection record, the HTML source code behind that link is collected, converted into a graph, and input into the link extraction model, which outputs the section links and news text links found on that page. The collection and prediction tasks are then repeated for the newly found section links.
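The repeated collection over section links, guarded by the collection record, amounts to a breadth-first traversal. The sketch below illustrates this under stated assumptions: `crawl_site`, `fetch_html`, `extract_links`, and the `max_pages` safety cap are hypothetical names and parameters, and `extract_links` stands in for graph conversion plus the link extraction model.

```python
from collections import deque


def crawl_site(home_url, fetch_html, extract_links, max_pages=100):
    """Follow section links breadth-first and gather news text links.

    fetch_html(url)     -> HTML source code of the page (hypothetical callable)
    extract_links(html) -> (section_links, news_links), standing in for the
                           link extraction model
    max_pages           -> safety cap added for this sketch
    """
    collection_record = {home_url}   # section pages already scheduled
    queue = deque([home_url])
    news_links, seen_news = [], set()
    pages_fetched = 0
    while queue and pages_fetched < max_pages:
        html = fetch_html(queue.popleft())
        pages_fetched += 1
        section_links, page_news = extract_links(html)
        for link in page_news:
            if link not in seen_news:            # keep each news link once
                seen_news.add(link)
                news_links.append(link)
        for link in section_links:
            if link not in collection_record:    # only visit unseen sections
                collection_record.add(link)
                queue.append(link)
    return news_links
```

The collection record is what prevents sections that link back to the home page, or to each other, from being collected again.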

In this embodiment, for each extracted news text link, the HTML source code behind the link is collected, converted into a graph for the deep graph neural network, and input into the news information extraction model, which outputs the corresponding news body, title, publication date, author, and source.

In this embodiment, the extracted news body is input into the text classification model, and the news information classified as "news" is stored. This completes the news extraction for the news website, yielding a batch of news articles together with their structured data: title, body, author, publication time, and source.

It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

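1. A news collection and automatic extraction method based on a deep graph neural network, characterized by comprising the following specific steps:

S1, collecting news sites for training;

S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model;

S3, labeling the news content in the pages behind the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model;

S4, labeling the text of the collected news content, and constructing a text classification model;

S5, collecting the home page HTML source code of the news site to be analyzed and inputting it into the link extraction model to obtain news links; then inputting the obtained news links into the news content extraction model to obtain news content; and finally, inputting the obtained news content into the text classification model and extracting the news information.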

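2. The news collection and automatic extraction method based on a deep graph neural network according to claim 1, wherein step S2 comprises the following specific steps:

S201, taking the HTML tag as the unit, labeling the "section" tags in the site's HTML page, and labeling the "news link" tags within each "section";

S202, constructing a first heterogeneous-graph node classification model from the labeled "news link" tags, and taking the first node classification model as the link extraction model;

S203, training the link extraction model to obtain the trained link extraction model.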

3. The news collection and automatic extraction method based on a deep graph neural network according to claim 2, wherein step S202 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking the first node classification model as the link extraction model.

4. The news collection and automatic extraction method based on a deep graph neural network according to claim 3, wherein step S203 specifically comprises: treating the news link extraction task as a node classification task of the deep graph neural network, dividing the data labeled in S201 into a training set, a validation set, and a test set, and training the link extraction model.

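5. The news collection and automatic extraction method based on a deep graph neural network according to claim 4, wherein step S3 comprises the following specific steps:

S301, for the pages behind the "news links" collected in step S2, taking the HTML tag of the news article page as the unit, labeling the "title", "release time", "author", "content", and "source" tags in the news article page;

S302, constructing a second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model;

S303, training the news information extraction model.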

6. The news collection and automatic extraction method based on a deep graph neural network according to claim 5, wherein step S302 specifically comprises: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between the tags as its edges, and the attributes and text of each tag as the node features, constructing the second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model.

7. The news collection and automatic extraction method based on a deep graph neural network according to claim 6, wherein step S303 specifically comprises: treating the news information extraction task as a node classification task of the deep graph neural network, dividing the data labeled in S301 into a training set, a validation set, and a test set, and training the news information extraction model.

8. The news collection and automatic extraction method based on a deep graph neural network according to claim 7, wherein step S4 comprises the following specific steps:

S401, labeling the content obtained in step S3, each piece of news content being labeled either "news" or "noise";

S402, training a text classification model from the collected "news" and "noise" labels.

9. The news collection and automatic extraction method based on a deep graph neural network according to claim 8, wherein in step S401, the "noise" label covers recruitment information, advertisements, and news website introductions.

10. The news collection and automatic extraction method based on a deep graph neural network according to claim 9, wherein in step S402, the text classification model is built and trained with a natural language processing algorithm from the collected "news" and "noise" labels.