CN114564638A - News collection and automatic extraction method based on a deep graph neural network

Publication number
CN114564638A
Authority
CN
China
Prior art keywords
news
taking
neural network
link
depth map
Prior art date
Legal status
Pending
Application number
CN202210109381.1A
Other languages
Chinese (zh)
Inventor
何宇轩
牟昊
李旭日
徐亚波
Current Assignee
Guangdong Hengqin Shushushuo Story Information Technology Co ltd
Original Assignee
Guangdong Hengqin Shushushuo Story Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Hengqin Shushushuo Story Information Technology Co ltd
Priority to CN202210109381.1A
Publication of CN114564638A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of deep learning and discloses a news collection and automatic extraction method based on a deep graph neural network, comprising the following steps: S1, collecting news sites for training; S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model; S3, labeling the news content in the pages of the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model; S4, labeling the text of the collected news content, and constructing a text classification model; S5, collecting the home page HTML source code of a news site to be analyzed and inputting it into the link extraction model to obtain news links; inputting the news links into the news content extraction model to obtain news content; and finally inputting the news content into the text classification model to extract news information. The invention addresses the problems that the prior art is computationally complex and cannot conveniently and quickly extract the important information in news.

Description

News collection and automatic extraction method based on a deep graph neural network
Technical Field
The invention relates to the technical field of deep learning, and in particular to a news collection and automatic extraction method based on a deep graph neural network.
Background
In the big-data era, hot news emerges endlessly and in no organized fashion. Analyzing and processing a large number of news events involves two technical problems: first, finding the links to a large number of news items; second, performing structured analysis on the news to extract information such as title, content, author and publication time. At present, most techniques focus on extracting the news text, and there has been little research on how to obtain a large number of news links. Most news text extraction techniques do this work through regular expressions or web page templates. These techniques are adequate for the task but have the following disadvantages: constructing the web page templates or regular expressions consumes a great deal of manpower, and when a website is redesigned they must be rewritten, which is time-consuming and laborious. In addition, some techniques extract the news text by computing features such as text density, and have the following disadvantages: first, only the body text can be extracted; second, extraction errors are likely on web pages whose news text is short or which contain too much interfering information.
To address these shortcomings, the prior art discloses a news web page text extraction system and method based on multi-modal machine learning, the method comprising the following steps: extracting different types of features; performing multi-modal fusion to obtain a joint representation of the features; and training a web page text classification model. However, this prior art is computationally complex, cannot conveniently and quickly extract the important information in news, and does not solve the difficulty of obtaining news links. How to devise an automatic news extraction method that can conveniently and quickly obtain a large number of news links from a website and extract the news information is therefore a problem in urgent need of a solution in this technical field.
Disclosure of Invention
To solve the problems that the prior art is computationally complex and cannot conveniently and quickly extract the important information in news, the invention provides a news collection and automatic extraction method based on a deep graph neural network, which is computationally simple, efficient and convenient.
To achieve this purpose, the technical scheme is as follows:
A news collection and automatic extraction method based on a deep graph neural network comprises the following specific steps:
S1, collecting news sites for training;
S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model;
S3, labeling the news content in the pages of the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model;
S4, labeling the text of the collected news content, and constructing a text classification model;
S5, collecting the home page HTML source code of a news site to be analyzed and inputting it into the link extraction model to obtain news links; then inputting the obtained news links into the news content extraction model to obtain news content; and finally inputting the obtained news content into the text classification model to extract news information.
Preferably, step S2 comprises the following specific steps:
S201, taking the HTML tag as the unit, labeling the "sections" in the site's HTML page, and labeling the "news link" tags within each "section";
S202, constructing a first heterogeneous-graph node classification model from the obtained news link tags, and taking the first node classification model as the link extraction model;
S203, training the link extraction model to obtain the trained link extraction model.
Further, step S202 is specifically: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking it as the link extraction model. A graph convolutional network (GCN) may be chosen as the deep graph neural network.
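The graph construction described in step S202 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it uses Python's standard `html.parser`, and the names `HtmlGraphBuilder`, `html_to_graph` and `VOID_TAGS` are hypothetical. Each tag becomes a node, and parent-child and sibling relationships are kept as two separate edge lists.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumption: a short, non-exhaustive list).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlGraphBuilder(HTMLParser):
    """Collects one node per HTML tag plus parent-child and sibling edges."""

    def __init__(self):
        super().__init__()
        self.nodes = []          # node id -> {"tag", "attrs", "text"}
        self.parent_child = []   # (parent id, child id)
        self.sibling = []        # (previous sibling id, next sibling id)
        self._stack = []         # ids of currently open tags
        self._last_child = {}    # parent id (None at top level) -> latest child id

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        self.nodes.append({"tag": tag, "attrs": dict(attrs), "text": ""})
        parent = self._stack[-1] if self._stack else None
        if parent is not None:
            self.parent_child.append((parent, node_id))
        prev = self._last_child.get(parent)
        if prev is not None:
            self.sibling.append((prev, node_id))
        self._last_child[parent] = node_id
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Pop back to the nearest open tag with this name, tolerating sloppy markup.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[i]]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attach visible text to the innermost open tag.
        if self._stack and data.strip():
            self.nodes[self._stack[-1]]["text"] += data.strip()


def html_to_graph(html):
    """Return (nodes, parent_child_edges, sibling_edges) for an HTML string."""
    builder = HtmlGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.parent_child, builder.sibling
```

Keeping the two edge types in separate lists is one way to preserve the heterogeneous structure, so a downstream model can treat parent-child and sibling edges differently.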
Further, step S203 is specifically: treating the extraction of news links as a node classification task for the deep graph neural network, dividing the data labeled in S201 into a training set, a validation set and a test set, and training the link extraction model.
Further, step S3 comprises the following specific steps:
S301, for the "news links" collected in step S2, labeling the "title", "publication time", "author", "content" and "source" tags in each news text page, taking the HTML tags of the page as the unit;
S302, constructing a second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model;
S303, training the news information extraction model.
Further, step S302 is specifically: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between tags as its edges, and the attributes and text of each tag as the node features, constructing the second heterogeneous-graph node classification model, and taking it as the news information extraction model.
Further, step S303 is specifically: treating the extraction of news information as a node classification task for the deep graph neural network, dividing the data labeled in S301 into a training set, a validation set and a test set, and training the news information extraction model.
Further, step S4 comprises the following specific steps:
S401, labeling the content obtained in step S3, each piece of news content being labeled "news" or "noise";
S402, constructing and training a text classification model from the collected "news" and "noise" labels.
Further, in step S401, the "noise" label covers recruitment information, advertisements and news website introductions.
Further, in step S402, the text classification model is built and trained with a natural language processing algorithm from the collected "news" and "noise" labels. The text classification model may optionally be built by fine-tuning a pre-trained model.
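To make the "news" versus "noise" classification of step S402 concrete, the sketch below trains a small bag-of-words naive Bayes classifier. This is a deliberately swapped-in stand-in: the source suggests fine-tuning a pre-trained model, which is not reproduced here. The class name `NaiveBayesTextClassifier`, the whitespace tokenization and the toy sentences are all assumptions for illustration; Chinese text would need a proper tokenizer.

```python
import math
from collections import Counter


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing (stand-in, not the patent's model)."""

    def fit(self, texts, labels):
        self.word_counts = {}              # label -> Counter of word frequencies
        self.doc_counts = Counter(labels)  # label -> number of training texts
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()   # assumption: whitespace tokenization
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for illustration only).
clf = NaiveBayesTextClassifier().fit(
    ["city council approves new budget",
     "minister announces health policy",
     "we are hiring engineers apply now",
     "buy one get one free discount"],
    ["news", "news", "noise", "noise"],
)
```

With this toy data, recruitment-style text such as "hiring engineers now" scores higher under the "noise" label, which mirrors how the method filters out recruitment information and advertisements before storing results.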
The invention has the following beneficial effects:
By collecting and labeling news sites for training, the invention obtains the link extraction model, the news content extraction model and the text classification model, and thereby extracts news content automatically. This addresses the problems that the prior art is computationally complex and cannot conveniently and quickly extract the important information in news; the method is computationally simple, efficient and convenient.
Drawings
FIG. 1 is a flow chart of the news collection and automatic extraction method based on a deep graph neural network.
FIG. 2 shows the result of labeling HTML source code in the method.
FIG. 3 is an example of the graph used by the deep graph neural network in the method.
FIG. 4 is a flow chart of applying the method to collect and automatically extract news from a specific portal site.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
As shown in FIG. 1, a news collection and automatic extraction method based on a deep graph neural network comprises the following steps:
S1, collecting news sites for training; in this embodiment, the home page HTML source code of news websites is collected by prior-art means and used as the data to be labeled, 1000 news home pages being collected;
S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model;
S3, labeling the news content in the pages of the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model;
S4, labeling the text of the collected news content, and constructing a text classification model;
S5, collecting the home page HTML source code of a news site to be analyzed and inputting it into the link extraction model to obtain news links; then inputting the obtained news links into the news content extraction model to obtain news content; and finally inputting the obtained news content into the text classification model to extract news information.
Example 2
As shown in FIG. 1, a news collection and automatic extraction method based on a deep graph neural network comprises the following steps:
S1, collecting news sites for training;
S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model;
S3, labeling the news content in the pages of the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model;
S4, labeling the text of the collected news content, and constructing a text classification model;
S5, collecting the home page HTML source code of a news site to be analyzed and inputting it into the link extraction model to obtain news links; then inputting the obtained news links into the news content extraction model to obtain news content; and finally inputting the obtained news content into the text classification model to extract news information.
As shown in FIG. 2, in an embodiment, step S2 comprises the following specific steps:
S201, taking the HTML tag as the unit, the annotator judges from experience which tags lead, when clicked, to a web page containing a large number of news links; such tags, for example "News", "Video", "Sports" and "Science and Technology", are labeled as "section" tags. Within each "section", the annotator labels the news-related tags, for example "xx ceremony held in xx city yesterday" and "World Health Organization: new novel coronavirus pneumonia cases exceed xxx", as "news text" tags. In this embodiment, the annotators label in batches using CSS selector expressions;
S202, constructing a first heterogeneous-graph node classification model from the obtained news link tags, and taking the first node classification model as the link extraction model;
S203, training the link extraction model to obtain the trained link extraction model.
As shown in FIG. 3, in a specific embodiment, step S202 is specifically: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking it as the link extraction model. In testing, the accuracy and recall of the links extracted by the model reach about 95%.
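For reference, the two test metrics can be computed per node as in the sketch below. It assumes that the "accuracy" reported alongside recall is what is usually called precision, which the source does not state explicitly; the function name `precision_recall` and the label string `"news_link"` are hypothetical.

```python
def precision_recall(predicted, actual, positive="news_link"):
    """Precision and recall of one class over parallel lists of node labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision measures how many of the extracted links are real news links; recall measures how many of the real news links were extracted.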
In an embodiment, step S203 is specifically: treating the extraction of news links as a node classification task for the deep graph neural network, dividing the data labeled in S201 into a training set, a validation set and a test set, and training the link extraction model.
Example 3
As shown in FIG. 1, a news collection and automatic extraction method based on a deep graph neural network comprises the following steps:
S1, collecting news sites for training;
S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model;
S3, labeling the news content in the pages of the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model;
S4, labeling the text of the collected news content, and constructing a text classification model;
S5, collecting the home page HTML source code of a news site to be analyzed and inputting it into the link extraction model to obtain news links; then inputting the obtained news links into the news content extraction model to obtain news content; and finally inputting the obtained news content into the text classification model to extract news information.
As shown in FIG. 2, in an embodiment, step S2 comprises the following specific steps:
S201, taking the HTML tag as the unit, the annotator judges from experience which tags lead, when clicked, to a web page containing a large number of news links; such tags, for example "News", "Video", "Sports" and "Science and Technology", are labeled as "section" tags. Within each "section", the annotator labels the news-related tags, for example "xx ceremony held in xx city yesterday" and "World Health Organization: new novel coronavirus pneumonia cases exceed xxx", as "news text" tags. In this embodiment, the annotators label in batches using CSS selector expressions;
S202, constructing a first heterogeneous-graph node classification model from the obtained news link tags, and taking the first node classification model as the link extraction model;
S203, training the link extraction model to obtain the trained link extraction model.
As shown in FIG. 3, in an embodiment, step S202 is specifically: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking it as the link extraction model. In this example, 1000 graphs are obtained for the deep graph neural network.
In this embodiment, the node attributes of the deep graph neural network include the tag type of the HTML tag, its id attribute, class attribute and href attribute, the position of the tag, the number of child tags, the child tag type, and the like.
In this embodiment, the position of a tag may be obtained by ordering the tags by their order of appearance, and this rank is then used as a feature.
In this embodiment, the number of child tags is the number of tags that have the HTML tag as their parent node; the child tag type feature records whether the child tags contain text or links.
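The node attributes listed above could be assembled along the following lines. This is a hedged sketch: the input layout (a list of node dicts plus parent-child index pairs), the function name `node_features` and the feature keys are assumptions, and turning the string-valued features into numeric vectors is left out.

```python
def node_features(nodes, parent_child):
    """Build one feature dict per tag.

    nodes: list of {"tag": str, "attrs": dict, "text": str} in order of appearance.
    parent_child: list of (parent index, child index) pairs.
    """
    children = {i: [] for i in range(len(nodes))}
    for parent, child in parent_child:
        children[parent].append(child)

    features = []
    for position, node in enumerate(nodes):
        kids = [nodes[c] for c in children[position]]
        features.append({
            "tag": node["tag"],
            "id": node["attrs"].get("id", ""),
            "class": node["attrs"].get("class", ""),
            "has_href": "href" in node["attrs"],
            "position": position,                     # rank by order of appearance
            "num_children": len(kids),                # tags with this tag as parent
            "child_has_text": any(k["text"] for k in kids),
            "child_has_link": any(k["tag"] == "a" for k in kids),
        })
    return features
```

Categorical fields such as `tag` or `class` would typically be one-hot or embedding encoded before being fed to a graph neural network; that encoding choice is not specified in the source.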
In an embodiment, step S203 is specifically: treating the extraction of news links as a node classification task for the deep graph neural network, dividing the data labeled in S201 into a training set, a validation set and a test set, and training the link extraction model.
In this embodiment, 800 graphs are randomly assigned to the training set, 100 to the validation set and the remaining 100 to the test set, and a deep graph neural network is built and trained on them to obtain the link extraction model.
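The random 800/100/100 split can be sketched as below. The function name `split_graphs` and the fixed seed are assumptions added for reproducibility; the source only states that the assignment is random.

```python
import random


def split_graphs(graphs, n_train=800, n_val=100, seed=0):
    """Shuffle the graphs and cut them into training, validation and test sets."""
    shuffled = list(graphs)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains (100 of 1000 here)
    return train, val, test
```

Splitting at the level of whole graphs (one per page) rather than individual nodes keeps all tags of a page in the same subset, so the test set measures performance on unseen pages.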
In this embodiment, a GCNN deep neural network is adopted as the model architecture of the deep graph neural network.
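The source does not detail the GCNN architecture. As an illustration only, the sketch below implements one layer of the widely used graph convolution rule, ReLU(D^-1/2 (A + I) D^-1/2 H W), in plain Python. It is an assumption that the embodiment uses this rule; the function name `gcn_layer` is hypothetical, and the sketch merges parent-child and sibling edges into a single undirected edge list rather than treating them as distinct types.

```python
import math


def gcn_layer(edges, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 . H . W).

    edges: undirected (i, j) index pairs.
    features: n x f list of lists (H). weights: f x k list of lists (W).
    """
    n = len(features)
    neighbours = [{i} for i in range(n)]    # self-loops, i.e. the "+ I" term
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    degree = [len(nb) for nb in neighbours]

    f = len(weights)
    k = len(weights[0])
    output = []
    for i in range(n):
        # Symmetrically normalised sum over node i and its neighbours.
        agg = [0.0] * f
        for j in neighbours[i]:
            norm = 1.0 / math.sqrt(degree[i] * degree[j])
            for d in range(f):
                agg[d] += norm * features[j][d]
        # Linear transform followed by ReLU.
        row = [max(0.0, sum(agg[d] * weights[d][c] for d in range(f)))
               for c in range(k)]
        output.append(row)
    return output
```

Stacking a few such layers and ending with a per-node softmax gives a node classifier: each tag's representation is mixed with those of its parent, children and siblings before the label is predicted.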
In an embodiment, step S3 comprises the following specific steps:
S301, for the "news links" collected in step S2, labeling the "title", "publication time", "author", "content" and "source" tags in each news text page, taking the HTML tags of the page as the unit. In this embodiment, 1000 news links in S003 are extracted for labeling. The news body is generally held in <p> tags, several of them in succession, which usually occupy the largest area of the web page; the title tag is generally above the news body tags; the author and source tags are generally located between the title tag and the news body tags, or immediately after the news body tags; the publication time tag is generally located between the title tag and the news body tags. In this embodiment, CSS selector expressions are used for batch labeling;
S302, constructing a second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model;
S303, training the news information extraction model.
In an embodiment, step S302 is specifically: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between tags as its edges, and the attributes and text of each tag as the node features, constructing the second heterogeneous-graph node classification model, and taking it as the news information extraction model. In testing, the accuracy and recall of the model's structured extraction of news reach more than 90%.
In this embodiment, a GCNN deep neural network is adopted as the model architecture of the deep graph neural network.
In a specific embodiment, step S303 is specifically: treating the extraction of news information as a node classification task for the deep graph neural network, dividing the data labeled in S301 into a training set, a validation set and a test set, and training the news information extraction model.
In this embodiment, 3000 pieces of data are labeled, ensuring that the number labeled "news" and the number labeled "noise" are each not less than 1000.
Example 4
As shown in FIG. 1, a news collection and automatic extraction method based on a deep graph neural network comprises the following steps:
S1, collecting news sites for training;
S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model;
S3, labeling the news content in the pages of the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model;
S4, labeling the text of the collected news content, and constructing a text classification model;
S5, collecting the home page HTML source code of a news site to be analyzed and inputting it into the link extraction model to obtain news links; then inputting the obtained news links into the news content extraction model to obtain news content; and finally inputting the obtained news content into the text classification model to extract news information.
As shown in FIG. 2, in an embodiment, step S2 comprises the following specific steps:
S201, taking the HTML tag as the unit, labeling the "sections" in the site's HTML page, and labeling the "news link" tags within each "section";
S202, constructing a first heterogeneous-graph node classification model from the obtained news link tags, and taking the first node classification model as the link extraction model;
S203, training the link extraction model to obtain the trained link extraction model.
As shown in FIG. 3, in a specific embodiment, step S202 is specifically: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between tags as its edges, and the attributes and text of each tag as the node features, constructing the first heterogeneous-graph node classification model, and taking it as the link extraction model.
In an embodiment, step S203 is specifically: treating the extraction of news links as a node classification task for the deep graph neural network, dividing the data labeled in S201 into a training set, a validation set and a test set, and training the link extraction model.
In one embodiment, step S3 comprises the following steps:
S301, for the "news links" collected in step S2, labeling the "title", "publication time", "author", "content" and "source" tags in each news text page, taking the HTML tags of the page as the unit;
S302, constructing a second heterogeneous-graph node classification model, and taking the second node classification model as the news information extraction model;
S303, training the news information extraction model.
In an embodiment, step S302 is specifically: taking the tags in the HTML source code as the nodes of the deep graph neural network, the parent-child and sibling relationships between tags as its edges, and the attributes and text of each tag as the node features, constructing the second heterogeneous-graph node classification model, and taking it as the news information extraction model.
In an embodiment, step S303 is specifically: treating the extraction of news information as a node classification task for the deep graph neural network, dividing the data labeled in S301 into a training set, a validation set and a test set, and training the news information extraction model.
In one embodiment, step S4 comprises the following steps:
S401, labeling the content obtained in step S3, each piece of news content being labeled "news" or "noise";
S402, constructing and training a text classification model from the collected "news" and "noise" labels.
In one embodiment, in step S401, the "noise" label covers recruitment information, advertisements and news website introductions.
In one embodiment, in step S402, the text classification model is built and trained with a natural language processing algorithm from the collected "news" and "noise" labels; in testing, the precision and recall of the classification reach more than 95%.
In this embodiment, the specific flow is shown in FIG. 4, taking data collection from a large news portal as an example. The HTML source code of the portal's home page is collected every week, converted into a graph for the deep graph neural network, and input into the link extraction model, which outputs the section links and news text links in the home page. Each section link obtained is checked against the collection record of the home page; if it does not yet appear there, it is added to the record, the HTML source code of the link is collected, converted into a graph, and input into the link extraction model, which outputs the section links and news text links in that web page. The collection and prediction tasks are then repeated for the newly found section links.
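The repeated collection over section links amounts to a breadth-first traversal with a visited set playing the role of the collection record. The sketch below shows that control flow only; `crawl_site`, `fetch_html` and `extract_links` are hypothetical names, and the two callables stand in for the page downloader and the trained link extraction model, which are not implemented here.

```python
from collections import deque


def crawl_site(homepage_url, fetch_html, extract_links):
    """Collect news links by following section links breadth-first.

    fetch_html(url) -> HTML string (assumed downloader).
    extract_links(html) -> (section_links, news_links) (assumed model wrapper).
    """
    visited_sections = {homepage_url}   # the "collection record"
    queue = deque([homepage_url])
    news_links = []
    seen_news = set()
    while queue:
        url = queue.popleft()
        section_links, page_news = extract_links(fetch_html(url))
        for link in page_news:
            if link not in seen_news:          # keep each news link once
                seen_news.add(link)
                news_links.append(link)
        for link in section_links:
            if link not in visited_sections:   # only follow unseen sections
                visited_sections.add(link)
                queue.append(link)
    return news_links
```

Because every section link is recorded before it is queued, sections that link back to the home page or to each other do not cause the loop to run forever.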
In this embodiment, for each extracted news text link, the HTML source code of the link is collected, converted into a graph for the deep graph neural network, and input into the news information extraction model, which outputs the corresponding news text, title, publication date, author and source.
In this embodiment, the extracted news text is input into the text classification model, and the news information classified as "news" is stored. At this point, news extraction for the news website is complete, yielding a batch of news items together with their structured data, such as title, text, author, publication time and source.
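The last two stages can be chained as in the following sketch, which keeps only items the classifier labels "news". All names (`extract_news`, `fetch_html`, `extract_fields`, `classify_text`) and the field keys are assumptions; the callables are placeholders for the downloader and the two trained models.

```python
def extract_news(news_links, fetch_html, extract_fields, classify_text):
    """Run field extraction and noise filtering over a list of news links.

    extract_fields(html) -> dict with keys such as "title", "content",
    "author", "time", "source" (assumed output of the extraction model).
    classify_text(text) -> "news" or "noise" (assumed classifier wrapper).
    """
    records = []
    for url in news_links:
        fields = extract_fields(fetch_html(url))
        # Store the structured record only when the body text is real news.
        if classify_text(fields.get("content", "")) == "news":
            records.append({"url": url, **fields})
    return records
```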
By collecting and labeling news sites for training, the invention obtains the link extraction model, the news content extraction model and the text classification model, and thereby extracts news content automatically. This addresses the problems that the prior art is computationally complex and cannot conveniently and quickly extract the important information in news; the method is computationally simple, efficient and convenient.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A news collection and automatic extraction method based on a deep graph neural network, characterized by comprising the following specific steps:
S1, collecting news sites for training;
S2, labeling the news links in the collected news sites, taking the HTML tag as the unit, and training a deep graph neural network to obtain a link extraction model;
S3, labeling the news content in the pages of the collected news links, taking the HTML tag as the unit, and constructing a news content extraction model;
S4, labeling the text of the collected news content, and constructing a text classification model;
S5, collecting the home page HTML source code of a news site to be analyzed and inputting it into the link extraction model to obtain news links; then inputting the obtained news links into the news content extraction model to obtain news content; and finally inputting the obtained news content into the text classification model to extract news information.
2. The news collection and automatic extraction method based on the depth map neural network as claimed in claim 1, wherein step S2 specifically comprises:
S201, labeling the "sections" in a site HTML page, taking the HTML tag as the unit, and labeling the "news links" within each "section";
S202, constructing a first node classification model of the heteromorphic graph from the obtained "news link" labels, and taking the first node classification model as the link extraction model;
S203, training the link extraction model to obtain the trained link extraction model.
3. The news collection and automatic extraction method based on the depth map neural network as claimed in claim 2, wherein step S202 specifically comprises: constructing the first node classification model of the heteromorphic graph by taking the tags in the HTML source code as nodes of the depth map neural network, taking the parent-child and sibling relationships between the tags in the HTML source code as edges of the depth map neural network, and taking the attributes and text within the tags as the node features; and taking the first node classification model as the link extraction model.
4. The news collection and automatic extraction method based on the depth map neural network as claimed in claim 3, wherein step S203 specifically comprises: treating the news link extraction task as a node classification task of the depth map neural network, dividing the data labeled in S201 into a training set, a verification set and a test set, and training the link extraction model.
5. The method of claim 4, wherein step S3 specifically comprises:
S301, for the "news links" collected in step S2, labeling a "title" label, a "release time" label, an "author" label, a "content" label and a "source" label in each news text page, taking the HTML tag of the news text page as the unit;
S302, constructing a second node classification model of the heteromorphic graph, and taking the second node classification model as the news information extraction model;
S303, training the news information extraction model.
6. The method of claim 5, wherein step S302 specifically comprises: constructing the second node classification model of the heteromorphic graph by taking the tags in the HTML source code as nodes of the depth map neural network, taking the parent-child and sibling relationships between the tags in the HTML source code as edges of the depth map neural network, and taking the attributes and text within the tags as the node features; and taking the second node classification model as the news information extraction model.
7. The method of claim 6, wherein step S303 specifically comprises: treating the news information extraction task as a node classification task of the depth map neural network, dividing the data labeled in S301 into a training set, a verification set and a test set, and training the news information extraction model.
8. The method of claim 7, wherein step S4 specifically comprises:
S401, labeling the content obtained in step S3, each news content item being labeled with either a "news" label or a "noise" label;
S402, training a text classification model on the collected "news" and "noise" labels.
9. The news collection and automatic extraction method based on the depth map neural network as claimed in claim 8, wherein in step S401 the "noise" label covers recruitment information, advertisements, and news website introductions.
10. The method of claim 9, wherein in step S402 the text classification model is established and trained by a natural language processing algorithm on the collected "news" and "noise" labels.
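The division of labeled data into a training set, a verification set and a test set recited in claims 4 and 7 can be illustrated with a short Python sketch. The claims do not state split ratios or a splitting unit, so the 80/10/10 ratios, the fixed random seed, the choice to split by whole page, and the name split_labeled_pages are all assumptions made here for illustration only.

```python
import random


def split_labeled_pages(pages, train=0.8, val=0.1, seed=0):
    """Shuffles labeled pages and splits them into train / verification / test lists.

    Splitting by whole page (an assumption) keeps all tags of one page in the
    same set, so the test set contains only pages unseen during training.
    """
    items = list(pages)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]  # whatever remains after the first two sets
    return train_set, val_set, test_set


labeled_pages = [f"page_{i}.html" for i in range(10)]
train_set, val_set, test_set = split_labeled_pages(labeled_pages)
print(len(train_set), len(val_set), len(test_set))
```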

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210109381.1A CN114564638A (en) 2022-01-28 2022-01-28 News collection and automatic extraction method based on depth map neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210109381.1A CN114564638A (en) 2022-01-28 2022-01-28 News collection and automatic extraction method based on depth map neural network

Publications (1)

Publication Number Publication Date
CN114564638A (en) 2022-05-31

Family

ID=81713204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210109381.1A Pending CN114564638A (en) 2022-01-28 2022-01-28 News collection and automatic extraction method based on depth map neural network

Country Status (1)

Country Link
CN (1) CN114564638A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910393A (en) * 2023-09-13 2023-10-20 戎行技术有限公司 Large-batch news data acquisition method based on recurrent neural network
CN116910393B (en) * 2023-09-13 2023-12-12 戎行技术有限公司 Large-batch news data acquisition method based on recurrent neural network

Similar Documents

Publication Publication Date Title
Blismas et al. Computer-aided qualitative data analysis: panacea or paradox?
CN110046261B (en) Construction method of multi-modal bilingual parallel corpus of construction engineering
CN111783394B (en) Training method of event extraction model, event extraction method, system and equipment
CN102163187B (en) Document marking method and device
CN103023714B (en) The liveness of topic Network Based and cluster topology analytical system and method
CN102662930A (en) Corpus tagging method and corpus tagging device
CN106570171A (en) Semantics-based sci-tech information processing method and system
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
CN112749284B (en) Knowledge graph construction method, device, equipment and storage medium
CN103927397B (en) Recognition method for Web page link blocks based on block tree
CN106446072B (en) The treating method and apparatus of web page contents
CN102135976B (en) Hypertext markup language page structured data extraction method and device
CN102253937A (en) Method and related device for acquiring information of interest in webpages
CN110609983A (en) Structured decomposition method for policy file
CN101661468B (en) Method for extracting post metadata from forum post list pages
CN106844782B (en) Network-oriented multi-channel big data acquisition system and method
CN111737623A (en) Webpage information extraction method and related equipment
CA2794763C (en) System for use in editorial review of stored information
CN103699370A (en) SurvML (Survey Marked Language) design and development method based on XML (Extensive Markup Language)
CN114564638A (en) News collection and automatic extraction method based on depth map neural network
CN103559202B (en) A kind of webpage content extraction apparatus and method
CN110162684B (en) Machine reading understanding data set construction and evaluation method based on deep learning
CN115658993B (en) Intelligent extraction method and system for core content of webpage
Aqel et al. A framework for employee appraisals based on sentiment analysis
Sannier et al. Legal markup generation in the large: an experience report

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination