CN111581476A - Intelligent webpage information extraction method based on BERT and LSTM - Google Patents
Intelligent webpage information extraction method based on BERT and LSTM
- Publication number
- CN111581476A (application number CN202010351978.8A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- information
- training
- information extraction
- bert
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the field of Internet information extraction and mining and discloses an intelligent webpage information extraction method based on BERT and LSTM, comprising the following steps: S1, crawling webpages; S2, preprocessing the webpages; S3, labeling webpage targets; S4, training a neural network model; S5, deploying the neural network model; S6, after training is completed, using the obtained model to recognize target information in an input webpage. With this scheme, information can still be located accurately after page content is updated or the webpage structure changes, so an accurate extraction result is obtained. Information can thus be extracted from webpages intelligently, which reduces the cost and increases the speed of webpage information extraction. In testing, the model's recognition accuracy on pages in the training set exceeded 98%, and its recall exceeded 97%.
Description
Technical Field
The invention relates to the technical field of Internet information extraction and mining, in particular to an intelligent webpage information extraction method based on BERT and LSTM.
Background
With the rapid development of science and technology, the Web has become the largest encyclopedic database in the world. Although users can conveniently access information on the Web through applications such as browsers, retrieving information on the Internet remains difficult. If the information in webpages can be extracted and stored in structured form to build a knowledge graph, search can become more accurate and convenient.
Information extraction based on the HTML structure is the approach generally adopted by web crawlers. Once the position of the information in the HTML has been located, extraction rules that match the current page format are written using regular expressions, HTML selectors, and similar tools, and the data are extracted from the HTML document. This is currently the most widely used extraction approach. However, it scales poorly, requires a large amount of manual work, and is very costly.
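To make the weakness of rule-based extraction concrete, the following minimal sketch applies one hand-written regular expression to two versions of the same page. The HTML fragments, the rule, and the function name `extract_type` are all invented for illustration and do not come from the patent.

```python
import re

# Hypothetical company page fragment; the markup is invented for illustration.
PAGE_V1 = '<table><tr><td class="label">Type</td><td class="value">Limited liability company</td></tr></table>'
# The same information after an assumed site redesign.
PAGE_V2 = '<div class="row"><span>Type</span><p>Limited liability company</p></div>'

# A hand-written rule tied to the first layout: a label cell followed by a value cell.
TYPE_RULE = re.compile(r'<td class="label">Type</td><td class="value">([^<]*)</td>')


def extract_type(html):
    """Return the company type if the rule matches the page, otherwise None."""
    match = TYPE_RULE.search(html)
    return match.group(1) if match else None


result_v1 = extract_type(PAGE_V1)  # the rule matches the layout it was written for
result_v2 = extract_type(PAGE_V2)  # the rule no longer matches after the redesign
print(result_v1)
print(result_v2)
```

The rule succeeds on the layout it was written for and returns nothing once the markup changes, even though the visible information is identical. This is the maintenance burden the paragraph above describes.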
The concept of deep learning derives from research on artificial neural networks. It is a branch of machine learning and comprises a set of algorithms that model high-level abstractions in data. Compared with shallow learning, deep learning combines low-level features across multiple layers to form more abstract high-level features, so that features are learned automatically without human involvement in feature selection.
The rapid development of deep learning in recent years stems from the layer-wise pre-training algorithm based on the Deep Belief Network (DBN), proposed in 2006 by Hinton et al. of the University of Toronto, Canada. The algorithm addressed the poor learning performance of deep architectures and brought hope for solving the optimization problems associated with them.
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based bidirectional encoder representation. Bidirectional means that, when processing a word, the model can take into account the words both before and after it, and thereby capture the semantics of the context.
The Long Short-Term Memory (LSTM) model is a particular RNN architecture designed to solve the long-term dependency problem in neural networks, that is, to give the network the ability to use past sequence information when inferring current sequence information. The basic node of an LSTM is called a cell. A state parameter in the cell is responsible for memorizing information; the input and output of the cell interact with it through an Input Gate and an Output Gate respectively; and a Forget Gate is added to discard memorized information that is no longer relevant, which gives the model a controllable degree of memory.
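For reference, the gating described above is commonly written as follows. This is the widely used textbook formulation, not notation taken from the patent. The symbols are assumptions: $x_t$ is the input at step $t$, $h_t$ the hidden output, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication, and $W_\ast$, $U_\ast$, $b_\ast$ learned parameters.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(cell output)}
\end{aligned}
```

The forget gate $f_t$ scales down the old state $c_{t-1}$, which corresponds to discarding outdated memory, while the input gate $i_t$ controls how much new information enters the cell.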
The prior art has the following problems: accurate information extraction requires a large manual workload, because the information is located by manually analyzing the page structure. Since webpage information changes dynamically and is updated in real time, locating the information easily fails after page content is updated or the webpage structure changes, which causes extraction to fail or yields inaccurate results.
Disclosure of Invention
1. Technical problem to be solved
In view of the problems in the prior art, the invention aims to provide an intelligent webpage information extraction method based on BERT and LSTM that can still locate information accurately after page content is updated or the webpage structure changes, so as to obtain an accurate extraction result.
2. Technical scheme
In order to solve the above problems, the present invention adopts the following technical solutions.
An intelligent webpage information extraction method based on BERT and LSTM comprises the following steps:
S1, crawling webpages: crawl the vertical-domain webpages to be extracted, collecting at least 3000 company pages as a training set and 500 pages as a test set;
S2, webpage preprocessing: preprocess the webpages and clean out useless HTML tags;
S3, webpage target labeling: label webpage targets automatically using hand-written rules, that is, manually inspect the HTML tags that hold the information to be extracted, then use XPath to attach the corresponding class label to each such tag;
S4, training a neural network model: first build a vocabulary table and its mapping; then split the webpage source code into tokens and fine-tune a pre-trained BERT model to obtain word vectors;
S5, deploying the neural network model: perform automatic information extraction on webpages and evaluate the accuracy and recall of the model;
S6, after training is completed, use the obtained model to recognize target information in an input webpage and test the recognition performance on the 500 pages of the test set; during recognition the labeling information is ignored, and the type of the target information is judged from the content and context of each text node.
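The preprocessing in step S2 can be pictured with a short sketch. It uses Python's standard `html.parser` module and assumes that "useless" markup means script and style content, and that the page can be reduced to its ordered text nodes. Both assumptions, the sample page, and the names `TextNodeCollector` and `preprocess` are illustrative rather than part of the patent.

```python
from html.parser import HTMLParser

# Tags whose content carries no extractable information (an assumed list).
SKIPPED_TAGS = {"script", "style", "noscript"}


class TextNodeCollector(HTMLParser):
    """Collects non-empty text nodes in document order, ignoring SKIPPED_TAGS content."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.text_nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.text_nodes.append(text)


def preprocess(html):
    """Return the cleaned list of text nodes for one page."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.text_nodes


# Hypothetical page; the content is invented for illustration.
sample = (
    "<html><head><style>td{color:red}</style><script>var a = 1;</script></head>"
    "<body><table><tr><td>Type</td><td>Limited liability company</td></tr>"
    "<tr><td>Residence</td><td>Example Road 1</td></tr></table></body></html>"
)
nodes = preprocess(sample)
print(nodes)
```

After cleaning, each value node directly follows its label node, which is the positional regularity a sequence model can exploit.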
Preferably, the company page information includes a unified social credit code, a type, a business range, a residence, and the like.
Preferably, the neural network model is trained with the Nesterov Momentum optimization algorithm for 100 epochs on the training set, and the resulting training loss is less than 0.09.
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
(1) The method applies to fields such as Internet information extraction and mining and open knowledge graph construction. It can still locate information accurately after page content is updated or the webpage structure changes and thus obtain an accurate extraction result, so that information can be extracted from webpages intelligently, the cost of webpage information extraction is reduced, and its speed is increased.
(2) In testing, the model's recognition accuracy on pages in the training set exceeded 98%, and its recall exceeded 97%.
Drawings
FIG. 1 is a diagram of a neural network architecture according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work are within the scope of the present invention.
In the description of the present invention, it should be noted that the terms "upper", "lower", "inner", "outer", "top/bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "disposed," "sleeved/connected," "connected," and the like are to be construed broadly, e.g., "connected," which may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Referring to FIG. 1, an intelligent webpage information extraction method based on BERT and LSTM includes the following steps:
S1, crawling webpages: crawl the vertical-domain webpages to be extracted, collecting at least 3000 company pages as a training set and 500 pages as a test set. The aim of information extraction here is to train a neural network so that target information can be extracted automatically from websites of the same kind with similar structures, removing the need to write separate extraction rules for each website;
S2, webpage preprocessing: preprocess the webpages and clean out useless HTML tags. Preprocessing before training simplifies the webpage content; the preprocessed page has a very clear structure in which the position of the target information is easy to observe. For example, the type information immediately follows the 'type' node, and the unified social credit code immediately follows the 'unified social credit code' node. This positional structure is especially important for the system when judging the type of target information, so preprocessing is an important step in adding an ordinary webpage to the training set;
S3, webpage target labeling: label webpage targets automatically using hand-written rules, that is, manually inspect the HTML tags that hold the information to be extracted, then use XPath to attach the corresponding class label to each such tag. The purpose is to tell the system which information in the webpage structure needs to be obtained, so that it learns to recognize that information during training. Only a correctly labeled page can be added to the training set as a valid sample, which ensures the correctness of the training result;
S4, training a neural network model: build a vocabulary table and its mapping. To identify the keywords in text nodes, capture the features of the text nodes, and generate word vectors as input to the neural network, a vocabulary table must be built for the words encountered in the webpage information;
given the structural characteristics of information in a webpage, the characters in each text node behave more like a single unit, so an entry in the vocabulary table is created for each individual Chinese character rather than for each phrase, and a single entry is created for each parsed foreign word;
once the vocabulary is built, the webpage source code is split into tokens, and a pre-trained BERT model is fine-tuned to obtain word vectors;
based on the characteristics of information in a Web page, the model is built as a multi-layer neural network: a BERT layer produces the word vector representation of the page information and a softmax layer classifies the result, while the two middle layers are LSTM layers that learn the features of text information between pages. Given the strength of LSTM in natural language processing, combining Web information extraction with LSTM brings two additional advantages: it removes both the word segmentation step of text processing and the feature engineering step of information extraction. This is because webpage text is usually embedded in HTML tags in small fragments, and most of the text within a tag carries independent semantics. The model therefore takes the text node within a tag as its basic unit and does not need to segment the text of the whole webpage. A further advantage of the neural network is that it can establish fuzzy relationships between entities, through which the descriptive relationships between pieces of page information can be established.
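The vocabulary mapping described for the training step (one entry per Chinese character, one entry per foreign word) might look like the sketch below. The regular expression, the reserved `<unk>` entry, the sample text nodes, and the function names are assumptions made for illustration. The sketch covers only the vocabulary mapping, not BERT fine-tuning or the LSTM layers.

```python
import re

# One token per CJK character, or per run of Latin letters and digits
# (an assumed tokenization rule; the patent does not specify one).
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def tokenize(text):
    """Split a text node into single Chinese characters and whole foreign words."""
    return TOKEN_PATTERN.findall(text)


def build_vocab(text_nodes):
    """Map each distinct token to an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for node in text_nodes:
        for token in tokenize(node):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert a text node into the list of ids that a model would consume."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]


# Hypothetical text nodes from a preprocessed company page.
text_nodes = ["统一社会信用代码", "ABC123", "类型", "有限责任公司"]
vocab = build_vocab(text_nodes)
type_ids = encode("类型", vocab)
unknown_ids = encode("未知", vocab)
print(len(vocab), type_ids, unknown_ids)
```

Because every character receives its own entry, no separate Chinese word segmentation step is needed before the ids are passed to an embedding layer.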
S5, deploying the neural network model: perform automatic information extraction on webpages and evaluate the accuracy and recall of the model.
S6, after training is completed, use the obtained model to recognize target information in an input webpage and test the recognition performance on the 500 pages of the test set; during recognition the labeling information is ignored, and the type of the target information is judged from the content and context of each text node.
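The rule-based labeling of step S3 can be illustrated with the limited XPath support in Python's standard `xml.etree.ElementTree`. The sketch assumes the preprocessed page is well-formed markup. The `data-label` attribute name, the label names, the paths, and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical preprocessed page; well-formed so that ElementTree can parse it.
PAGE = """
<table>
  <tr><td>Unified social credit code</td><td>EXAMPLE-CODE-001</td></tr>
  <tr><td>Type</td><td>Limited liability company</td></tr>
</table>
"""

# Hand-written rules: class label -> XPath of the node that holds the value.
# Both the label names and the paths are assumptions for this example.
LABEL_RULES = {
    "credit_code": "./tr[1]/td[2]",
    "type": "./tr[2]/td[2]",
}


def label_page(xml_text, rules, attribute="data-label"):
    """Attach a class label attribute to every node selected by each rule."""
    root = ET.fromstring(xml_text)
    for label, path in rules.items():
        for node in root.findall(path):
            node.set(attribute, label)
    return root


labeled = label_page(PAGE, LABEL_RULES)
labels = {
    node.get("data-label"): node.text
    for node in labeled.iter("td")
    if node.get("data-label")
}
print(labels)
```

The labeled nodes supply the training targets. At recognition time the labels are ignored and the model has to infer the class from the node's text and its neighbors, as step S6 states.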
Further, the company page information includes a unified social credit code, a type, a business range, a residence, and the like.
Further, the neural network model is trained with the Nesterov Momentum optimization algorithm for 100 epochs on the training set, and the resulting training loss is less than 0.09. All weights are updated with a momentum factor of 0.9. In each training iteration, softmax cross entropy is used as the classification loss function to train every layer of the network. During back-propagation of the error, the Nesterov Momentum algorithm is used to compute the gradient values, and the network parameters are updated with the computed gradients. An L2 regularization term with a weight decay factor of 0.0005 is applied to all parameters of the network.
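As a rough illustration of the optimizer settings above, the following sketch implements one common formulation of the Nesterov momentum update with an L2 penalty and runs it on a toy quadratic objective. The momentum factor 0.9 and weight decay 0.0005 come from the text. The learning rate, the toy objective, the step count, and all function names are assumptions, and the softmax cross-entropy loss of the actual network is not reproduced here.

```python
def nesterov_step(weights, velocity, grad_fn, lr=0.1, momentum=0.9, weight_decay=0.0005):
    """One Nesterov momentum update with an L2 penalty on all weights."""
    # Evaluate the gradient at the look-ahead point w + momentum * v.
    lookahead = [w + momentum * v for w, v in zip(weights, velocity)]
    grads = grad_fn(lookahead)
    # The gradient of (weight_decay / 2) * w^2 is weight_decay * w.
    grads = [g + weight_decay * w for g, w in zip(grads, lookahead)]
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def toy_grad(weights):
    """Gradient of the toy objective 0.5 * sum((w - 3)^2), standing in for a real loss."""
    return [w - 3.0 for w in weights]


weights, velocity = [0.0, 10.0], [0.0, 0.0]
for _ in range(200):  # toy iteration count, unrelated to the 100 training epochs
    weights, velocity = nesterov_step(weights, velocity, toy_grad)
print(weights)
```

Both weights settle slightly below 3.0 because the L2 term pulls them toward zero, which is the intended effect of weight decay.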
In testing, the model's recognition accuracy on pages in the training set exceeded 98%, and its recall exceeded 97%.
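A minimal sketch of how such figures could be computed for one target class is given below. It assumes that the reported "accuracy" is precision in the usual information-extraction sense. The label sequences are invented and are not the patent's data.

```python
def precision_recall(predicted, actual, target_label):
    """Compute precision and recall for one class from aligned label sequences."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == target_label and a == target_label)
    false_pos = sum(1 for p, a in pairs if p == target_label and a != target_label)
    false_neg = sum(1 for p, a in pairs if p != target_label and a == target_label)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical per-node labels for a handful of text nodes.
actual = ["type", "other", "type", "residence", "type"]
predicted = ["type", "type", "type", "residence", "other"]
precision, recall = precision_recall(predicted, actual, "type")
print(precision, recall)
```

Precision measures how many nodes predicted as the target class are correct, while recall measures how many true target nodes were found.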
The foregoing is only a preferred embodiment of the present invention, and the scope of protection of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.
Claims (3)
1. An intelligent webpage information extraction method based on BERT and LSTM, characterized in that the method comprises the following steps:
S1, crawling webpages: crawl the vertical-domain webpages to be extracted, collecting at least 3000 company pages as a training set and 500 pages as a test set;
S2, webpage preprocessing: preprocess the webpages and clean out useless HTML tags;
S3, webpage target labeling: label webpage targets automatically using hand-written rules, that is, manually inspect the HTML tags that hold the information to be extracted, then use XPath to attach the corresponding class label to each such tag;
S4, training a neural network model: first build a vocabulary table and its mapping; then split the webpage source code into tokens and fine-tune a pre-trained BERT model to obtain word vectors;
S5, deploying the neural network model: perform automatic information extraction on webpages and evaluate the accuracy and recall of the model;
S6, after training is completed, use the obtained model to recognize target information in an input webpage and test the recognition performance on the 500 pages of the test set; during recognition the labeling information is ignored, and the type of the target information is judged from the content and context of each text node.
2. The intelligent webpage information extraction method based on BERT and LSTM as claimed in claim 1, wherein: the company page information includes a unified social credit code, a type, a business range, a residence, and the like.
3. The intelligent webpage information extraction method based on BERT and LSTM as claimed in claim 1, wherein: the neural network model is trained with the Nesterov Momentum optimization algorithm for 100 epochs on the training set, and the resulting training loss is less than 0.09.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010351978.8A CN111581476A (en) | 2020-04-28 | 2020-04-28 | Intelligent webpage information extraction method based on BERT and LSTM |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010351978.8A CN111581476A (en) | 2020-04-28 | 2020-04-28 | Intelligent webpage information extraction method based on BERT and LSTM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111581476A (en) | 2020-08-25 |
Family
ID=72126240
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010351978.8A (Pending; published as CN111581476A) | 2020-04-28 | 2020-04-28 | Intelligent webpage information extraction method based on BERT and LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111581476A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112069046A (en) * | 2020-08-28 | 2020-12-11 | 平安科技(深圳)有限公司 | Data leakage reminding method, device, equipment and computer readable storage medium |
CN112131404A (en) * | 2020-09-19 | 2020-12-25 | 哈尔滨工程大学 | Entity alignment method in four-risk one-gold domain knowledge graph |
CN112528190A (en) * | 2020-12-23 | 2021-03-19 | 中移(杭州)信息技术有限公司 | Web page tampering judgment method and device based on fragmentation structure and content and storage medium |
CN112667878A (en) * | 2020-12-31 | 2021-04-16 | 平安国际智慧城市科技股份有限公司 | Webpage text content extraction method and device, electronic equipment and storage medium |
CN113312568A (en) * | 2021-03-25 | 2021-08-27 | 罗普特科技集团股份有限公司 | Web information extraction method and system based on HTML source code and webpage snapshot |
CN113687831A (en) * | 2021-07-07 | 2021-11-23 | 杭州未名信科科技有限公司 | Method and device for generating data acquisition script, computer equipment and storage medium |
CN116975410A (en) * | 2023-09-22 | 2023-10-31 | 北京中关村科金技术有限公司 | Webpage data acquisition method and device, electronic equipment and readable storage medium |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112069046A (en) * | 2020-08-28 | 2020-12-11 | 平安科技(深圳)有限公司 | Data leakage reminding method, device, equipment and computer readable storage medium |
CN112069046B (en) * | 2020-08-28 | 2022-03-29 | 平安科技(深圳)有限公司 | Data leakage reminding method, device, equipment and computer readable storage medium |
CN112131404A (en) * | 2020-09-19 | 2020-12-25 | 哈尔滨工程大学 | Entity alignment method in four-risk one-gold domain knowledge graph |
CN112131404B (en) * | 2020-09-19 | 2022-09-27 | 哈尔滨工程大学 | Entity alignment method in four-risk one-gold domain knowledge graph |
CN112528190A (en) * | 2020-12-23 | 2021-03-19 | 中移(杭州)信息技术有限公司 | Web page tampering judgment method and device based on fragmentation structure and content and storage medium |
CN112667878A (en) * | 2020-12-31 | 2021-04-16 | 平安国际智慧城市科技股份有限公司 | Webpage text content extraction method and device, electronic equipment and storage medium |
CN113312568A (en) * | 2021-03-25 | 2021-08-27 | 罗普特科技集团股份有限公司 | Web information extraction method and system based on HTML source code and webpage snapshot |
CN113312568B (en) * | 2021-03-25 | 2022-06-17 | 罗普特科技集团股份有限公司 | Web information extraction method and system based on HTML source code and webpage snapshot |
CN113687831A (en) * | 2021-07-07 | 2021-11-23 | 杭州未名信科科技有限公司 | Method and device for generating data acquisition script, computer equipment and storage medium |
CN116975410A (en) * | 2023-09-22 | 2023-10-31 | 北京中关村科金技术有限公司 | Webpage data acquisition method and device, electronic equipment and readable storage medium |
CN116975410B (en) * | 2023-09-22 | 2023-12-19 | 北京中关村科金技术有限公司 | Webpage data acquisition method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111581476A (en) | Intelligent webpage information extraction method based on BERT and LSTM | |
CN110598000A (en) | Relationship extraction and knowledge graph construction method based on deep learning model | |
CN110134771A (en) | A kind of implementation method based on more attention mechanism converged network question answering systems | |
CN110516256A (en) | A kind of Chinese name entity extraction method and its system | |
CN106383816B (en) | The recognition methods of Chinese minority area place name based on deep learning | |
CN103823824B (en) | A kind of method and system that text classification corpus is built automatically by the Internet | |
CN108804612B (en) | Text emotion classification method based on dual neural network model | |
CN109493265A (en) | A kind of Policy Interpretation method and Policy Interpretation system based on deep learning | |
CN107203511A (en) | A kind of network text name entity recognition method based on neutral net probability disambiguation | |
CN108804689A (en) | The label recommendation method of the fusion hidden connection relation of user towards answer platform | |
CN108984775B (en) | Public opinion monitoring method and system based on commodity comments | |
CN112417880A (en) | Court electronic file oriented case information automatic extraction method | |
CN112883714B (en) | ABSC task syntactic constraint method based on dependency graph convolution and transfer learning | |
CN107491655A (en) | Liver diseases information intelligent consultation method and system based on machine learning | |
CN111881398B (en) | Page type determining method, device and equipment and computer storage medium | |
CN115017513A (en) | Intelligent contract vulnerability detection method based on artificial intelligence | |
CN114169447B (en) | Event detection method based on self-attention convolution bidirectional gating cyclic unit network | |
CN113779249B (en) | Cross-domain text emotion classification method and device, storage medium and electronic equipment | |
CN117539996A (en) | Consultation question-answering method and system based on user portrait | |
Schicchi et al. | Attention-based model for evaluating the complexity of sentences in English language | |
CN113343665B (en) | Commodity comment emotion analysis method and system based on aspect-level fine granularity | |
CN117077631A (en) | Knowledge graph-based engineering emergency plan generation method | |
Xu et al. | Automatic task requirements writing evaluation via machine reading comprehension | |
CN110413789A (en) | A kind of exercise automatic classification method based on SVM | |
CN114330350B (en) | Named entity recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||