CN111581476A - Intelligent webpage information extraction method based on BERT and LSTM - Google Patents

Intelligent webpage information extraction method based on BERT and LSTM

Info

Publication number
CN111581476A
Authority
CN
China
Prior art keywords
webpage
information
training
information extraction
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010351978.8A
Other languages
Chinese (zh)
Inventor
王敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Hezong Data Technology Co ltd
Original Assignee
Shenzhen Hezong Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-04-28
Filing date
2020-04-28
Publication date
2020-08-25
Application filed by Shenzhen Hezong Data Technology Co ltd
Priority to CN202010351978.8A
Publication of CN111581476A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of Internet information extraction and mining, and discloses an intelligent webpage information extraction method based on BERT and LSTM, comprising the following steps: S1, crawling webpages; S2, preprocessing the webpages; S3, labeling webpage targets; S4, training a neural network model; S5, deploying the neural network model; S6, after training is completed, using the resulting model to identify target information in input webpages. With this scheme, information can still be located accurately after page content is updated or the webpage structure changes, so that an accurate extraction result is obtained. Information can thus be extracted from webpages intelligently, which reduces the cost and increases the speed of webpage information extraction. In testing, the model's recognition accuracy on the pages in the training set exceeds 98%, and its recall exceeds 97%.

Description

Intelligent webpage information extraction method based on BERT and LSTM
Technical Field
The invention relates to the technical field of Internet information extraction and mining, in particular to an intelligent webpage information extraction method based on BERT and LSTM.
Background
With the rapid development of science and technology, the Web has become the largest encyclopedic database in the world. Although users can conveniently access information on the Web through applications such as browsers, retrieving information on the Internet remains difficult. If the information in Internet webpages could be extracted and stored in structured form as a knowledge graph, searching would become more accurate and convenient.
Information extraction based on HTML structure is the approach generally adopted by web crawlers. Once the position of the information within the HTML has been located, extraction rules matching the current HTML page format are written using regular expressions, HTML selectors, and similar means, and the data are extracted from the HTML document. This is currently the most widely used extraction approach. However, it scales poorly, requires a large amount of manual work, and is very costly.
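To illustrate why such hand-written rules are fragile, the following minimal sketch applies a regular-expression rule to two page fragments. The markup, the field name, and the function `extract_type` are hypothetical and are not taken from the patent; they only show a rule that matches one layout and silently fails on another.

```python
import re

# Hypothetical page fragments; the markup and field are illustrative only.
OLD_LAYOUT = '<td class="label">Type</td><td class="value">Limited liability company</td>'
NEW_LAYOUT = '<div class="row"><span>Type</span><span>Limited liability company</span></div>'

# A hand-written extraction rule tied to the old table layout.
TYPE_RULE = re.compile(r'<td class="label">Type</td><td class="value">(.*?)</td>')


def extract_type(html):
    """Return the company type, or None if the rule no longer matches."""
    match = TYPE_RULE.search(html)
    return match.group(1) if match else None


print(extract_type(OLD_LAYOUT))  # the rule matches the layout it was written for
print(extract_type(NEW_LAYOUT))  # the same content in a new layout is missed
```

The content is identical in both fragments, yet the rule has to be rewritten by hand whenever the layout changes.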
The concept of deep learning derives from research on artificial neural networks. Deep learning is a branch of machine learning and comprises a set of algorithms that model high-level abstractions in data. Compared with shallow learning, it can combine low-level features across multiple levels to form more abstract high-level features, so that features are learned automatically without human involvement in feature selection.
The rapid development of deep learning in recent years began in 2006, when Hinton et al. of the University of Toronto, Canada, proposed a layer-wise pre-training algorithm based on the Deep Belief Network (DBN). The algorithm successfully addressed the poor learning performance of deep architectures and brought hope for solving the optimization problems associated with them.
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based bidirectional encoder representation. Bidirectional means that, when processing a word, the model can take into account the words both before and after it, and thereby capture the semantics of the context.
The Long Short-Term Memory model (LSTM) is a particular RNN architecture intended to solve the long-term dependency problem in neural networks, that is, to provide the ability to use past sequence information to infer current sequence information. The basic node in an LSTM is called a cell. A state parameter in the cell is responsible for memorizing information, and the input and output interact with the cell through an input gate and an output gate, respectively. A forget gate is also added to the model to discard stale memorized information, which gives the model a degree of memory.
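The gate mechanism can be sketched with a single scalar LSTM cell. This is the standard textbook formulation, not code from the patent; the function `lstm_step` and the weight layout are assumptions chosen for readability, and a real model would use vectors and learned weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    w maps each gate name to a (w_x, w_h, bias) tuple (illustrative layout).
    Returns the new hidden state h and the new cell state c.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # input gate: how much new information to store
    f = sigmoid(pre("forget"))       # forget gate: how much old memory to keep
    o = sigmoid(pre("output"))       # output gate: how much memory to expose
    g = math.tanh(pre("candidate"))  # candidate value to write into the cell
    c = f * c_prev + i * g           # cell state carries the memory
    h = o * math.tanh(c)
    return h, c


# With the forget gate saturated open and the input gate closed,
# the previous cell state passes through unchanged.
keep = {"input": (0, 0, -50), "forget": (0, 0, 50),
        "output": (0, 0, 50), "candidate": (0, 0, 0)}
h, c = lstm_step(1.0, 0.0, 0.7, keep)
```

Closing the forget gate instead (a large negative bias) drives the retained memory toward zero, which is the discarding behaviour described above.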
The prior art has the following problems. Accurate information extraction requires a large amount of manual work, in which a person analyzes the page structure to locate the information to be extracted. Because webpage information changes dynamically and is updated in real time, updates to page content or changes to the webpage structure easily cause information location to fail, which in turn leads to extraction failures or inaccurate extraction results.
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problems in the prior art, the invention aims to provide an intelligent webpage information extraction method based on BERT and LSTM, which can still accurately position information after page content is updated and a webpage structure is changed, so as to obtain an accurate extraction result.
2. Technical scheme
In order to solve the above problems, the present invention adopts the following technical solutions.
An intelligent webpage information extraction method based on BERT and LSTM comprises the following steps:
S1, crawling webpages: crawl the vertical-domain webpages from which information is to be extracted, collecting at least 3000 company pages as a training set and 500 pages as a test set;
S2, webpage preprocessing: preprocess each webpage and remove useless HTML tags;
S3, webpage target labeling: label webpage targets automatically using manually written rules, that is, manually inspect the HTML tags that hold the information to be extracted, and then use XPath to attach the relevant classification marks to those tags;
S4, training a neural network model: first build a vocabulary table (word bank) and its mapping; then split the webpage source code into individual tokens; and then fine-tune a pre-trained BERT model to obtain word vectors;
S5, deploying the neural network model: perform automatic information extraction on webpages and evaluate the model's accuracy and recall;
S6, after training is completed, use the resulting model to identify target information in input webpages, test the recognition performance on the 500 pages of the test set, ignore the labeling information during recognition, and determine the type of the target information from the content and context of each text node.
Preferably, the company page information includes a unified social credit code, a type, a business scope, a residence, and the like.
Preferably, the Nesterov Momentum algorithm is adopted as the optimization algorithm for training the neural network model, training runs for 100 epochs on the training set, and the resulting training loss is less than 0.09.
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
(1) The method is applicable to fields such as Internet information extraction and mining and open knowledge graph construction. It can still locate information accurately after page content is updated or the webpage structure changes, and so obtains an accurate extraction result. Information can therefore be extracted from webpages intelligently, which reduces the cost and increases the speed of webpage information extraction.
(2) In testing, the model's recognition accuracy on the pages in the training set exceeds 98%, and its recall exceeds 97%.
Drawings
FIG. 1 is a diagram of a neural network architecture according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work are within the scope of the present invention.
In the description of the present invention, it should be noted that the terms "upper", "lower", "inner", "outer", "top/bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "disposed," "sleeved/connected," "connected," and the like are to be construed broadly, e.g., "connected," which may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
Referring to FIG. 1, an intelligent webpage information extraction method based on BERT and LSTM includes the following steps:
S1, crawling webpages: crawl the vertical-domain webpages from which information is to be extracted, collecting at least 3000 company pages as a training set and 500 pages as a test set. The aim of information extraction here is to train a neural network so that target information can be extracted automatically from similar websites with similar structures, which removes the need to write separate extraction rules for each website;
S2, webpage preprocessing: preprocess each webpage and remove useless HTML tags. Preprocessing before training simplifies the webpage content. The preprocessed webpage has a very clear structure in which the position of the target information is easy to observe: for example, the type information immediately follows the "type" node, and the unified social credit code information immediately follows the "unified social credit code" node. This positional structure is especially important for the system when judging the type of target information, so preprocessing is an important step in adding an ordinary webpage to the training set;
S3, webpage target labeling: label webpage targets automatically using manually written rules, that is, manually inspect the HTML tags that hold the information to be extracted, and then use XPath to attach the relevant classification marks to those tags. The purpose is to tell the system which information in the webpage structure needs to be obtained, so that it can be recognized during training. Only a correctly labeled page can be added to the training set as a valid sample, which ensures the correctness of the training result;
S4, training a neural network model: build a vocabulary table (word bank) and its mapping. To identify the key words in text nodes, recognize the features of those nodes, and generate word vectors as input to the neural network, a vocabulary table must be built for the words encountered in the webpage information;
given the structural characteristics of information in a webpage, the characters in each text node behave more like a single unit, so a mapping into the vocabulary table is created for each individual Chinese character rather than for each phrase, and a single mapping is created for each foreign word that is parsed out;
once the vocabulary is built, the webpage source code is split into individual tokens, and a pre-trained BERT model is then fine-tuned to obtain word vectors;
based on the characteristics of information in a Web page, the model is built as a multi-layer neural network. A BERT layer provides the word vector representation of the page information and a softmax layer performs the final classification, while the two middle layers are LSTM layers that learn the features of the text information across pages. Given the strength of LSTM in natural language processing, combining Web information extraction with LSTM brings two additional advantages: the word segmentation step of text processing and the feature engineering step of information extraction can both be omitted. This is possible because the text of a Web page is usually embedded in HTML tags in small fragments, and most of the text within a tag carries independent semantics. The model is therefore built directly on the text nodes inside the tags as its unit, without segmenting the text of the whole webpage into words. A further advantage of neural networks is that they can establish fuzzy relationships between entities, and through these relationships the descriptive relationships between pieces of page information can be established.
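The vocabulary mapping described in step S4 (one entry per Chinese character, one entry per foreign word) can be sketched as follows. The regular expression, the treatment of digits as part of a foreign word, the `<pad>` and `<unk>` entries, and the function names are assumptions made for illustration; the patent does not specify them.

```python
import re

# One token per CJK ideograph; a run of Latin letters or digits stays whole.
# This character range and the handling of digits are illustrative choices.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def tokenize(text):
    """Split a text node into single Chinese characters and whole foreign words."""
    return TOKEN_RE.findall(text)


def build_vocab(text_nodes):
    """Map every token seen in the text nodes to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids, an assumed convention
    for node in text_nodes:
        for token in tokenize(node):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert a text node into ids, using <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]


vocab = build_vocab(["类型", "ABC公司"])
ids = encode("类型XYZ", vocab)  # "XYZ" was never seen, so it maps to <unk>
```

The resulting id sequences are what an embedding layer such as BERT would consume; that downstream step is not shown here.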
S5, deploying the neural network model: perform automatic information extraction on webpages and evaluate the model's accuracy and recall.
S6, after training is completed, use the resulting model to identify target information in input webpages, test the recognition performance on the 500 pages of the test set, ignore the labeling information during recognition, and determine the type of the target information from the content and context of each text node.
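Steps S2 and S3 above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: which tags count as "useless" (`script` and `style` here), the `data-label` attribute, the sample markup, and the XPath rules are all hypothetical. `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so a production pipeline would likely use a more tolerant HTML parser.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

SKIPPED_TAGS = {"script", "style"}  # assumed examples of "useless" tags


class TextNodeCollector(HTMLParser):
    """S2 sketch: collect visible text nodes, dropping script/style content."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.nodes.append(text)


def clean_text_nodes(html):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def label_nodes(markup, rules):
    """S3 sketch: rules maps a class label to an XPath expression.

    Each matched element is marked with a data-label attribute and
    returned as a (text, label) pair.
    """
    root = ET.fromstring(markup)
    labelled = []
    for label, xpath in rules.items():
        for element in root.findall(xpath):
            element.set("data-label", label)  # mark the tag itself
            labelled.append(((element.text or "").strip(), label))
    return labelled


TABLE = ("<table><tr><td>Type</td><td>Limited liability company</td></tr>"
         "<tr><td>Residence</td><td>Shenzhen</td></tr></table>")
RULES = {"type": "./tr[1]/td[2]", "residence": "./tr[2]/td[2]"}
pairs = label_nodes(TABLE, RULES)
```

The XPath rules are written once per site by a person; the labelled pages they produce become training samples, so the trained model no longer depends on those rules at recognition time.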
Further, the company page information includes a unified social credit code, a type, a business scope, a residence, and the like.
Further, the Nesterov Momentum algorithm is used as the optimization algorithm for training the neural network model. All weights are updated with a momentum factor of 0.9. In each training iteration, softmax cross entropy is used as the classification loss function to train every layer of the network; during backpropagation of the error, the Nesterov Momentum algorithm computes the gradient values, which are then used to update the network parameters. An L2 regularization term with a weight decay factor of 0.0005 is applied to all parameters of the network. Training runs for 100 epochs on the training set, and the resulting training loss is less than 0.09.
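The update rule can be sketched on a toy problem. The momentum factor 0.9 and the weight decay factor 0.0005 come from the description; the learning rate, the three-class toy setup in which the logits themselves are the trainable parameters, and all function names are assumptions for illustration and do not reproduce the patent's network or its reported loss.

```python
import math

MOMENTUM = 0.9         # from the description
WEIGHT_DECAY = 0.0005  # from the description (L2 regularization)
LEARNING_RATE = 0.1    # assumption: the source gives no learning rate


def softmax_cross_entropy(logits, target):
    """Cross-entropy loss of softmax(logits) against a target class index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def softmax_cross_entropy_grad(logits, target):
    """Gradient of the loss with respect to the logits: softmax minus one-hot."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total - (1.0 if k == target else 0.0) for k, e in enumerate(exps)]


def nesterov_step(params, velocity, grad_fn):
    """One Nesterov momentum update with L2 weight decay.

    The gradient is evaluated at the look-ahead point params + MOMENTUM * velocity.
    """
    lookahead = [p + MOMENTUM * v for p, v in zip(params, velocity)]
    grads = grad_fn(lookahead)
    new_velocity = [
        MOMENTUM * v - LEARNING_RATE * (g + WEIGHT_DECAY * p)
        for v, g, p in zip(velocity, grads, lookahead)
    ]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


def train(target, steps=100):
    """Toy run: optimize three logits directly toward one target class."""
    logits, velocity = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits, velocity = nesterov_step(
            logits, velocity, lambda p: softmax_cross_entropy_grad(p, target))
    return logits


final_logits = train(target=1)
print(softmax_cross_entropy([0.0, 0.0, 0.0], 1), softmax_cross_entropy(final_logits, 1))
```

Evaluating the gradient at the look-ahead point is what distinguishes Nesterov momentum from classical momentum, which evaluates it at the current parameters.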
In testing, the model's recognition accuracy on the pages in the training set exceeds 98%, and its recall exceeds 97%.
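How such figures might be computed can be sketched as follows, assuming that "accuracy" here refers to per-class precision; the source does not define the metric, and the labels and predictions below are made up for illustration.

```python
def precision_recall(predicted, gold, label):
    """Per-class precision and recall over aligned lists of text-node labels."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == label and g == label)  # correct hits
    fp = sum(1 for p, g in pairs if p == label and g != label)  # false alarms
    fn = sum(1 for p, g in pairs if p != label and g == label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for five text nodes.
gold = ["type", "other", "type", "residence", "type"]
pred = ["type", "type", "other", "residence", "type"]
precision, recall = precision_recall(pred, gold, "type")
```

Precision measures how many of the nodes labelled as a given type are correct, while recall measures how many of the true nodes of that type were found.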
The foregoing is only a preferred embodiment of the present invention; the scope of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope of the present invention shall be covered by the scope of the present invention.

Claims (3)

1. An intelligent webpage information extraction method based on BERT and LSTM, characterized in that the method comprises the following steps:
S1, crawling webpages: crawl the vertical-domain webpages from which information is to be extracted, collecting at least 3000 company pages as a training set and 500 pages as a test set;
S2, webpage preprocessing: preprocess each webpage and remove useless HTML tags;
S3, webpage target labeling: label webpage targets automatically using manually written rules, that is, manually inspect the HTML tags that hold the information to be extracted, and then use XPath to attach the relevant classification marks to those tags;
S4, training a neural network model: first build a vocabulary table (word bank) and its mapping; then split the webpage source code into individual tokens; and then fine-tune a pre-trained BERT model to obtain word vectors;
S5, deploying the neural network model: perform automatic information extraction on webpages and evaluate the model's accuracy and recall;
S6, after training is completed, use the resulting model to identify target information in input webpages, test the recognition performance on the 500 pages of the test set, ignore the labeling information during recognition, and determine the type of the target information from the content and context of each text node.
2. The intelligent webpage information extraction method based on BERT and LSTM as claimed in claim 1, wherein: the company page information includes a unified social credit code, a type, a business scope, a residence, and the like.
3. The intelligent webpage information extraction method based on BERT and LSTM as claimed in claim 1, wherein: the Nesterov Momentum algorithm is adopted as the optimization algorithm for training the neural network model, training runs for 100 epochs on the training set, and the resulting training loss is less than 0.09.
CN202010351978.8A 2020-04-28 2020-04-28 Intelligent webpage information extraction method based on BERT and LSTM Pending CN111581476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010351978.8A CN111581476A (en) 2020-04-28 2020-04-28 Intelligent webpage information extraction method based on BERT and LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010351978.8A CN111581476A (en) 2020-04-28 2020-04-28 Intelligent webpage information extraction method based on BERT and LSTM

Publications (1)

Publication Number Publication Date
CN111581476A true CN111581476A (en) 2020-08-25

Family

ID=72126240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010351978.8A Pending CN111581476A (en) 2020-04-28 2020-04-28 Intelligent webpage information extraction method based on BERT and LSTM

Country Status (1)

Country Link
CN (1) CN111581476A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069046A (en) * 2020-08-28 2020-12-11 平安科技(深圳)有限公司 Data leakage reminding method, device, equipment and computer readable storage medium
CN112131404A (en) * 2020-09-19 2020-12-25 哈尔滨工程大学 Entity alignment method in four-risk one-gold domain knowledge graph
CN112528190A (en) * 2020-12-23 2021-03-19 中移(杭州)信息技术有限公司 Web page tampering judgment method and device based on fragmentation structure and content and storage medium
CN112667878A (en) * 2020-12-31 2021-04-16 平安国际智慧城市科技股份有限公司 Webpage text content extraction method and device, electronic equipment and storage medium
CN113312568A (en) * 2021-03-25 2021-08-27 罗普特科技集团股份有限公司 Web information extraction method and system based on HTML source code and webpage snapshot
CN113687831A (en) * 2021-07-07 2021-11-23 杭州未名信科科技有限公司 Method and device for generating data acquisition script, computer equipment and storage medium
CN116975410A (en) * 2023-09-22 2023-10-31 北京中关村科金技术有限公司 Webpage data acquisition method and device, electronic equipment and readable storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069046A (en) * 2020-08-28 2020-12-11 平安科技(深圳)有限公司 Data leakage reminding method, device, equipment and computer readable storage medium
CN112069046B (en) * 2020-08-28 2022-03-29 平安科技(深圳)有限公司 Data leakage reminding method, device, equipment and computer readable storage medium
CN112131404A (en) * 2020-09-19 2020-12-25 哈尔滨工程大学 Entity alignment method in four-risk one-gold domain knowledge graph
CN112131404B (en) * 2020-09-19 2022-09-27 哈尔滨工程大学 Entity alignment method in four-risk one-gold domain knowledge graph
CN112528190A (en) * 2020-12-23 2021-03-19 中移(杭州)信息技术有限公司 Web page tampering judgment method and device based on fragmentation structure and content and storage medium
CN112667878A (en) * 2020-12-31 2021-04-16 平安国际智慧城市科技股份有限公司 Webpage text content extraction method and device, electronic equipment and storage medium
CN113312568A (en) * 2021-03-25 2021-08-27 罗普特科技集团股份有限公司 Web information extraction method and system based on HTML source code and webpage snapshot
CN113312568B (en) * 2021-03-25 2022-06-17 罗普特科技集团股份有限公司 Web information extraction method and system based on HTML source code and webpage snapshot
CN113687831A (en) * 2021-07-07 2021-11-23 杭州未名信科科技有限公司 Method and device for generating data acquisition script, computer equipment and storage medium
CN116975410A (en) * 2023-09-22 2023-10-31 北京中关村科金技术有限公司 Webpage data acquisition method and device, electronic equipment and readable storage medium
CN116975410B (en) * 2023-09-22 2023-12-19 北京中关村科金技术有限公司 Webpage data acquisition method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111581476A (en) Intelligent webpage information extraction method based on BERT and LSTM
CN110598000A (en) Relationship extraction and knowledge graph construction method based on deep learning model
CN110134771A (en) A kind of implementation method based on more attention mechanism converged network question answering systems
CN110516256A (en) A kind of Chinese name entity extraction method and its system
CN106383816B (en) The recognition methods of Chinese minority area place name based on deep learning
CN103823824B (en) A kind of method and system that text classification corpus is built automatically by the Internet
CN108804612B (en) Text emotion classification method based on dual neural network model
CN109493265A (en) A kind of Policy Interpretation method and Policy Interpretation system based on deep learning
CN107203511A (en) A kind of network text name entity recognition method based on neutral net probability disambiguation
CN108804689A (en) The label recommendation method of the fusion hidden connection relation of user towards answer platform
CN108984775B (en) Public opinion monitoring method and system based on commodity comments
CN112417880A (en) Court electronic file oriented case information automatic extraction method
CN112883714B (en) ABSC task syntactic constraint method based on dependency graph convolution and transfer learning
CN107491655A (en) Liver diseases information intelligent consultation method and system based on machine learning
CN111881398B (en) Page type determining method, device and equipment and computer storage medium
CN115017513A (en) Intelligent contract vulnerability detection method based on artificial intelligence
CN114169447B (en) Event detection method based on self-attention convolution bidirectional gating cyclic unit network
CN113779249B (en) Cross-domain text emotion classification method and device, storage medium and electronic equipment
CN117539996A (en) Consultation question-answering method and system based on user portrait
Schicchi et al. Attention-based model for evaluating the complexity of sentences in English language
CN113343665B (en) Commodity comment emotion analysis method and system based on aspect-level fine granularity
CN117077631A (en) Knowledge graph-based engineering emergency plan generation method
Xu et al. Automatic task requirements writing evaluation via machine reading comprehension
CN110413789A (en) A kind of exercise automatic classification method based on SVM
CN114330350B (en) Named entity recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination