WO2023155303A1 - Method and apparatus for extracting webpage data, computer device, and storage medium - Google Patents
- Publication number: WO2023155303A1 (application PCT/CN2022/090719)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sample
- path
- node
- sequence
- target
- Prior art date
Classifications
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9027—Indexing; Data structures therefor; Storage structures: Trees
- G06F16/9035—Querying: Filtering based on additional data, e.g. user or group profiles
- G06F16/906—Clustering; Classification
- G06F18/22—Pattern recognition: Matching criteria, e.g. proximity measures
- G06N3/044—Neural networks: Recurrent networks, e.g. Hopfield networks
- G06N3/084—Neural network learning methods: Backpropagation, e.g. using gradient descent
Definitions
- the present application relates to the technical field of artificial intelligence, in particular to a method and device for extracting webpage data, computer equipment, and storage media.
- the embodiment of the present application proposes a method for extracting web page data, the method comprising:
- the node sequence includes a root node and a plurality of label nodes
- each of the node paths is a path from each of the label nodes to the root node;
- the embodiment of the present application proposes a device for extracting web page data, including:
- a first obtaining module, used to obtain the source code data of the target webpage;
- a data parsing module, used to parse the source code data to obtain a corresponding DOM tree;
- a traversal module, used to traverse the DOM tree to obtain a corresponding node sequence; wherein the node sequence includes a root node and a plurality of label nodes;
- a second obtaining module, used to obtain multiple node paths of the node sequence; wherein each of the node paths is a path from each of the label nodes to the root node;
- a third obtaining module, used to obtain a first target path from a preset sample set according to the plurality of node paths;
- a path screening module, used to input the first target path into the pre-training model for path screening processing to obtain a second target path;
- a data extraction module, used to extract corresponding target webpage data from the source code data according to the second target path.
- the embodiment of the present application provides a computer device, the computer device includes a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor is used to execute a method for extracting webpage data, wherein the method for extracting webpage data includes:
- the node sequence includes a root node and a plurality of label nodes
- each of the node paths is a path from each of the label nodes to the root node;
- the embodiment of the present application provides a storage medium, the storage medium is a computer-readable storage medium, and the storage medium stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute a method for extracting webpage data, wherein the method for extracting webpage data includes:
- the node sequence includes a root node and a plurality of label nodes
- each of the node paths is a path from each of the label nodes to the root node;
- the webpage data extraction method and device, computer device, and storage medium proposed in the embodiments of the present application obtain the source code data of the target webpage and parse the source code data to obtain the corresponding DOM tree; the DOM tree is traversed to obtain the corresponding node sequence, wherein the node sequence includes a root node and multiple label nodes; multiple node paths of the node sequence are obtained, wherein each node path is the path from a label node to the root node; the first target path is obtained from a preset sample set according to the multiple node paths, and the first target path is input into the pre-training model for path screening processing to obtain the second target path; the corresponding target webpage data is then extracted from the source code data according to the second target path.
- a pre-training model is used to analyze the label nodes in the first target path, so that for webpages of the same type the second target path can be screened out from the first target path based on the same pre-training model, and the target webpage data can be extracted directly from the source code data using the second target path, without manually constructing a dedicated path template, thereby improving the efficiency of webpage data extraction.
- Fig. 1 is the first flowchart of the method for extracting web page data provided by the embodiment of the present application
- Fig. 2 is the flowchart of step S105 in Fig. 1;
- Fig. 3 is the second flowchart of the method for extracting webpage data provided by the embodiment of the present application.
- Fig. 4 is the third flowchart of the method for extracting webpage data provided by the embodiment of the present application.
- Fig. 5 is a flowchart of step S403 in Fig. 4;
- Fig. 6 is the fourth flowchart of the method for extracting webpage data provided by the embodiment of the present application.
- Fig. 7 is the fifth flowchart of the method for extracting web page data provided by the embodiment of the present application.
- FIG. 8 is a block diagram of a module structure of a device for extracting webpage data provided by an embodiment of the present application.
- FIG. 9 is a schematic diagram of a hardware structure of a computer device provided by an embodiment of the present application.
- Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence; artificial intelligence is a branch of computer science. AI attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- Hypertext Markup Language (HTML) is a markup language. It consists of a series of tags, through which the document format on the network can be unified and scattered Internet resources can be connected into a logical whole.
- HTML text is a descriptive text composed of HTML commands, which can explain text, graphics, animations, sounds, tables, links, etc.
- Hypertext is a way of organizing information. It associates text, graphics and other information media in the text through hyperlinks. These interrelated information media may be in the same text, or may be other files, or files on a computer located at a geographically distant location.
- XPath is a language used to locate a certain part of an XML document. XPath is based on the tree structure of XML and provides the ability to find nodes in the data structure tree. XPath was originally regarded as a general syntax model between XPointer and XSL; currently, developers have adopted XPath as a small query language. XPath uses path expressions to select nodes in an XML document: a node is selected by following a path or a series of steps.
- LXML is a third-party parsing library for Python, built on the C libraries libxml2 and libxslt. It provides good support for XPath expressions, so it can efficiently parse HTML and XML documents.
- Python is a programming language that provides efficient high-level data structures for simple and effective object-oriented programming.
- Web crawler It is a program or script that automatically grabs information on the World Wide Web according to certain rules.
- DOM: Document Object Model, a tree-structured, programmatic representation of an HTML or XML document.
- label-studio is a data labeling tool used to connect data import, data annotation, and the use of labeled data for model training.
- text2vec mainly provides a simple and efficient API framework for text analysis and natural language processing. Because it is written in C++, with many parts (such as GloVe) making full use of packages such as RcppParallel for parallel operation, processing is fast. In addition, its streaming processor does not need to load all the data into memory for analysis, so memory is used effectively. The package fully takes into account the huge volumes of data processed in NLP.
- Encoding (encoder) converts an input sequence into a fixed-length vector; decoding (decoder) converts the previously generated fixed-length vector into an output sequence. The input sequence can be text, speech, images, or video; the output sequence can be text or images.
- BiLSTM (Bi-directional Long Short-Term Memory) is composed of a forward LSTM and a backward LSTM. It is well suited to sequence labeling tasks with contextual dependencies, so it is often used to model context information in NLP.
- Conditional random field It is a discriminative probability model and a type of random field, which is often used to label or analyze sequence data, such as natural language text or biological sequences.
- Embedding is a kind of vector representation: an object, which can be a word, a commodity, a movie, etc., is represented by a low-dimensional vector. The nature of the embedding vector is that objects whose vectors are close in distance have similar meanings. For example, the distance between embedding (Avengers) and embedding (Iron Man) will be very close, while the distance between embedding (Avengers) and embedding (Gone with the Wind) will be farther.
- Embedding is essentially a mapping from semantic space to vector space, while maintaining the relationship of the original sample in the semantic space as much as possible in the vector space.
- Embedding can encode an object with a low-dimensional vector and retain its meaning. It is often used in machine learning. In the process of building a machine learning model, the object is encoded as a low-dimensional dense vector and then passed to DNN to improve efficiency.
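The nearness of embedding vectors can be illustrated with cosine similarity. Below is a minimal sketch using tiny hand-invented 3-dimensional vectors; real embeddings would come from a trained model and have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings, invented purely for illustration.
embedding = {
    "avengers":           [0.90, 0.80, 0.10],
    "iron_man":           [0.85, 0.75, 0.20],
    "gone_with_the_wind": [0.10, 0.20, 0.95],
}

sim_close = cosine_similarity(embedding["avengers"], embedding["iron_man"])
sim_far = cosine_similarity(embedding["avengers"], embedding["gone_with_the_wind"])
```

With these toy vectors, the Avengers/Iron Man pair scores much higher than the Avengers/Gone with the Wind pair, mirroring the distance relationship described above.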
- Dropout is a technique to prevent model overfitting. During the training of a deep learning network, neural network units are temporarily discarded from the network with a certain probability, which makes the model more robust because it does not depend too heavily on particular local features (since those local features may be discarded).
- The Adam optimizer combines the advantages of the AdaGrad and RMSProp optimization algorithms: the update step size is calculated by comprehensively considering the first-order and second-order moment estimates of the gradient.
- R-Drop Unlike traditional constraint methods that act on neurons or model parameters, R-Drop acts on the output layer of the model to make up for the inconsistency of Dropout during training and testing. That is, in each mini-batch, each data sample passes the same model with Dropout twice, and R-Drop uses KL-divergence to constrain the output of the two times to be consistent. Therefore, R-Drop constrains the output consistency of the two random sub-models due to Dropout.
- GNE (GeneralNewsExtractor) is a general news-website text extraction module. Given the HTML of a news webpage, it outputs the text content, title, author, release time, image addresses within the text, and the source code of the tag containing the text.
- the embodiments of the present application provide a method and device for extracting web page data, computer equipment, and a storage medium, which can improve the efficiency of extracting web page data.
- the embodiments of the present application provide a method and device for extracting webpage data, computer equipment, and storage media, which are specifically described through the following embodiments. First, the method for extracting webpage data in the embodiments of the present application is described.
- AI: artificial intelligence
- the embodiments of the present application may acquire and process relevant data based on artificial intelligence technology.
- artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- the method for extracting web page data provided in the embodiment of the present application relates to the field of artificial intelligence.
- the method for extracting web page data provided by the embodiment of the present application can be applied to a terminal or a server, and can also be software running on the terminal or the server.
- the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, or a smart watch;
- the server end can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software can be an application that implements the method for extracting webpage data, but is not limited to the above forms.
- the embodiments of the present application can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
- This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
- program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
- program modules may be located in both local and remote computer storage media including storage devices.
- the method for extracting web page data according to the first aspect of the embodiment of the present application includes but is not limited to steps S101 to S107.
- Step S101 obtaining the source code data of the target web page
- Step S102 analyzing and processing the source code data to obtain a corresponding DOM tree
- Step S103 traversing the DOM tree to obtain the corresponding node sequence
- Step S104 obtaining multiple node paths of the node sequence
- Step S105 obtaining a first target path from a preset sample set according to the multiple node paths;
- Step S106 inputting the first target path into the pre-training model for path screening processing to obtain the second target path;
- Step S107 extracting corresponding target webpage data from the source code data according to the second target path.
- the source code data of the target webpage is obtained, wherein the target webpage refers to the webpage from which the user needs to extract data, such as a news webpage, and the data to be extracted includes, for example, the title, time, and text content of the webpage.
- source code data refers to the HTML source code corresponding to the target webpage, including a series of webpage tags, through which the document format on the page can be unified and scattered Internet resources can be connected into a logical whole.
- the HTML source code also includes script data and style sheet data, as well as many types of attribute values, including but not limited to ID, name, number, length unit, language, media descriptor, color, character encoding, date and time, etc.
- a web crawler tool may be used to crawl the HTML source code corresponding to the URL of the target web page.
- in step S102 of some embodiments, the source code data is parsed to obtain a corresponding DOM tree. Specifically, this can be divided into two steps, tag parsing and DOM tree construction, with the following process:
- Tag parsing step This step mainly completes the function of parsing out web page tags from the HTML source code, mainly using tokenization algorithms.
- the output of the tokenization algorithm is an HTML token; the algorithm is implemented as a state machine.
- the state machine has four states: the data state (Data), the tag open state (Tag open), the tag name state (Tag name), and the close tag open state (Close tag open).
- the initial state of the state machine is the data state.
- when the character "<" is received, the state changes to the tag open state; when a character from "a" to "z" is then received, a start tag is created and the state changes to the tag name state, which is kept until the character ">" is received.
- the characters received during this period form the new tag name; when the character ">" is received, the current new tag is sent to the tree builder and the state machine returns to the data state.
- when the character "/" follows "<", the state machine enters the close tag open state and then the tag name state until the character ">" is received, at which point the current new tag is sent to the tree builder and the state machine returns to the data state.
- in the data state, each character is made into a character token and sent to the tree builder.
- after the tag parser parses out the webpage tags, it sends them to the DOM tree builder, which mainly consists of a DOM tree and a stack for storing webpage tag names. Specifically, after the DOM tree builder receives an initial tag name from the tag parser, it pushes it onto the stack. Assuming the current stack stores the three tags <html><body><h1> in turn, parsing continues; when a </h1> is received from the state machine, since </h1> is an end tag, the stack is queried.
- if the tag on the top of the stack and the incoming end tag are the same type of tag, such as <h1>, the tag is popped, the node is added to the DOM tree, and parsing continues downward.
- when the stack is empty, that is, when the <html> root node has also been added to the DOM tree, the DOM tree is fully built.
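The stack-based construction described above can be sketched in Python with the standard library's html.parser; the node representation and the sample document are illustrative only:

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Minimal sketch of the stack-based DOM construction described above:
    start tags are pushed onto a stack; a matching end tag pops the stack
    and the completed node remains attached to its parent."""
    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open (unclosed) tag nodes

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": [], "text": ""}
        if self.stack:
            self.stack[-1]["children"].append(node)
        else:
            self.root = node
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop until the matching open tag has been removed
        # (tolerates mildly malformed nesting).
        while self.stack and self.stack.pop()["tag"] != tag:
            pass

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()

builder = TreeBuilder()
builder.feed("<html><body><div><h1>Title</h1></div></body></html>")
dom = builder.root
```

When the final `</html>` is consumed, the stack is empty and `dom` holds the completed tree, matching the termination condition described above.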
- the DOM tree is traversed to obtain the corresponding node sequence.
- a depth-first or breadth-first search algorithm can be used to traverse the DOM tree to obtain a corresponding node sequence, wherein the node sequence includes a root node and multiple leaf nodes; in the embodiments of this application, the leaf nodes are also referred to as label nodes.
- a label node represents a webpage tag in the HTML source code, such as <h1>.
- in step S104 of some embodiments, multiple node paths of the node sequence are acquired, wherein each node path is a path from a label node to the root node. For example, if a label node is <h1>, its path to the root node <html> can be expressed as "/html/body/div/h1".
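The root-to-label-node paths can be collected with a depth-first traversal; a minimal sketch using the standard library's xml.etree.ElementTree on an invented, well-formed sample document:

```python
import xml.etree.ElementTree as ET

def node_paths(root):
    """Depth-first traversal collecting, for every leaf (label) node,
    its path from the root, rendered as '/root/.../leaf'."""
    paths = []

    def dfs(node, prefix):
        current = prefix + "/" + node.tag
        children = list(node)
        if not children:  # a leaf, i.e. a label node
            paths.append(current)
        for child in children:
            dfs(child, current)

    dfs(root, "")
    return paths

tree = ET.fromstring(
    "<html><body><div><h1>Title</h1><p>Body text</p></div></body></html>"
)
paths = node_paths(tree)
```

For this sample document the traversal yields exactly the two label-node paths "/html/body/div/h1" and "/html/body/div/p".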
- the first target path is obtained from a preset sample set according to the multiple node paths, wherein the preset sample set is a set of pre-collected sample paths, and each first target path is a sample path that is the same as, or similar to, a certain node path; the first target path is used as the input of the pre-training model.
- the method of obtaining the sample paths can refer to steps S101 to S104 in the embodiments of this application; the data source corresponding to the sample paths can be news webpages, including public opinion news and policy news, etc., where public opinion news and policy news correspond to multiple webpage sources, that is, the source code data corresponding to multiple webpages.
- the first target path is input into the pre-trained model for path screening processing to obtain the second target path, where there may be one or multiple second target paths. It should be noted that not every first target path can parse out the required webpage data.
- the purpose of the pre-training model is to select, from the first target paths, the second target paths from which webpage data, such as the text of the webpage, can be effectively extracted, ensuring that the webpage data extracted according to the second target path is the data corresponding to the text of the webpage.
- the corresponding target webpage data is extracted from the source code data according to the second target path. Specifically, the second target path is restored onto the corresponding DOM tree by post-order traversal, and the corresponding target webpage data, such as the text, title, and time of the target webpage, is output in sequence.
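As an illustration of extracting data by a target path (a simplified stand-in for the application's post-order restoration step), a node path such as "/html/body/div/h1" can be turned into an ElementTree query; the sample HTML below is invented:

```python
import xml.etree.ElementTree as ET

def extract_by_path(source, target_path):
    """Extract the text of the nodes addressed by a root-to-leaf node path
    such as '/html/body/div/h1'. ElementTree supports only a subset of
    XPath, so the leading root tag is stripped and the remainder is
    queried relative to the root."""
    root = ET.fromstring(source)
    parts = target_path.strip("/").split("/")
    assert parts[0] == root.tag, "path must start at the root node"
    relative = "./" + "/".join(parts[1:])
    return [node.text for node in root.findall(relative)]

html = ("<html><body><div><h1>Breaking News</h1>"
        "<p>First paragraph.</p></div></body></html>")
titles = extract_by_path(html, "/html/body/div/h1")
```

Here the second target path "/html/body/div/h1" pulls out the title text, while a "/html/body/div/p" path would pull out the paragraph text.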
- step S105 specifically includes, but is not limited to, steps S201 to S202.
- Step S201 acquiring a sample set
- Step S202 acquiring the same sample path as the node path from the sample set as the first target path.
- a sample set collected in advance is obtained, wherein the sample set includes a plurality of sample paths; the sample paths are used to build a pre-training model. After obtaining the node path of the target web page, it first needs to be matched with the sample paths in the sample set.
- in step S202 of some embodiments, if a sample path identical to the node path is found in the sample set, the node path is used as the first target path; if no sample path identical to the node path is found in the sample set, steps S301 to S303 are executed. It should be noted that a node path cannot be used directly as the input of the pre-training model: a sample path that is the same as, or similar to, each node path must first be found in the sample set, and this first target path is used as the input of the pre-training model, so as to ensure that the pre-training model, trained on the first target path, can output the corresponding webpage data.
- the method for extracting web page data in this embodiment of the present application further includes, but is not limited to, steps S301 to S303.
- Step S301 obtaining a first path from multiple node paths
- Step S302 calculating the similarity between the first path and each sample path
- Step S303 taking the sample path corresponding to the maximum similarity as the first target path.
- in step S301 of some embodiments, a first path different from all sample paths in the sample set is found from the multiple node paths.
- in step S302 of some embodiments, for each first path, the similarity between the first path and each sample path is calculated, wherein the similarity is obtained by comparing each path node in the first path with the corresponding sample node in each sample path; after the similarities between all the first paths and each sample path are calculated, multiple similarity values are obtained.
- the sample path corresponding to the maximum similarity is used as the first target path; in other words, the sample path with the maximum similarity to the node path is selected from the sample set as the first target path, and the first target path is used as the input of the pre-trained model. It should be noted that if no similar first target path is found in the sample set to replace the node path, the corresponding webpage data may not be extracted, since the pre-training model has not been trained on that node path in advance, which affects the accuracy of webpage data extraction.
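The application does not fix an exact similarity formula; one plausible reading of the node-by-node comparison above, position-wise node agreement normalised by the longer path's length, can be sketched as:

```python
def path_similarity(path_a, path_b):
    """Position-wise node agreement between two node paths, normalised by
    the longer path's length. This is an illustrative assumption, not the
    application's exact formula."""
    nodes_a = path_a.strip("/").split("/")
    nodes_b = path_b.strip("/").split("/")
    matches = sum(1 for a, b in zip(nodes_a, nodes_b) if a == b)
    return matches / max(len(nodes_a), len(nodes_b))

def best_sample_path(first_path, sample_set):
    """Return the sample path with maximum similarity to the given path."""
    return max(sample_set, key=lambda s: path_similarity(first_path, s))

samples = ["/html/body/div/h1", "/html/body/div/p", "/html/head/title"]
target = best_sample_path("/html/body/section/h1", samples)
```

For the unseen path "/html/body/section/h1", the closest sample is "/html/body/div/h1" (three of four nodes agree), which would then serve as the first target path.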
- the method for extracting webpage data in the embodiment of the present application further includes: building a pre-training model, specifically including but not limited to steps S401 to S404.
- Step S401 obtaining training samples
- Step S402 inputting the sample sequence and sample features into the original training model
- Step S403 calculating the loss function of the original training model according to the sample sequence and sample features to obtain a loss value;
- Step S404 updating the original training model according to the loss value to obtain a pre-training model.
- a training sample is obtained, wherein the training sample includes a sample sequence and corresponding sample features.
- sample data is collected first: the source code data of multiple webpages is collected, the source code data is parsed to obtain a DOM tree (including a parent node and multiple sample nodes), and the DOM tree is traversed to obtain the sample sequences x1, x2, …
- DOM tree including a parent node and multiple sample nodes
- the most important sample feature in the embodiments of the present application is the path from the parent node (generally an html tag) in the DOM tree to the current sample node; for example, if the current sample node is x1, the label sequence corresponding to x1 may be "/html/body/div/h1", where each sample node corresponds to a sample label, and the sample label corresponding to sample node x1 in the above example is "h1".
- each sample sequence is analogous to a certain word in an English sentence, and the sample labels in the sample sequence correspond to the letters in a certain word.
- the sample features of the embodiments of the present application can also include additional features, which are extracted from the text data corresponding to the current node, such as the number of punctuation marks, the number of function words, whether it contains the "h1" tag, whether it contains the "p" tag, the vector representation of the text, etc. Additional features have a strong correlation with the category of the sample node; for example, titles are generally in tags such as "h1" and "h2". The vector representation of the text can be obtained using the open-source text representation tool text2vec.
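A hedged sketch of such additional features; the function-word list and the sample text below are illustrative assumptions, not the application's actual feature set:

```python
import string

def extract_features(text, tag_path):
    """Illustrative node features: punctuation count, a crude function-word
    count, and flags for the 'h1' and 'p' tags in the node's path. The
    function-word list is a tiny sample invented for this sketch."""
    function_words = {"the", "a", "an", "of", "in", "on", "and", "or"}
    words = text.lower().split()
    return {
        "punct_count": sum(1 for ch in text if ch in string.punctuation),
        "function_word_count": sum(1 for w in words if w in function_words),
        "has_h1": "h1" in tag_path.split("/"),
        "has_p": "p" in tag_path.split("/"),
    }

features = extract_features("The quick brown fox, it jumps.", "/html/body/div/h1")
```

In a real pipeline these features would be concatenated with the text2vec text vector before being fed to the model.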
- In some embodiments, in addition to extracting the corresponding sample features from the sample sequence, the sequence also needs to be annotated. Specifically, if the webpage data to be extracted by the pre-training model is set to the title, time, and body of the webpage, then before the model is built, the three fields title, time, and body must be labeled in the sample sequence to obtain an annotated sample sequence. In practical applications, tools such as label-studio can be used for manual annotation.
- In step S402 of some embodiments, the sample sequence and the sample features are input into the original training model.
- In the embodiments of this application, the backbone of the original training model is BiLSTM+CRF.
- In step S403 of some embodiments, the loss function of the original training model is calculated according to the sample sequence and the sample features to obtain a loss value.
- In the embodiments of this application, the loss function used is specifically the CRF loss function.
- In step S404 of some embodiments, the original training model is updated according to the loss value to obtain the pre-training model. Specifically, during training, the loss function of the original training model is corrected so that the model is trained according to the target loss value and optimized toward the new objective; the optimized original training model is the pre-training model referred to in the embodiments of this application.
- As shown in FIG. 5, step S403 specifically includes, but is not limited to, steps S501 to S505.
- Step S501: encoding the sample sequence to obtain a sequence vector, and encoding the sample features to obtain a feature vector;
- Step S502: concatenating the sequence vector and the feature vector to obtain a concatenated vector;
- Step S503: screening the concatenated vector according to a preset screening rate to obtain a screened vector;
- Step S504: performing field classification on the screened vector according to preset classification fields to obtain corresponding classification data;
- Step S505: calculating the loss function of the original training model according to the classification data to obtain the loss value.
- In step S501 of some embodiments, the sample sequence is encoded to obtain a sequence vector, and the sample features are encoded to obtain a feature vector.
- Specifically, the embedding layer of the original training model maps the sample sequence x1, x2, ..., xn to E(x1), E(x2), ..., E(xn), i.e., the sequence vector. The dimension of each sequence vector also needs to be set, for example to 50 or 150; the dimension is a prior choice. In practical applications it should not be set too large, which would cause overfitting, nor too small, which would cause underfitting.
- In step S502 of some embodiments, the feature vector is concatenated with the sequence vector E(xn) to obtain the concatenated vector Econcat(xi).
- In step S503 of some embodiments, the concatenated vector Econcat(xi) is input into the dropout layer of the original training model, and the dropout layer screens the concatenated vector according to the screening rate to obtain the screened vector. Specifically, the dropout layer randomly sets some neurons to 0 according to the screening rate; this step serves as regularization.
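The screening performed by the dropout layer can be sketched in plain Python as follows; this is the standard "inverted dropout" formulation, assumed here as one reasonable reading of the screening rate, with a fixed seed only for reproducibility:

```python
import random

def dropout(vector, rate, *, rng=random.Random(0), training=True):
    """Zeroes each component with probability `rate` and rescales the
    survivors by 1/(1-rate), so the expected value of each component is
    unchanged. At inference time the vector passes through untouched."""
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

v = [1.0] * 10
screened = dropout(v, rate=0.3)  # roughly 30% of entries become 0.0
```

In a real model this happens inside the framework's dropout layer; the sketch only illustrates why the step acts as regularization: surviving activations cannot rely on any one neuron always being present.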
- In step S504 of some embodiments, field classification is performed on the screened vector according to the preset classification fields to obtain the corresponding classification data. Specifically, the screened vector obtained in step S503 is input into the BiLSTM layer of the original training model, with the dimension of the BiLSTM hidden layer set to, for example, 150; this is followed by another dropout layer with a set screening rate, and then a fully connected layer combines the preceding vectors before they enter the CRF layer. Based on the preset classification fields and the information previously annotated on the sample sequence, such as title, time, and body, the CRF layer outputs classification data of three categories.
- In step S505 of some embodiments, the loss function of the original training model is calculated according to the classification data to obtain the loss value.
- The loss function of the original training model may be chosen as the CRF loss function. After the loss value is calculated, backpropagation is performed to adjust the weights of the neural networks in the original training model, so as to obtain a well-trained pre-training model.
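The CRF loss named above is the negative log-likelihood of the annotated tag path: the log of the partition function (a sum over all tag paths, computed with the forward algorithm) minus the score of the gold path. A small self-contained sketch, with toy scores chosen purely for illustration:

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def crf_nll(emissions, transitions, tags):
    """CRF negative log-likelihood for one sequence.
    emissions:   [T][K] unary scores (e.g. from the BiLSTM) for K labels
    transitions: [K][K] score of moving from label i to label j
    tags:        gold label indices, length T
    Loss = log(partition) - score(gold path); minimized during training."""
    T, K = len(emissions), len(emissions[0])
    # score of the annotated (gold) path
    gold = emissions[0][tags[0]]
    for t in range(1, T):
        gold += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # forward algorithm: log-sum-exp over all possible paths
    alpha = list(emissions[0])
    for t in range(1, T):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(K)])
            + emissions[t][j]
            for j in range(K)
        ]
    return logsumexp(alpha) - gold

# toy example with K = 3 labels, e.g. title / time / body
emissions = [[2.0, 0.1, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 1.8]]
transitions = [[0.1, 0.5, -0.2], [-0.3, 0.1, 0.6], [0.2, -0.1, 0.1]]
loss = crf_nll(emissions, transitions, tags=[0, 1, 2])
```

Because the gold path is one of the paths summed in the partition, the loss is always positive, and a less plausible gold labeling yields a larger loss.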
- In practical applications, the gradient of the loss function with respect to each parameter is calculated, and the parameters are then updated according to the rules set by the optimizer, based on the gradient value of each parameter and the learning rate.
- Specifically, the Adam optimizer is used to train the original training model.
- The number of samples per batch may be set to 32 and the learning rate to 0.001, and the R-Drop technique is used to add a penalty term to the original training model.
- In some embodiments, analysis of the information of a webpage shows that the webpage contains a large amount of noise content irrelevant to its subject, such as copyright information, advertisement links, and navigation bars. During webpage data extraction, this webpage noise affects the extraction result, so the webpage needs to be preprocessed by denoising.
- In some embodiments, the sample sequence includes a parent node and a plurality of sample nodes, and each sample node includes a webpage tag. As shown in FIG. 6, before step S402, the webpage data extraction method of the embodiments of this application further includes, but is not limited to, steps S601 to S604.
- Step S601: acquiring multiple sample paths of the sample sequence;
- Step S602: acquiring preset irrelevant tags;
- Step S603: obtaining a second path from the multiple sample paths according to the irrelevant tags;
- Step S604: deleting the sample node corresponding to the second path, so as to update the sample sequence.
- In step S601 of some embodiments, multiple sample paths of the sample sequence are acquired, where each sample path is the path from a sample node to the parent node.
- In step S602 of some embodiments, the preset irrelevant tags are acquired, where irrelevant tags are tags unrelated to webpage data extraction, such as the "img" tag for images, the "script" tag defining client-side scripts, the "video" tag for videos, and comment tags.
- In step S603 of some embodiments, for each sample path, each webpage tag under the sample path is checked for being an irrelevant tag; if one or more webpage tags under a sample path are irrelevant tags, the sample path is marked as a second path.
- In step S604 of some embodiments, the sample node corresponding to the second path is deleted, so as to update the sample sequence. Since irrelevant tags have low correlation with the subject content of the webpage, this part of the content is filtered out before the original training model is trained, removing irrelevant noise and thereby improving the accuracy of webpage data extraction.
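The denoising in steps S601 to S604 amounts to partitioning the sample paths by whether they contain an irrelevant tag. A minimal sketch, where the set of irrelevant tags is an illustrative assumption:

```python
IRRELEVANT_TAGS = {"img", "script", "video", "style"}  # assumed denoising set

def screen_sample_paths(paths):
    """Splits sample paths into kept paths and 'second paths' (those
    containing at least one irrelevant tag); the nodes behind the second
    paths would then be deleted from the sample sequence."""
    kept, second = [], []
    for path in paths:
        tags = [t for t in path.split("/") if t]
        (second if any(t in IRRELEVANT_TAGS for t in tags) else kept).append(path)
    return kept, second

paths = ["/html/body/div/h1", "/html/body/script", "/html/body/div/img"]
kept, second = screen_sample_paths(paths)
```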
- In some embodiments, as shown in FIG. 7, after step S107, the webpage data extraction method of the embodiments of this application further includes, but is not limited to, steps S701 to S703.
- In step S701 of some embodiments, webpage time data representing time in the target webpage data is obtained.
- In step S702 of some embodiments, the webpage time data is standardized according to a preset data format to obtain standard time data. For example, if the preset data format is "year/month/day" and the extracted webpage time data is "2021-10-24 17:12:00", the webpage time data is adjusted according to the "year/month/day" format to obtain the standard time data, namely "2021/10/24".
- In step S703 of some embodiments, the webpage time data is updated to the standard time data.
- The embodiments of this application standardize the webpage time data to facilitate subsequent database storage.
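The standardization above can be sketched with the standard library's datetime module; the list of accepted input layouts is an assumption, since real pages use many more formats:

```python
from datetime import datetime

def standardize_time(raw, out_format="%Y/%m/%d"):
    """Parses webpage time data against a few common layouts and
    re-renders it in the preset "year/month/day" format."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y/%m/%d"):
        try:
            return datetime.strptime(raw, fmt).strftime(out_format)
        except ValueError:
            continue
    raise ValueError(f"unrecognised time data: {raw!r}")

standard = standardize_time("2021-10-24 17:12:00")  # -> "2021/10/24"
```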
- In some embodiments, besides extracting webpage data with the BiLSTM+CRF model, the embodiments of this application also combine the open-source GNE module to extract webpage data.
- The purpose of combining the GNE module with the BiLSTM+CRF model is to prevent the webpage data extracted by the BiLSTM+CRF model from being incomplete, for example, when the body text cannot be extracted.
- In that case, the GNE module can be used to extract the corresponding body text, ensuring that the webpage data corresponding to the target webpage can be extracted completely.
- By combining traditional statistical methods with methods based on deep learning, the embodiments of this application further improve the accuracy of webpage data extraction.
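One way to read this fallback is a simple field-level merge: fields the model leaves empty are filled from the secondary extractor's output. The sketch below uses plain dicts with illustrative field names; in a real pipeline the second dict would come from the GNE module:

```python
def merge_extractions(model_result, gne_result):
    """Fills fields that the primary (BiLSTM+CRF) extraction left empty
    with the output of a secondary extractor, e.g. the GNE module."""
    merged = dict(model_result)
    for field in ("title", "time", "body"):
        if not merged.get(field):
            merged[field] = gne_result.get(field, "")
    return merged

model_out = {"title": "Breaking News", "time": "2021/10/24", "body": ""}
gne_out = {"title": "Breaking News", "body": "Full story text..."}
result = merge_extractions(model_out, gne_out)
```

This preserves the model's answers where they exist and only falls back where extraction failed, matching the stated goal of completeness.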
- The webpage data extraction method proposed in the embodiments of this application obtains the source code data of the target webpage and parses it to obtain the corresponding DOM tree; traverses the DOM tree to obtain the corresponding node sequence, where the node sequence includes a root node and multiple tag nodes; acquires multiple node paths of the node sequence, where each node path is the path from a tag node to the root node; obtains a first target path from a preset sample set according to the multiple node paths, and inputs the first target path into the pre-training model for path screening to obtain a second target path; and extracts the corresponding target webpage data from the source code data according to the second target path.
- The pre-training model analyzes the tag nodes in the first target path, so that for webpages of the same type the second target path can be screened out of the first target path by the same pre-training model; the target webpage data can then be extracted directly from the source code data through the second target path, without manually constructing a dedicated path template, thereby improving the efficiency of webpage data extraction.
- The embodiments of this application also provide an apparatus for extracting webpage data, as shown in FIG. 8, which can implement the above webpage data extraction method.
- The apparatus includes: a first acquisition module 801, a data parsing module 802, a traversal module 803, a second acquisition module 804, a third acquisition module 805, a path screening module 806, and a data extraction module 807. The first acquisition module 801 is used to acquire the source code data of the target webpage;
- the data parsing module 802 is used to parse the source code data to obtain the corresponding DOM tree;
- the traversal module 803 is used to traverse the DOM tree to obtain the corresponding node sequence, wherein the node sequence includes a root node and a plurality of tag nodes;
- the second acquisition module 804 is used to acquire multiple node paths of the node sequence, wherein each node path is the path from a tag node to the root node;
- the third acquisition module 805 is used to obtain a first target path from a preset sample set according to the multiple node paths;
- the path screening module 806 is used to input the first target path into the pre-training model for path screening to obtain a second target path; and the data extraction module 807 extracts the corresponding target webpage data from the source code data according to the second target path.
- The apparatus for extracting webpage data in the embodiments of this application is used to execute the method for extracting webpage data in the above embodiments; its specific processing is the same as that of the method and is not repeated here.
- The embodiments of this application also provide a computer device, including:
- at least one processor; and
- a memory communicatively connected to the at least one processor; wherein the memory stores instructions that are executed by the at least one processor, so that when executing the instructions the at least one processor implements a method for extracting webpage data, the method including:
- acquiring source code data of a target webpage; parsing the source code data to obtain a corresponding DOM tree; and traversing the DOM tree to obtain a corresponding node sequence, wherein the node sequence includes a root node and multiple tag nodes;
- acquiring multiple node paths of the node sequence, wherein each node path is the path from a tag node to the root node; obtaining a first target path from a preset sample set according to the multiple node paths; and inputting the first target path into a pre-training model for path screening to obtain a second target path;
- extracting corresponding target webpage data from the source code data according to the second target path.
- The hardware structure of the computer device is described in detail below with reference to FIG. 9. The computer device includes: a processor 901, a memory 902, an input/output interface 903, a communication interface 904, and a bus 905.
- The processor 901 may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs to realize the technical solutions provided by the embodiments of this application;
- The memory 902 may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
- The memory 902 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 902 and called by the processor 901 to execute the webpage data extraction method of the embodiments of this application;
- the input/output interface 903 is used to realize information input and output
- the communication interface 904 is used for communication between this device and other devices, which may be wired (e.g., USB, network cable) or wireless (e.g., mobile network, Wi-Fi, Bluetooth); and
- bus 905 for transferring information between various components of the device (such as processor 901, memory 902, input/output interface 903 and communication interface 904);
- the processor 901 , the memory 902 , the input/output interface 903 and the communication interface 904 are connected to each other within the device through the bus 905 .
- The embodiments of this application also provide a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute a method for extracting webpage data, the method including:
- acquiring source code data of a target webpage; parsing the source code data to obtain a corresponding DOM tree; and traversing the DOM tree to obtain a corresponding node sequence, wherein the node sequence includes a root node and multiple tag nodes;
- acquiring multiple node paths of the node sequence, wherein each node path is the path from a tag node to the root node; obtaining a first target path from a preset sample set according to the multiple node paths; and inputting the first target path into a pre-training model for path screening to obtain a second target path;
- extracting corresponding target webpage data from the source code data according to the second target path.
- the computer-readable storage medium may be non-volatile or volatile.
- memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
- the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
- the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
- With the webpage data extraction method, webpage data extraction apparatus, computer device, and storage medium proposed in the embodiments of this application, the source code data of the target webpage is obtained and parsed to obtain the corresponding DOM tree; the DOM tree is traversed to obtain the corresponding node sequence, and multiple node paths of the node sequence are acquired, where each node path is the path from a tag node of the node sequence to the root node; a first target path is obtained from a preset sample set according to the multiple node paths and input into the pre-training model for path screening to obtain a second target path, and the corresponding target webpage data is extracted from the source code data according to the second target path. The pre-training model can thus analyze the tag nodes in the first target path, so that for webpages of the same type the second target path can be screened out of the first target path by the same pre-training model, and the target webpage data can be extracted directly from the source code data through the second target path, without manually constructing a dedicated path template, thereby improving the efficiency of webpage data extraction.
Abstract
This embodiment provides a method and apparatus for extracting webpage data, a computer device, and a storage medium, belonging to the field of artificial intelligence technology. The method includes: acquiring source code data of a target webpage and parsing it to obtain a DOM tree; traversing the DOM tree to obtain a node sequence, the node sequence including a root node and multiple tag nodes; acquiring multiple node paths of the node sequence, each node path being the path from a tag node to the root node; obtaining a first target path from a preset sample set according to the multiple node paths, and inputting the first target path into a pre-training model for screening to obtain a second target path; and extracting corresponding target webpage data from the source code data according to the second target path. The pre-training model analyzes the tag nodes, and the second target path is screened out of the first target path for webpages of the same type, so that the target webpage data is extracted without manually constructing a dedicated path template, improving the efficiency of webpage data extraction.
Description
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on February 16, 2022, with application number 202210143571.5 and the invention title "Method and apparatus for extracting webpage data, computer device, and storage medium", the entire contents of which are incorporated herein by reference.

This application relates to the field of artificial intelligence technology, and in particular to a method and apparatus for extracting webpage data, a computer device, and a storage medium.

With the development of Internet technology, users' demands on network information keep growing; for example, users need to extract relevant webpage data from webpages. Usually, extracting webpage data requires manually configuring a path template for the corresponding webpage, and the webpage data in the corresponding webpage is extracted through the configured path template.

The following is a technical problem of the prior art recognized by the inventors: extracting webpage data by manually configuring path templates is inefficient.
In a first aspect, an embodiment of this application proposes a method for extracting webpage data, the method including:

acquiring source code data of a target webpage;

parsing the source code data to obtain a corresponding DOM tree;

traversing the DOM tree to obtain a corresponding node sequence, wherein the node sequence includes a root node and multiple tag nodes;

acquiring multiple node paths of the node sequence, wherein each node path is the path from a tag node to the root node;

obtaining a first target path from a preset sample set according to the multiple node paths;

inputting the first target path into a pre-training model for path screening to obtain a second target path;

extracting corresponding target webpage data from the source code data according to the second target path.
In a second aspect, an embodiment of this application proposes an apparatus for extracting webpage data, including:

a first acquisition module, configured to acquire source code data of a target webpage;

a data parsing module, configured to parse the source code data to obtain a corresponding DOM tree;

a traversal module, configured to traverse the DOM tree to obtain a corresponding node sequence, wherein the node sequence includes a root node and multiple tag nodes;

a second acquisition module, configured to acquire multiple node paths of the node sequence, wherein each node path is the path from a tag node to the root node;

a third acquisition module, configured to obtain a first target path from a preset sample set according to the multiple node paths;

a path screening module, configured to input the first target path into a pre-training model for path screening to obtain a second target path;

a data extraction module, configured to extract corresponding target webpage data from the source code data according to the second target path.
In a third aspect, an embodiment of this application proposes a computer device, the computer device including a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor is configured to perform a method for extracting webpage data, the method including:

acquiring source code data of a target webpage;

parsing the source code data to obtain a corresponding DOM tree;

traversing the DOM tree to obtain a corresponding node sequence, wherein the node sequence includes a root node and multiple tag nodes;

acquiring multiple node paths of the node sequence, wherein each node path is the path from a tag node to the root node;

obtaining a first target path from a preset sample set according to the multiple node paths;

inputting the first target path into a pre-training model for path screening to obtain a second target path;

extracting corresponding target webpage data from the source code data according to the second target path.
In a fourth aspect, an embodiment of this application proposes a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute a method for extracting webpage data, the method including:

acquiring source code data of a target webpage;

parsing the source code data to obtain a corresponding DOM tree;

traversing the DOM tree to obtain a corresponding node sequence, wherein the node sequence includes a root node and multiple tag nodes;

acquiring multiple node paths of the node sequence, wherein each node path is the path from a tag node to the root node;

obtaining a first target path from a preset sample set according to the multiple node paths;

inputting the first target path into a pre-training model for path screening to obtain a second target path;

extracting corresponding target webpage data from the source code data according to the second target path.
With the method and apparatus for extracting webpage data, the computer device, and the storage medium proposed in the embodiments of this application, the source code data of the target webpage is acquired and parsed to obtain the corresponding DOM tree; the DOM tree is traversed to obtain the corresponding node sequence, where the node sequence includes a root node and multiple tag nodes; multiple node paths of the node sequence are acquired, where each node path is the path from a tag node to the root node; a first target path is obtained from a preset sample set according to the multiple node paths and input into a pre-training model for path screening to obtain a second target path; and the corresponding target webpage data is extracted from the source code data according to the second target path. By analyzing the tag nodes in the first target path with the pre-training model, the second target path can be screened out of the first target path by the same pre-training model for webpages of the same type, and the target webpage data can be extracted directly from the source code data through the second target path without manually constructing a dedicated path template, thereby improving the efficiency of webpage data extraction.
FIG. 1 is a first flowchart of the method for extracting webpage data provided by an embodiment of this application;

FIG. 2 is a flowchart of step S105 in FIG. 1;

FIG. 3 is a second flowchart of the method for extracting webpage data provided by an embodiment of this application;

FIG. 4 is a third flowchart of the method for extracting webpage data provided by an embodiment of this application;

FIG. 5 is a flowchart of step S403 in FIG. 4;

FIG. 6 is a fourth flowchart of the method for extracting webpage data provided by an embodiment of this application;

FIG. 7 is a fifth flowchart of the method for extracting webpage data provided by an embodiment of this application;

FIG. 8 is a block diagram of the modules of the apparatus for extracting webpage data provided by an embodiment of this application;

FIG. 9 is a schematic diagram of the hardware structure of the computer device provided by an embodiment of this application.
To make the objectives, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application, not to limit it.

It should be noted that although functional modules are divided in the apparatus schematic and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus, or in an order different from that in the flowcharts. The terms "first", "second", and the like in the specification, claims, and drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence.

Unless otherwise defined, all technical and scientific terms used here have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used here are only for the purpose of describing the embodiments of this application and are not intended to limit it.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, many specific details are provided to give a full understanding of the embodiments of this application. However, those skilled in the art will realize that the technical solutions of this application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so on. In other cases, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of this application.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are only illustrative; they need not include all contents and operations/steps, nor be executed in the order described. For example, some operations/steps may be decomposed while others may be merged or partially merged, so the actual execution order may change according to the actual situation.
First, several terms involved in this application are explained:

Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. As a branch of computer science, artificial intelligence attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.

Hypertext Markup Language (HTML): a markup language comprising a series of tags, through which document formats on the network can be unified so that scattered Internet resources are connected into a logical whole. HTML text is descriptive text composed of HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. Hypertext is a way of organizing information that uses hyperlinks to associate the text, charts, and other information media in a document; these associated media may be in the same document, in other files, or in files on a computer in a geographically distant location.

XML Path Language (XPath): a language for locating parts of an XML document. Based on the tree structure of XML, XPath provides the ability to find nodes in the data structure tree. Initially XPath was regarded as a general syntax model between XPointer and XSL; currently it is adopted by developers as a small query language. XPath selects nodes in an XML document using path expressions; nodes are selected by following a path or steps.

LXML: a third-party parsing library for Python that provides good support for XPath expressions and can therefore parse HTML and XML documents efficiently.

Python: a programming language that provides efficient high-level data structures and enables simple and effective object-oriented programming.

Web crawler: a program or script that automatically crawls information from the World Wide Web according to certain rules.

Document Object Model (DOM): a standard interface specification formulated by the W3C, and a standard API for processing HTML and XML documents. DOM provides an access model for the entire document, treating the document as a tree structure in which each node represents an HTML tag or a text item within a tag. The DOM tree structure precisely describes the interrelations among tags in an HTML document. Converting an HTML or XML document into a DOM tree is called parsing; after an HTML document is parsed into a DOM tree, processing of the HTML document can be achieved by operating on the DOM tree. The DOM model not only describes the structure of the document but also defines the behavior of node objects; using the methods and properties of objects, the nodes and contents of the DOM tree can be conveniently accessed, modified, added, and deleted.

label-studio: a data annotation tool used to connect data import and data annotation and to call models to train on the annotated data.

text2vec: provides a simple and efficient API framework for text analysis and natural language processing. Because it is written in C++, and many parts (such as GloVe) make full use of packages such as RcppParallel for parallelization, its processing speed is accelerated. It also adopts stream processing, so analysis does not require loading all the data into memory, making effective use of memory; the package fully takes into account the large data volumes handled in NLP.

Encoder: encoding converts an input sequence into a fixed-length vector; decoding (decoder) converts the previously generated fixed vector back into an output sequence. The input sequence may be text, speech, images, or video; the output sequence may be text or images.

BiLSTM (Bi-directional Long Short-Term Memory): a combination of a forward LSTM and a backward LSTM. It is well suited to sequence labeling tasks with contextual dependencies and is therefore often used in NLP to model context.

Conditional random field (CRF): a discriminative probabilistic model and a type of random field, commonly used for labeling or analyzing sequence data such as natural language text or biological sequences.

Embedding: a vector representation in which an object, which may be a word, a product, a movie, and so on, is represented by a low-dimensional vector. The property of an embedding vector is that objects whose vectors are close together have similar meanings; for example, embedding(The Avengers) and embedding(Iron Man) are close, while embedding(The Avengers) and embedding(Gone with the Wind) are farther apart. An embedding is essentially a mapping from a semantic space to a vector space that preserves, as far as possible, the relations the original samples have in the semantic space; for example, two semantically close words are also close in the vector space. Embeddings can encode objects with low-dimensional vectors while preserving their meanings, and are often applied in machine learning: during model construction, objects are encoded as low-dimensional dense vectors and passed to a DNN to improve efficiency.

Dropout: a technique for preventing model overfitting. During the training of a deep learning network, neural units are temporarily dropped from the network with a certain probability, which makes the model more robust because it does not depend too heavily on particular local features (since those local features may be dropped).

Fully connected layer: each node of a fully connected layer is connected to all nodes of the previous layer and is used to combine the features extracted earlier. Because of this full connectivity, fully connected layers generally have the most parameters. For example, in VGG16 the first fully connected layer FC1 has 4096 nodes and the previous layer POOL2 has 7*7*512=25088 nodes, so this connection requires 4096*25088 weights and consumes a large amount of memory.

Adam optimizer: combines the advantages of the AdaGrad and RMSProp optimization algorithms, jointly considering the first-moment and second-moment estimates of the gradient to compute the update step.

R-Drop: unlike traditional constraint methods that act on neurons or model parameters, R-Drop acts on the output layer of the model, compensating for the inconsistency of dropout between training and testing. In each mini-batch, every data sample is passed twice through the same model with dropout, and R-Drop uses KL divergence to constrain the two outputs to be consistent. R-Drop thus constrains the output consistency of the two random sub-models produced by dropout.

GNE (GeneralNewsExtractor): a general module for extracting the main content of news websites. Given the HTML of a news webpage, it outputs the body content, title, author, publication time, the addresses of images in the body, and the source code of the tag containing the body.
With the development of Internet technology, users' demands on network information keep growing; for example, users need to extract certain webpage data from webpages. Extracting webpage data usually requires manually configuring different path templates for different webpages and extracting the webpage data of the corresponding webpage through the configured template. However, extracting webpage data by manually configuring path templates incurs enormous labor costs, and the extraction efficiency is low.

Based on this, the embodiments of this application provide a method and apparatus for extracting webpage data, a computer device, and a storage medium, which can improve the efficiency of webpage data extraction.

The embodiments of this application provide the method and apparatus for extracting webpage data, the computer device, and the storage medium, which are specifically described through the following embodiments; the method for extracting webpage data in the embodiments of this application is described first.

The embodiments of this application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.

Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. AI software technologies mainly include several major directions: computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.

The method for extracting webpage data provided by the embodiments of this application relates to the field of artificial intelligence. It may be applied in a terminal, on a server side, or as software running in a terminal or on a server side. In some embodiments, the terminal may be a smartphone, a tablet, a laptop, a desktop computer, a smart watch, or the like; the server side may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application implementing the method for extracting webpage data, but is not limited to the above forms.

The embodiments of this application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. This application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network; in a distributed computing environment, program modules may be located in local and remote computer storage media including storage devices.
Referring to FIG. 1, the method for extracting webpage data according to the embodiments of the first aspect of this application includes, but is not limited to, steps S101 to S107.

Step S101: acquiring source code data of a target webpage;

Step S102: parsing the source code data to obtain a corresponding DOM tree;

Step S103: traversing the DOM tree to obtain a corresponding node sequence;

Step S104: acquiring multiple node paths of the node sequence;

Step S105: obtaining a first target path from a preset sample set according to the multiple node paths;

Step S106: inputting the first target path into a pre-training model for path screening to obtain a second target path;

Step S107: extracting corresponding target webpage data from the source code data according to the second target path.
In step S101 of some embodiments, the source code data of the target webpage is acquired. The target webpage is the webpage from which the user needs to extract data, such as a news webpage, and the data to be extracted includes, for example, the title, time, and body of the webpage. The source code data is the HTML source code corresponding to the target webpage, including a series of webpage tags through which the document formats on the page can be unified, connecting scattered Internet resources into a logical whole. The HTML source code also includes script data, style sheet data, and many types of attribute values, including but not limited to IDs, names, numbers, length units, languages, media descriptors, colors, character encodings, dates, and times. In practical applications, a web crawler tool can be used to crawl the HTML source code corresponding to the URL of the target webpage.

In step S102 of some embodiments, the source code data is parsed to obtain the corresponding DOM tree. This can be divided into two steps, tag parsing and DOM tree construction, as follows:

Tag parsing step: this step parses the webpage tags out of the HTML source code, mainly using a tokenization algorithm. It should be noted that the tokenization algorithm takes HTML markup as input and is represented by a state machine. The state machine has four states in total: the data state, the tag open state, the tag name state, and the close tag open state.

Specifically, the initial state of the state machine is the data state. When the character "<" is encountered in the data state, the state changes to the tag open state; when a character in the range "a" to "z" is received, a start tag is created and the state changes to the tag name state, which is maintained until the character ">" is received. The characters received during this period form a new tag name; when ">" is received, the current new tag is sent to the tree constructor, and the state machine returns to the data state. When the next input character "/" is received, the state machine creates a close tag open state and changes to the tag name state until the character ">" is received, after which the current new tag is sent to the tree constructor and the state machine returns to the data state. In addition, when the state machine is in the data state and encounters characters "a" to "z", each character is turned into a character token and sent to the tree constructor.

DOM tree construction step: after the tag parser parses out a webpage tag, it sends the tag to the DOM tree constructor, which mainly consists of a DOM tree and a stack for storing webpage tag names. Specifically, when the DOM tree constructor receives a start tag name from the tag parser, it pushes it onto the stack. Suppose the stack currently stores the three tags <html>, <body>, and <h1> in order; parsing continues, and when a </h1> is received from the state machine, since </h1> is an end tag, the tags in the stack are checked: if the tag at the top of the stack is of the same type as the incoming end tag, e.g., <h1>, that tag is popped and the corresponding node is added to the DOM tree, after which parsing continues. When the stack is empty, i.e., the <html> root node has also been added to the DOM tree, the construction of the DOM tree is complete.
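A minimal sketch of the stack-based construction described above, in a simplified variant where each node is attached to its parent when its start tag is pushed and removed from the stack when its end tag arrives; the token format and class names are illustrative only:

```python
class Node:
    def __init__(self, tag):
        self.tag, self.children = tag, []

def build_dom(tokens):
    """Builds a DOM tree from a stream of (kind, tag) tokens using a
    stack of open tags: start tags push a node (attached to the node
    currently on top), and a matching end tag pops the stack."""
    root = None
    stack = []
    for kind, tag in tokens:
        if kind == "start":
            node = Node(tag)
            if stack:
                stack[-1].children.append(node)
            else:
                root = node
            stack.append(node)
        elif kind == "end":
            if stack and stack[-1].tag == tag:
                stack.pop()
    return root

tokens = [("start", "html"), ("start", "body"), ("start", "h1"),
          ("end", "h1"), ("end", "body"), ("end", "html")]
tree = build_dom(tokens)
```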
In practical applications, an existing parsing tool such as LXML can also be used to parse the HTML source code into a DOM tree.
In step S103 of some embodiments, the DOM tree is traversed to obtain the corresponding node sequence. Specifically, depth-first search or breadth-first search can be used to traverse the DOM tree, yielding the corresponding node sequence, where the node sequence includes one root node and multiple leaf nodes; a leaf node, i.e., the tag node mentioned in the embodiments of this application, represents a webpage tag in the HTML source code, such as <h1>.

In step S104 of some embodiments, multiple node paths of the node sequence are acquired, where each node path is the path from a tag node to the root node. For example, for the tag node <h1>, the path to the root node <html> may be expressed as "/html/body/div/h1".

In step S105 of some embodiments, a first target path is obtained from a preset sample set according to the multiple node paths, where the preset sample set is a collection of sample paths gathered in advance; each first target path is identical or similar to some node path and serves as input to the pre-training model. It should be noted that the method for obtaining the sample paths may refer to steps S101 to S104 of the embodiments of this application; the data sources corresponding to the sample paths may be news webpages, including public opinion news and policy news, each of which corresponds to multiple webpage sources, i.e., source code data of multiple webpages. In practical applications, to improve the accuracy of webpage data extraction, source code data of as many webpages as possible should be collected.

In step S106 of some embodiments, the first target path is input into the pre-training model for path screening to obtain a second target path; there may be one or more second target paths. It should be noted that not every first target path can yield webpage data that meets the requirements; the purpose of the pre-training model is to screen, out of the first target paths, the second target paths that can effectively extract webpage data, such as the webpage body, ensuring that the webpage data extracted according to the second target path corresponds to the body of the webpage.

In step S107 of some embodiments, the corresponding target webpage data is extracted from the source code data according to the second target path. Specifically, the second path is restored to the corresponding DOM tree by post-order traversal, and the corresponding target webpage data, such as the body, title, and time of the target webpage, is output in order.
In some embodiments, as shown in FIG. 2, step S105 specifically includes, but is not limited to, steps S201 to S202.

Step S201: acquiring the sample set;

Step S202: obtaining, from the sample set, a sample path identical to the node path as the first target path.

In step S201 of some embodiments, the sample set collected in advance is acquired, where the sample set includes multiple sample paths; the sample paths are used to build the pre-training model. After the node paths of the target webpage are acquired, they first need to be matched against the sample paths in the sample set.

In step S202 of some embodiments, if a sample path identical to the node path is found in the sample set, that node path is taken as the first target path. If no sample path identical to the node path is found in the sample set, steps S301 to S303 are executed. It should be noted that after the node paths are acquired, they cannot be used directly as input to the pre-training model; a sample path identical or similar to each node path must be found in the sample set, i.e., the first target path serves as the input to the pre-training model, which ensures that the pre-training model trained through the first target path can output the corresponding webpage data.
In some embodiments, as shown in FIG. 3, after step S202, the webpage data extraction method of the embodiments of this application further includes, but is not limited to, steps S301 to S303.

Step S301: obtaining a first path from the multiple node paths;

Step S302: calculating the similarity between the first path and each sample path;

Step S303: taking the sample path with the greatest similarity as the first target path.

In step S301 of some embodiments, first paths that differ from all sample paths in the sample set are found among the multiple node paths.

In step S302 of some embodiments, for each first path, the similarity between the first path and each sample path must be calculated, where the similarity is obtained by comparing the path nodes in the first path with the corresponding sample nodes in each sample path; after the similarities between all first paths and each sample path are calculated, multiple similarities are obtained.

In step S303 of some embodiments, the sample path with the greatest similarity is taken as the first target path; in other words, the sample path most similar to the node path is selected from the sample set as the first target path and used as input to the pre-training model. It should be noted that if a similar first target path is not found in the sample set to substitute for the node path, the corresponding webpage data may fail to be extracted, because the pre-training model was not trained on that node path, which affects the accuracy of webpage data extraction. Considering this situation, in the prediction stage of the pre-training model, i.e., before the first target paths are actually screened by the model, the embodiments of this application substitute similar nodes to obtain the first target path, accounting for node paths that do not correspond to samples in the sample set and ensuring the effectiveness of the pre-training model for webpage data extraction.
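The similarity-based selection in steps S301 to S303 can be sketched as follows; the node-wise comparison used as the similarity measure is one plausible reading of the comparison described above, not the patent's mandated formula:

```python
def path_similarity(a, b):
    """Fraction of aligned positions at which the two paths carry the
    same tag; a simple node-wise comparison of two tag paths."""
    ta = [t for t in a.split("/") if t]
    tb = [t for t in b.split("/") if t]
    same = sum(1 for x, y in zip(ta, tb) if x == y)
    return same / max(len(ta), len(tb))

def first_target_path(node_path, sample_paths):
    """If no sample path equals the node path, picks the most similar
    sample path as the first target path (steps S301 to S303)."""
    if node_path in sample_paths:
        return node_path
    return max(sample_paths, key=lambda s: path_similarity(node_path, s))

samples = ["/html/body/div/h1", "/html/body/div/p", "/html/head/title"]
best = first_target_path("/html/body/div/span", samples)
```

Ties are broken here by taking the first maximum, which is an implementation detail; a real system might break ties differently.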
在一些实施例中,如图4所示,在步骤S106之前,本申请实施例的网页数据的提取方法还包括:构建预训练模型,具体包括但不限于步骤S401至步骤S404。
步骤S401,获取训练样本;
步骤S402,将样本序列和样本特征输入到原始训练模型;
步骤S403,根据样本序列和样本特征,对原始训练模型的损失函数进行计算,得到损失值;
步骤S404,根据损失值更新原始训练模型,得到预训练模型。
在一些实施例的步骤S401中,获取训练样本,其中训练样本包括样本序列和对应的样本特征。具体地,先采集样本数据,样本数据即采集多个网页的源码数据,接着将源码数据进行解析得到DOM树(包括一个父节点和对多个样本节点),并对DOM树进行遍历,得到每个DOM树对应的样本序列x
1,x
2,…,x
n,其具体的过程可参照上述实施例的步骤S101至步骤S104,在此不再赘述。在得到样本序列之后,还需要根据样本序列进行标注,得到样本特征。
具体地,本申请实施例最主要的样本特征为DOM树中的父节点(一般为html标签)到当前的样本节点的路径,例如当前的样本节点为x
1,x
1可能对应的标签序列为“/html/body/div/h1”,其中每一样本节点对应一个样本标签,上述举例中的样本节点x
1对应的样本标签则为“h1”。需要说明的是,每一个样本序列类比一句英文中的某个单词,而样本序列中的样本标签对应某个单词中的字母。
此外,本申请实施例的样本特征还可以为一些额外的特征,根据当前节点所对应的文本 数据提取出额外特征,例如标点符号的个数、虚词个数、是否包含“h1”标签,是否包含“p”标签,文本的向量表示等,其中额外特征对于样本节点的类别有较强的相关性,例如标题一般都在“h1”、“h2”等标签中;文本的向量表示可以通过使用开源的文本表示工具text2vec得到。
在一些实施例中,除了需要根据样本序列提取对应的样本特征,还需要对序列进行数据标注,具体地,如果设定预训练模型所提取到的网页数据为网页的标题、时间和正文,则在模型构建之前,就需要从样本序列中标注出标题、时间和正文三个字段,得到标注好的样本序列。在实际应用中,可以利用lable-studio等工具进行人工标注。
在一些实施例的步骤S402中,将样本序列和样本特征输入到原始训练模型。在本申请实施例中,所使用的原始训练模型的骨架为BiLSTM+CRF。
在一些实施例的步骤S403中,根据样本序列和样本特征,对原始训练模型的损失函数进行计算,得到损失值。在本申请实施例中,具体运用到的损失函数为CRF损失函数。
在一些实施例的步骤S404中,根据损失值更新原始训练模型,得到预训练模型。具体地,在训练过程中,修正原始训练模型的损失函数,使原始训练模型根据目标损失值进行训练,朝着新的目标优化,得到优化后的原始训练模型,也即本申请实施例提到的预训练模型。
在一些实施例中,如图5所示,步骤S403具体包括但不限于步骤S501至步骤S505。
步骤S501,样本序列进行编码处理得到序列向量,并对样本特征进行编码处理得到特征向量;
步骤S502,将序列向量和特征向量进行拼接,得到拼接向量;
步骤S503,根据预设的筛选率对拼接向量进行筛选处理,得到筛选向量;
步骤S504,根据预设的分类字段对筛选向量进行字段分类处理,得到对应的分类数据;
步骤S505,根据分类数据对原始训练模型的损失函数进行计算,得到损失值。
在一些实施例的步骤S501中,将样本序列进行编码处理得到序列向量,并对样本特征进行编码处理得到特征向量,具体地,可以通过原始训练模型的嵌入层将样本序列x
1,x
2,…,x
n映射到E(x
1),E(x
2),…,E(x
n),即序列向量。此外,还需要设置每个序列向量的维数,例如50或150等,其中维数是个先验选择。在实际应用中,不能将维数设置得过大,否则会导致过拟合,也不能将维数设置得过小,否则会导致欠拟合。
在一些实施例的步骤S502中,将特征向量拼接到序列向量E(x
n)中,得到拼接向量E
concat(x
i)。
在一些实施例的步骤S503中,将拼接向量E
concat(x
i)输入至原始训练模型的dropout层,dropout层根据筛选率对拼接向量进行筛选处理,得到筛选向量。具体地,dropout层根据筛选率随机将某些神经元置为0,该步骤起到正则化的作用。
在一些实施例的步骤S504中,根据预设的分类字段对筛选向量进行字段分类处理,得到对应的分类数据。具体地,将步骤S503得到的筛选向量输入到原始训练模型的BiLSTM层中,并设置好BlLSTM的隐藏层的维数,例如150,然后接入另一个设定好筛选率的dropout层,之后用一个全连接层对之前的向量进行拼接处理后,进入CRF层,其中CRF层就根据预设的分类字段,以及提前对样本序列所标注的信息,例如标题、时间和正文,输出三个类别的分类数据。
在一些实施例的步骤S505中,根据分类数据对原始训练模型的损失函数进行计算,得到损失值。其中原始训练模型的损失函数可选择为CRF损失函数,计算得到损失值后进行反向传播来调整原始训练模型中各神经网络的权重,从而得到训练好的预训练模型。
在实际应用中,计算损失函数对于各个参数的梯度,然后根据参数的梯度值,结合学习率按照优化器设定的规则更新参数。具体使用Adam优化器对原始训练模型进行训练,可以将样本数量设置为32,学习率设置为0.001,同时使用R-Drop技术给原始训练模型加上惩罚项。
在一些实施例中,对网页的信息进行分析后,会发现网页中包含大量与网页主题无关的噪声内容,如版权信息、广告链接和导航栏等,在进行网页数据提取的过程中,这些网页噪 声会影响提取的效果,因此需要通过去噪的方式对网页进行预处理。
在一些实施例中,样本序列包括父节点和多个样本节点,每一样本节点包括网页标签;如图6所示,在步骤S402之前,本申请实施例的网页数据的提取方法还包括但不限于步骤S601至步骤S604。
步骤S601,获取样本序列的多个样本路径;
步骤S602,获取预设的无关标签;
步骤S603,根据无关标签,从多个样本路径中获取第二路径;
步骤S604,删除第二路径所对应的样本节点,以更新样本序列。
在一些实施例的步骤S601中,获取样本序列的多个样本路径,其中每一样本路径为每一样本节点到父节点的路径。
在一些实施例的步骤S602中,获取预设的无关标签,其中无关标签指的是与网页数据提取所不相关的标签,例如用于表示图像的“img”标签、用于定义客户端脚本的“script”标签、用于表示视频的“video”以及注释标签等。
在一些实施例的步骤S603中,对于每一条样本路径,都需要样本路径下的每个网页标签是否为无关标签,如果一条样本路径下的一个或多个网页标签为无关标签,将该样本路径标记为第二路径。
在一些实施例的步骤S604中,删除第二路径所对应的样本节点,以更新样本序列。由于无关标签与网页主题内容的相关性很低,在对原始训练模型进行训练之前将这部分内容过滤掉,去掉无关的噪声内容,从而提高提取网页数据的准确率。
在一些实施例中,如图7所示,在步骤S107之后,本申请实施例的网页数据的提取方法还包括但不限于步骤S701至步骤S703。
S701,获取目标网页数据中的网页时间数据;
S702,根据预设的数据格式对网页时间数据进行标准化处理,得到标准时间数据;
S703,根据标准时间数据更新网页时间数据。
在一些实施例的步骤S701中,获取目标网页数据中的表示时间的网页时间数据。
在一些实施例的步骤S702中,根据预设的数据格式对网页时间数据进行标准化处理,得到标准时间数据,例如预设的数据格式为“年/月/日”,所提取到是网页时间数据为“2021-10-24 17:12:00”,需要按照“年/月/日”的数据格式将网页时间数据进行调整,得到标准时间数据,即“2021/10/24”。
在一些实施例的步骤S703中,将网页时间数据更新为标准时间数据,本申请实施例通过对网页时间数据进行标准化处理,便于后续进行数据库的保存。
在一些实施例中,本申请实施例除了利用BiLSTM+CRF模型的方法对网页数据进行提取,还结合了开源的GNE模块对网页数据进行提取。在BiLSTM+CRF模型的基础上结合GNE模块对网页数据进行提取的目的是,防止BiLSTM+CRF模型所提取到的网页数据不全面,例如提取不到正文数据,此时可通过GNE模块提取相应的正文数据,由此保证能够完整提取目标网页对应的网页数据。本申请实施例通过结合传统的统计方法以及基于深度学习的方法,进一步提高了网页数据提取的精度。
本申请实施例提出的网页数据的提取方法,通过获取目标网页的源码数据,对源码数据进行解析得到对应的DOM树;对DOM树进行遍历处理,得到对应的节点序列,其中节点序列包括根节点和多个标签节点;获取节点序列的多个节点路径,其中每一节点路径为每一标签节点到根节点的路径;根据多个节点路径从预设的样本集中获取第一目标路径,将第一目标路径输入至预训练模型进行路径筛选处理,得到第二目标路径;根据第二目标路径从源码数据提取对应的目标网页数据。本申请实施例通过预训练模型分析第一目标路径中的标签节点情况,从而可以基于同一预训练模型根据同一类型的网页从第一目标路径筛选出第二目标路径,通过第二目标路径就能从源码数据直接提取到目标网页数据,不需要人工构建专门的路径模板,从而提高网页数据的提取效率。
本申请实施例还提供一种网页数据的提取装置,如图8所示,可以实现上述网页数据的提取方法,该网页数据的提取装置包括:第一获取模块801、数据解析模块802、遍历模块803、第二获取模块804、第三获取模块805、路径筛选模块806和数据提取模块807,第一获取模块801用于获取目标网页的源码数据;数据解析模块802用于对源码数据进行解析处理,得到对应的DOM树;遍历模块803用于对DOM树进行遍历处理,得到对应的节点序列;其中节点序列包括根节点和多个标签节点;第二获取模块804用于获取节点序列的多个节点路径;其中每一节点路径为每一标签节点到根节点的路径;第三获取模块805用于根据多个节点路径从预设的样本集中获取第一目标路径;路径筛选模块806用于将第一目标路径输入至预训练模型进行路径筛选处理,得到第二目标路径;数据提取模块807根据第二目标路径从源码数据提取对应的目标网页数据。
本申请实施例的网页数据的提取装置用于执行上述实施例中的网页数据的提取方法,其具体处理过程与上述实施例中的网页数据的提取方法相同,此处不再一一赘述。
本申请实施例还提供了一种计算机设备,包括:
至少一个处理器,以及,
与至少一个处理器通信连接的存储器;其中,
存储器存储有指令,指令被至少一个处理器执行,以使至少一个处理器执行指令时实现一种网页数据的提取方法,其中,所述网页数据的提取方法包括:
获取目标网页的源码数据;
对源码数据进行解析处理,得到对应的DOM树;
对DOM树进行遍历处理,得到对应的节点序列;其中,节点序列包括根节点和多个标签节点;
获取节点序列的多个节点路径;其中,每一节点路径为每一标签节点到根节点的路径;
根据多个节点路径从预设的样本集中获取第一目标路径;
将第一目标路径输入至预训练模型进行路径筛选处理,得到第二目标路径;
根据第二目标路径从源码数据提取对应的目标网页数据。
下面结合图9对计算机设备的硬件结构进行详细说明。该计算机设备包括:处理器901、存储器902、输入/输出接口903、通信接口904和总线905。
处理器901,可以采用通用的中央处理器(Central Processin Unit,CPU)、微处理器、应用专用集成电路(Application Specific Integrated Circuit,ASIC)、或者一个或多个集成电路等方式实现,用于执行相关程序,以实现本申请实施例所提供的技术方案;
存储器902,可以采用只读存储器(Read Only Memory,ROM)、静态存储设备、动态存储设备或者随机存取存储器(Random Access Memory,RAM)等形式实现。存储器902可以存储操作系统和其他应用程序,在通过软件或者固件来实现本说明书实施例所提供的技术方案时,相关的程序代码保存在存储器902中,并由处理器901来调用执行本申请实施例的网页数据的提取方法;
输入/输出接口903,用于实现信息输入及输出;
通信接口904,用于实现本设备与其他设备的通信交互,可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信;和
总线905,在设备的各个组件(例如处理器901、存储器902、输入/输出接口903和通信接口904)之间传输信息;
其中处理器901、存储器902、输入/输出接口903和通信接口904通过总线905实现彼此之间在设备内部的通信连接。
An embodiment of the present application further provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions; the computer-executable instructions cause a computer to execute a web page data extraction method, the method including:
obtaining source code data of a target web page;
parsing the source code data to obtain a corresponding DOM tree;
traversing the DOM tree to obtain a corresponding node sequence, where the node sequence includes a root node and a plurality of tag nodes;
obtaining a plurality of node paths of the node sequence, where each node path is the path from a tag node to the root node;
obtaining a first target path from a preset sample set according to the plurality of node paths;
inputting the first target path into a pre-trained model for path screening to obtain a second target path; and
extracting corresponding target web page data from the source code data according to the second target path.
The computer-readable storage medium may be non-volatile or volatile. As a non-transitory computer-readable storage medium, the memory may be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
With the web page data extraction method, web page data extraction apparatus, computer device, and storage medium proposed by the embodiments of the present application, source code data of a target web page is obtained and parsed to obtain a corresponding DOM tree; the DOM tree is traversed to obtain a corresponding node sequence, and a plurality of node paths of the node sequence are obtained, where each node path is the path from a tag node of the node sequence to the root node; a first target path is obtained from a preset sample set according to the plurality of node paths and input into a pre-trained model for path screening to obtain a second target path; and the corresponding target web page data is extracted from the source code data according to the second target path. The pre-trained model can thus analyze the tag nodes in the first target path, so that the second target path can be screened out of the first target path for web pages of the same type based on the same pre-trained model; the target web page data can then be extracted directly from the source code data through the second target path, without manually constructing a dedicated path template, thereby improving the efficiency of web page data extraction.
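For node paths that match no sample path exactly, the claims describe falling back to the most similar sample path. A minimal sketch, assuming `difflib.SequenceMatcher` as the similarity measure (the application does not fix a specific metric):

```python
from difflib import SequenceMatcher

def most_similar_sample_path(first_path: str, sample_paths) -> str:
    """Pick the sample path with the highest similarity to a node path
    that matched nothing exactly. SequenceMatcher's ratio (0..1 over
    character subsequences) is one possible similarity measure."""
    return max(sample_paths,
               key=lambda s: SequenceMatcher(None, first_path, s).ratio())

best = most_similar_sample_path(
    "html/body/div/article/p",
    ["html/body/div/p", "html/head/meta", "html/body/footer/a"],
)
# best == "html/body/div/p"
```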
The embodiments described herein are intended to illustrate the technical solutions of the embodiments of the present application more clearly and do not limit them; those skilled in the art will appreciate that, as the technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
Those skilled in the art will understand that the technical solutions shown in FIG. 1 to FIG. 7 do not limit the embodiments of the present application, which may include more or fewer steps than illustrated, combine certain steps, or use different steps.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, without thereby limiting the scope of the claims. Any modification, equivalent replacement, or improvement made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims.
Claims (20)
- A web page data extraction method, comprising: obtaining source code data of a target web page; parsing the source code data to obtain a corresponding DOM tree; traversing the DOM tree to obtain a corresponding node sequence, wherein the node sequence comprises a root node and a plurality of tag nodes; obtaining a plurality of node paths of the node sequence, wherein each node path is the path from a tag node to the root node; obtaining a first target path from a preset sample set according to the plurality of node paths; inputting the first target path into a pre-trained model for path screening to obtain a second target path; and extracting corresponding target web page data from the source code data according to the second target path.
- The method according to claim 1, wherein obtaining a first target path from a preset sample set according to the plurality of node paths comprises: obtaining the sample set, wherein the sample set comprises a plurality of sample paths; and taking, from the sample set, a sample path identical to a node path as the first target path.
- The method according to claim 2, wherein after taking, from the sample set, a sample path identical to a node path as the first target path, the method further comprises: obtaining a first path from the plurality of node paths, wherein the first path differs from every sample path in the sample set; computing the similarity between the first path and each sample path; and taking the sample path corresponding to the greatest similarity as the first target path.
- The method according to claim 1, wherein before inputting the first target path into a pre-trained model for path screening to obtain a second target path, the method further comprises constructing the pre-trained model, specifically comprising: obtaining training samples, wherein the training samples comprise sample sequences and corresponding sample features; inputting the sample sequences and the sample features into an original training model; computing a loss function of the original training model according to the sample sequences and the sample features to obtain a loss value; and updating the original training model according to the loss value to obtain the pre-trained model.
- The method according to claim 4, wherein computing a loss function of the original training model according to the sample sequences and the sample features to obtain a loss value comprises: encoding the sample sequence to obtain a sequence vector, and encoding the sample features to obtain a feature vector; concatenating the sequence vector and the feature vector to obtain a concatenated vector; screening the concatenated vector according to a preset screening rate to obtain a screened vector; performing field classification on the screened vector according to preset classification fields to obtain corresponding classification data; and computing the loss function of the original training model according to the classification data to obtain the loss value.
- The method according to claim 4, wherein the sample sequence comprises a parent node and a plurality of sample nodes, each sample node comprising a web page tag; and before inputting the sample sequences and the sample features into the original training model, the method further comprises updating the sample sequence, specifically comprising: obtaining a plurality of sample paths of the sample sequence, wherein each sample path is the path from a sample node to the parent node; obtaining preset irrelevant tags; obtaining a second path from the plurality of sample paths according to the irrelevant tags, wherein at least one web page tag under the second path is identical to an irrelevant tag; and deleting the sample nodes corresponding to the second path to update the sample sequence.
- The method according to any one of claims 1 to 6, wherein after extracting corresponding target web page data from the source code data according to the second target path, the method further comprises: obtaining web page time data from the target web page data; standardizing the web page time data according to a preset data format to obtain standard time data; and updating the web page time data according to the standard time data.
- A web page data extraction apparatus, comprising: a first obtaining module configured to obtain source code data of a target web page; a data parsing module configured to parse the source code data to obtain a corresponding DOM tree; a traversal module configured to traverse the DOM tree to obtain a corresponding node sequence, wherein the node sequence comprises a root node and a plurality of tag nodes; a second obtaining module configured to obtain a plurality of node paths of the node sequence, wherein each node path is the path from a tag node to the root node; a third obtaining module configured to obtain a first target path from a preset sample set according to the plurality of node paths; a path screening module configured to input the first target path into a pre-trained model for path screening to obtain a second target path; and a data extraction module configured to extract corresponding target web page data from the source code data according to the second target path.
- A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor is configured to execute a web page data extraction method, the method comprising: obtaining source code data of a target web page; parsing the source code data to obtain a corresponding DOM tree; traversing the DOM tree to obtain a corresponding node sequence, wherein the node sequence comprises a root node and a plurality of tag nodes; obtaining a plurality of node paths of the node sequence, wherein each node path is the path from a tag node to the root node; obtaining a first target path from a preset sample set according to the plurality of node paths; inputting the first target path into a pre-trained model for path screening to obtain a second target path; and extracting corresponding target web page data from the source code data according to the second target path.
- The computer device according to claim 9, wherein obtaining a first target path from a preset sample set according to the plurality of node paths comprises: obtaining the sample set, wherein the sample set comprises a plurality of sample paths; and taking, from the sample set, a sample path identical to a node path as the first target path.
- The computer device according to claim 10, wherein after taking, from the sample set, a sample path identical to a node path as the first target path, the method further comprises: obtaining a first path from the plurality of node paths, wherein the first path differs from every sample path in the sample set; computing the similarity between the first path and each sample path; and taking the sample path corresponding to the greatest similarity as the first target path.
- The computer device according to claim 9, wherein before inputting the first target path into a pre-trained model for path screening to obtain a second target path, the method further comprises constructing the pre-trained model, specifically comprising: obtaining training samples, wherein the training samples comprise sample sequences and corresponding sample features; inputting the sample sequences and the sample features into an original training model; computing a loss function of the original training model according to the sample sequences and the sample features to obtain a loss value; and updating the original training model according to the loss value to obtain the pre-trained model.
- The computer device according to claim 12, wherein computing a loss function of the original training model according to the sample sequences and the sample features to obtain a loss value comprises: encoding the sample sequence to obtain a sequence vector, and encoding the sample features to obtain a feature vector; concatenating the sequence vector and the feature vector to obtain a concatenated vector; screening the concatenated vector according to a preset screening rate to obtain a screened vector; performing field classification on the screened vector according to preset classification fields to obtain corresponding classification data; and computing the loss function of the original training model according to the classification data to obtain the loss value.
- The computer device according to claim 12, wherein the sample sequence comprises a parent node and a plurality of sample nodes, each sample node comprising a web page tag; and before inputting the sample sequences and the sample features into the original training model, the method further comprises updating the sample sequence, specifically comprising: obtaining a plurality of sample paths of the sample sequence, wherein each sample path is the path from a sample node to the parent node; obtaining preset irrelevant tags; obtaining a second path from the plurality of sample paths according to the irrelevant tags, wherein at least one web page tag under the second path is identical to an irrelevant tag; and deleting the sample nodes corresponding to the second path to update the sample sequence.
- A storage medium, being a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the computer is configured to execute a web page data extraction method, the method comprising: obtaining source code data of a target web page; parsing the source code data to obtain a corresponding DOM tree; traversing the DOM tree to obtain a corresponding node sequence, wherein the node sequence comprises a root node and a plurality of tag nodes; obtaining a plurality of node paths of the node sequence, wherein each node path is the path from a tag node to the root node; obtaining a first target path from a preset sample set according to the plurality of node paths; inputting the first target path into a pre-trained model for path screening to obtain a second target path; and extracting corresponding target web page data from the source code data according to the second target path.
- The storage medium according to claim 15, wherein obtaining a first target path from a preset sample set according to the plurality of node paths comprises: obtaining the sample set, wherein the sample set comprises a plurality of sample paths; and taking, from the sample set, a sample path identical to a node path as the first target path.
- The storage medium according to claim 16, wherein after taking, from the sample set, a sample path identical to a node path as the first target path, the method further comprises: obtaining a first path from the plurality of node paths, wherein the first path differs from every sample path in the sample set; computing the similarity between the first path and each sample path; and taking the sample path corresponding to the greatest similarity as the first target path.
- The storage medium according to claim 15, wherein before inputting the first target path into a pre-trained model for path screening to obtain a second target path, the method further comprises constructing the pre-trained model, specifically comprising: obtaining training samples, wherein the training samples comprise sample sequences and corresponding sample features; inputting the sample sequences and the sample features into an original training model; computing a loss function of the original training model according to the sample sequences and the sample features to obtain a loss value; and updating the original training model according to the loss value to obtain the pre-trained model.
- The storage medium according to claim 18, wherein computing a loss function of the original training model according to the sample sequences and the sample features to obtain a loss value comprises: encoding the sample sequence to obtain a sequence vector, and encoding the sample features to obtain a feature vector; concatenating the sequence vector and the feature vector to obtain a concatenated vector; screening the concatenated vector according to a preset screening rate to obtain a screened vector; performing field classification on the screened vector according to preset classification fields to obtain corresponding classification data; and computing the loss function of the original training model according to the classification data to obtain the loss value.
- The storage medium according to claim 18, wherein the sample sequence comprises a parent node and a plurality of sample nodes, each sample node comprising a web page tag; and before inputting the sample sequences and the sample features into the original training model, the method further comprises updating the sample sequence, specifically comprising: obtaining a plurality of sample paths of the sample sequence, wherein each sample path is the path from a sample node to the parent node; obtaining preset irrelevant tags; obtaining a second path from the plurality of sample paths according to the irrelevant tags, wherein at least one web page tag under the second path is identical to an irrelevant tag; and deleting the sample nodes corresponding to the second path to update the sample sequence.
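The time-standardization step of claim 7 can be sketched with Python's standard library. The candidate input formats and the target format here are assumptions for illustration; the claim only requires normalizing scraped time strings to one preset data format.

```python
from datetime import datetime

# Candidate source formats; a real system would cover many more.
CANDIDATE_FORMATS = ["%Y年%m月%d日 %H:%M", "%Y/%m/%d %H:%M:%S", "%d-%m-%Y"]
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed preset data format

def standardize_time(raw: str) -> str:
    """Normalize a scraped web page time string to the preset format;
    values that match no known format are left unchanged."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue
    return raw

print(standardize_time("2022/02/16 09:30:00"))  # 2022-02-16 09:30:00
```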
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210143571.5A CN114491325A (zh) | 2022-02-16 | 2022-02-16 | Method and apparatus for extracting web page data, computer device, and storage medium |
CN202210143571.5 | 2022-02-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023155303A1 true WO2023155303A1 (zh) | 2023-08-24 |
Family
ID=81482466
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/090719 WO2023155303A1 (zh) | 2022-02-16 | 2022-04-29 | Method and apparatus for extracting web page data, computer device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114491325A (zh) |
WO (1) | WO2023155303A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117407615A (zh) * | 2023-10-27 | 2024-01-16 | 北京数立得科技有限公司 | Reinforcement-learning-based web information extraction method and system |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116049597B * | 2023-01-10 | 2024-04-19 | 北京百度网讯科技有限公司 | Pre-training method and apparatus for a multi-task web page model, and electronic device |
CN116108235B * | 2023-02-20 | 2023-11-10 | 上海安博通信息科技有限公司 | Path acquisition method and apparatus for tree structures, and processing device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130014002A1 (en) * | 2011-06-15 | 2013-01-10 | Alibaba Group Holding Limited | Method and System of Extracting Web Page Information |
CN108733405A (zh) * | 2017-04-13 | 2018-11-02 | 富士通株式会社 | Method and apparatus for training a distributed representation model of web pages |
CN111966831A (zh) * | 2020-08-18 | 2020-11-20 | 创新奇智(上海)科技有限公司 | Model training method, text classification method and apparatus, and network model |
CN112667940A (zh) * | 2020-10-15 | 2021-04-16 | 广东电子工业研究院有限公司 | Deep-learning-based web page body text extraction method |
CN112732994A (zh) * | 2021-01-07 | 2021-04-30 | 上海携宁计算机科技股份有限公司 | Method, apparatus, device, and storage medium for extracting web page information |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111339396B (zh) * | 2018-12-18 | 2024-04-16 | 富士通株式会社 | Method, apparatus, and computer storage medium for extracting web page content |
2022
- 2022-02-16: CN application CN202210143571.5A, published as CN114491325A (zh), status: active, pending
- 2022-04-29: WO application PCT/CN2022/090719, published as WO2023155303A1 (zh), status: unknown
Also Published As
Publication number | Publication date |
---|---|
CN114491325A (zh) | 2022-05-13 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22926616; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |