CN115525730B - Webpage content extraction method and device based on page weighting and electronic equipment - Google Patents

Webpage content extraction method and device based on page weighting and electronic equipment

Publication number
CN115525730B
Authority
CN
China
Prior art keywords
webpage
text
search
weight
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210184453.9A
Other languages
Chinese (zh)
Other versions
CN115525730A (en)
Inventor
吴佳鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Vision Digital Technology Co ltd
Original Assignee
Shandong Vision Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Vision Digital Technology Co ltd filed Critical Shandong Vision Digital Technology Co ltd
Priority to CN202210184453.9A priority Critical patent/CN115525730B/en
Publication of CN115525730A publication Critical patent/CN115525730A/en
Application granted granted Critical
Publication of CN115525730B publication Critical patent/CN115525730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of front ends and discloses a web page content extraction method, a web page content extraction device, and electronic equipment. The method comprises: constructing a Web graph from the access links included in a search web page set; extracting a preset number of search web pages from the search web page set according to the Web graph to obtain a set of web pages to be extracted; performing OCR (optical character recognition) on each web page in that set to obtain a first text set to be corrected; recognizing the text of the same set with a pre-trained text recognition model to obtain a second text set to be corrected; and correcting the first and second text sets to be corrected to obtain the web page content. The invention addresses two problems: web page content recognition is inefficient when there is too much web page content, and the accuracy of OCR recognition needs further improvement.

Description

Webpage content extraction method and device based on page weighting and electronic equipment
Technical Field
The present invention relates to the field of front-end technologies, and in particular, to a method and apparatus for extracting web page content based on page weighting, an electronic device, and a computer readable storage medium.
Background
As technology develops, the ways of sharing information keep increasing, and sharing through web pages is currently the main one. However, web pages generally support browsing only. Users who need to download the information in a web page for scientific research, data analysis, and similar purposes cannot meet their needs by browsing alone, so the content of the web page must be recognized and extracted.
Current mainstream web page content recognition is based mainly on OCR. OCR can recognize most web page content, but when there are too many web pages, recognizing them one by one with OCR takes too long. In addition, the accuracy of OCR recognition of web page content needs further improvement.
Disclosure of Invention
The invention provides a web page content extraction method and device based on page weighting, and a computer readable storage medium. Its main aim is to solve the problems that web page content recognition is inefficient when there is too much web page content and that the accuracy of OCR recognition needs further improvement.
In order to achieve the above object, the present invention provides a method for extracting web page content based on page weighting, comprising:
starting a search engine, receiving keywords, and searching the search engine for a search web page set related to the keywords;
setting the same initial weight for each search web page in the search web page set, and constructing a Web graph according to the access links included in each search web page in the search web page set;
calculating, according to the Web graph, the update weight of each search web page in turn, and updating the initial weight according to the update weight to obtain the historical weight corresponding to each search web page;
sorting the search web page set according to the historical weights, and extracting a preset number of search web pages according to the ranking to obtain a set of web pages to be extracted;
performing OCR (optical character recognition) on each web page in the set of web pages to be extracted to obtain a first text set to be corrected;
recognizing the text of the set of web pages to be extracted with a pre-trained text recognition model to obtain a second text set to be corrected;
and correcting the first text set to be corrected and the second text set to be corrected to obtain the web page content.
Optionally, performing OCR recognition on each web page in the set of web pages to be extracted to obtain the first text set to be corrected includes:
scanning the set of web pages to be extracted to obtain a web page image set;
performing text recognition on the web page image set to obtain text information, and calculating the text confidence of the text information;
and cleaning the text information according to the text confidence to obtain the first text set to be corrected.
Optionally, cleaning the text information according to the text confidence to obtain the first text set to be corrected includes:
setting a text confidence threshold;
and when the text confidence is lower than the text confidence threshold, removing the corresponding text, until the first text set to be corrected is obtained.
Optionally, pre-training the text recognition model includes:
receiving an original text set and an original BERT language model, and performing a masking operation on the original text set according to a preset percentage to obtain a masked text set;
performing classification training on the original BERT language model according to a preset probability by using the masked text set to obtain a trained BERT language model;
and fine-tuning the trained BERT language model to obtain the text recognition model.
Optionally, fine-tuning the trained BERT language model to obtain the text recognition model includes:
receiving a fine-tuning text set that comprises a correct sentence set and a corresponding error sentence set, inputting the fine-tuning text set into the trained BERT language model, and fine-tuning the trained BERT language model with a pre-built fine-tuning method and the correct sentence set to generate a correct-sentence correct-word fine-tuning model;
masking the error words in the error sentence set to obtain masked error words, extracting the correct words corresponding to the error words from the correct sentence set, setting those correct words as prediction targets, and fine-tuning the trained BERT language model with the prediction targets and the masked error words to obtain an error-sentence error-word fine-tuning model;
masking the correct words in the error sentence set to obtain masked correct words, setting those correct words as prediction targets, and fine-tuning the trained BERT language model with the prediction targets and the masked correct words to obtain an error-sentence correct-word fine-tuning model;
and obtaining the text recognition model based on the correct-sentence correct-word fine-tuning model, the error-sentence error-word fine-tuning model, and the error-sentence correct-word fine-tuning model.
Optionally, searching the search engine for the search web page set related to the keywords includes:
looking up the web page database corresponding to the search engine, and extracting the web page tags included in the web page database to obtain a plurality of web page tag sets;
calculating the text distance between the keywords and each web page tag set;
and selecting the search web pages whose text distance is smaller than a specified threshold to obtain the search web page set.
Optionally, extracting the web page tags included in the web page database to obtain a plurality of web page tag sets includes:
extracting the web page keywords of each web page in the web page database in turn to obtain a web page keyword set;
removing stop words from the web page keyword set to obtain a core keyword set;
and recombining the core keywords to obtain the web page tag set corresponding to the web page.
In order to solve the above problems, the present invention further provides a device for extracting web page content based on page weighting, the device comprising:
a search web page construction module, used for starting a search engine, receiving keywords, and searching the search engine for a search web page set related to the keywords;
a historical weight calculation module, used for setting the same initial weight for each search web page in the search web page set, constructing a Web graph according to the access links included in each search web page in the search web page set, calculating the update weight of each search web page in turn according to the Web graph, and updating the initial weight according to the update weight to obtain the historical weight corresponding to each search web page;
a web page ranking module, used for sorting the search web page set according to the historical weights and extracting a preset number of search web pages according to the ranking to obtain a set of web pages to be extracted;
an OCR recognition module, used for performing OCR recognition on each web page in the set of web pages to be extracted to obtain a first text set to be corrected;
and a web page content extraction module, used for recognizing the text of the set of web pages to be extracted with a pre-trained text recognition model to obtain a second text set to be corrected, and correcting the first text set to be corrected and the second text set to be corrected to obtain the web page content.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the above page weighting-based web page content extraction method.
In order to solve the above-mentioned problems, the present invention also provides a computer readable storage medium having at least one instruction stored therein, the at least one instruction being executed by a processor in an electronic device to implement the above-mentioned page weighting-based web page content extraction method.
Compared with the background art, in which too much web page content directly reduces recognition efficiency and OCR recognition accuracy still needs improvement, the embodiment of the invention first performs a web page elimination operation. Specifically, after the search engine is started, a search web page set related to the keywords is searched in the search engine. Because the number of web pages in the search web page set is huge, the same initial weight is set for each search web page, a Web graph is constructed according to the access links included in each search web page, the update weight of each search web page is calculated in turn according to the Web graph, and the initial weight is updated according to the update weight to obtain the historical weight corresponding to each search web page. The search web page set is then sorted according to the historical weights, and a preset number of search web pages are extracted according to the ranking to obtain the set of web pages to be extracted. In addition, to improve on the accuracy of OCR recognition alone, the embodiment of the invention introduces a pre-trained text recognition model that recognizes the text of the set of web pages to be extracted, so that the accuracy of web page content recognition is improved through the dual recognition results of the text recognition model and OCR. Therefore, the web page content extraction method, device, electronic equipment, and computer readable storage medium based on page weighting can solve the problems that web page content recognition is inefficient when there is too much web page content and that the accuracy of OCR recognition needs further improvement.
Drawings
FIG. 1 is a flow chart of a method for extracting web page content based on page weighting according to an embodiment of the present invention;
FIG. 2 is a detailed flow chart of one of the steps shown in FIG. 1;
FIG. 3 is a detailed flow chart of another step of FIG. 1;
FIG. 4 is a functional block diagram of a device for extracting content of a web page based on page weighting according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device for implementing the method for extracting web page content based on page weighting according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides a web page content extraction method based on page weighting. The method may be executed by at least one of a server, a terminal, or another electronic device that can be configured to execute the method provided by the embodiment of the application. In other words, the method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server, or a cloud server cluster.
Referring to fig. 1, a flow chart of a method for extracting web page content based on page weighting according to an embodiment of the invention is shown. In this embodiment, the method for extracting web page content based on page weighting includes:
S1, starting a search engine, receiving keywords, and searching the search engine for a search web page set related to the keywords.
It should be explained that the search engine is a server that matches web pages based on text entered by the user. In the embodiment of the invention, the search engine presents a dialog box to the user, the user enters keywords in the dialog box, and the search engine matches the search web page set according to the keywords.
In detail, searching the search engine for the search web page set related to the keywords includes:
looking up the web page database corresponding to the search engine, and extracting the web page tags included in the web page database to obtain a plurality of web page tag sets;
calculating the text distance between the keywords and each web page tag set;
and selecting the search web pages whose text distance is smaller than a specified threshold to obtain the search web page set.
It should be explained that the web page database serves the search engine and stores the web address and the web page keywords of each web page. The web page tags are obtained by combining the web page keywords.
In detail, extracting the web page tags included in the web page database to obtain a plurality of web page tag sets includes:
extracting the web page keywords of each web page in the web page database in turn to obtain a web page keyword set;
removing stop words from the web page keyword set to obtain a core keyword set;
and recombining the core keywords to obtain the web page tag set corresponding to the web page.
For example, if the user inputs "deep learning and artificial intelligence", the web page tags are first computed from the web page database: the web page tag of web page A is "front end development", that of web page B is "data mining and intelligent application", that of web page C is "machine learning", that of web page D is "deep learning and intelligent development", and so on.
Further, the text distance may be calculated with a Euclidean distance method or the like, so that the search web page set related to the keywords can be screened by text distance. For example, the search web page set corresponding to the keywords "deep learning and artificial intelligence" may consist of web page B, web page C, web page D, and so on.
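The screening in S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the stop-word list, the word-count vector representation, the function names, and the threshold value are all assumptions, since the text only states that a Euclidean distance "or the like" may be used.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"and", "or", "the", "of", "a", "an"}


def tag_set(keywords):
    """Remove stop words and return the remaining core keywords."""
    return [w.lower() for w in keywords if w.lower() not in STOP_WORDS]


def text_distance(query_tags, page_tags):
    """Euclidean distance between word-count vectors (an assumed representation)."""
    q, p = Counter(query_tags), Counter(page_tags)
    vocabulary = set(q) | set(p)
    return math.sqrt(sum((q[w] - p[w]) ** 2 for w in vocabulary))


def search_pages(query, page_keywords, threshold):
    """Return the pages whose tag set is closer to the query than `threshold`."""
    query_tags = tag_set(query.split())
    return [
        name
        for name, keywords in page_keywords.items()
        if text_distance(query_tags, tag_set(keywords)) < threshold
    ]


pages = {
    "A": ["front", "end", "development"],
    "B": ["data", "mining", "and", "intelligent", "application"],
    "C": ["machine", "learning"],
    "D": ["deep", "learning", "and", "intelligent", "development"],
}
result = search_pages("deep learning and artificial intelligence", pages, threshold=2.5)
print(result)
```

With plain word counts, only pages C and D fall under the assumed threshold, because "intelligent" and "intelligence" are treated as unrelated words. Matching page B as in the example above would presumably require a representation that captures word similarity, which this sketch does not attempt.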
S2, setting the same initial weight for each search webpage in the search webpage set, and constructing a Web graph according to access links included in each search webpage in the search webpage set.
In the embodiment of the invention, the initial weights are unified to facilitate the subsequent calculation of the historical weight of each web page in the search engine. For example, the initial weights of web page B, web page C, and web page D are all set to 1.
It will be appreciated that there may be access links between web pages. For example, among search web pages B, C, and D above, if there is a link from search web page B pointing to search web page D, search web page D is considered more important than search web page B, so a portion of the initial weight of search web page B is assigned to search web page D. This portion is the update weight described above.
Thus, a Web graph can be constructed from the access links through which the web pages depend on one another.
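One way to represent such a Web graph is an adjacency mapping from each search web page to the pages it links to. The sketch below assumes that links pointing outside the search web page set, and links from a page to itself, are ignored; the source does not say how these cases are handled, and the function name and sample links are hypothetical.

```python
def build_web_graph(access_links):
    """Map each page to the sorted list of other in-set pages it links to.

    `access_links` maps a page name to the raw list of link targets found
    on that page. Targets outside the search set and self-links are dropped
    (an assumption of this sketch).
    """
    pages = set(access_links)
    return {
        page: sorted(t for t in set(targets) if t in pages and t != page)
        for page, targets in access_links.items()
    }


# Hypothetical access links: B links to D, C, and an external page X.
links = {"B": ["D", "C", "X"], "C": ["D"], "D": []}
graph = build_web_graph(links)
print(graph)
```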
S3, calculating the update weight of each search web page in turn according to the Web graph, and updating the initial weight according to the update weight to obtain the historical weight corresponding to each search web page.
In the embodiment of the present invention, calculating the update weight of each search web page according to the Web graph includes:
calculating the number of outgoing links of each search web page in turn, and mapping the number of outgoing links to an out-link weight;
and calculating the ratio of the initial weight to the out-link weight of each search web page to obtain the update weight corresponding to each search web page.
For example, if search web page B has 2 outgoing links, pointing to search web page D and search web page C respectively, the out-link weight is calculated from a preset mapping function, such as a quadratic function or a ReLU function, with the number of outgoing links as the independent variable. The ratio of the initial weight to the out-link weight is then calculated, that is, update weight = initial weight / out-link weight. Finally, the embodiment of the invention adds the update weight to the initial weight to obtain the historical weight corresponding to each search web page. For example, the historical weight of search web page D is obtained by adding the update weight calculated for search web page B to the initial weight of search web page D.
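The weight update in S3 can be illustrated with a single pass over the graph. In this sketch the mapping function is the identity (out-link weight equals the number of outgoing links), which is an assumption chosen for simplicity; the text mentions quadratic and ReLU functions as other options. Each page's update weight is added to the initial weight of every page it links to.

```python
def out_link_weight(out_count):
    """Hypothetical mapping function: identity on the number of outgoing links."""
    return float(out_count)


def historical_weights(graph, initial_weight=1.0):
    """Add each page's update weight to the initial weight of its link targets."""
    weights = {page: initial_weight for page in graph}
    for page, targets in graph.items():
        if not targets:
            continue  # a page without outgoing links passes on no weight
        update = initial_weight / out_link_weight(len(targets))
        for target in targets:
            weights[target] += update
    return weights


# B links to C and D, C links to D, D links to nothing.
graph = {"B": ["C", "D"], "C": ["D"], "D": []}
weights = historical_weights(graph)
print(weights)
```

Here B contributes 0.5 to each of C and D, and C contributes 1.0 to D, so D ends with the largest historical weight.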
S4, sorting the search web page set according to the historical weights, and extracting a preset number of search web pages according to the ranking to obtain a set of web pages to be extracted.
As described above, the access links cause the historical weights of the search web pages to differ, and a search web page with a larger historical weight is more important. The embodiment of the invention therefore sorts the search web page set by historical weight and extracts the top-ranked search web pages, with the preset number being, for example, 10, 20, or 30, to obtain the set of web pages to be extracted.
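The selection in S4 amounts to a descending sort followed by truncation. The sketch below uses hypothetical weights; breaking ties by page name is an assumption, since the source does not specify tie handling.

```python
def top_pages(weights, preset_number):
    """Return the `preset_number` pages with the largest historical weights."""
    ranked = sorted(weights, key=lambda page: (-weights[page], page))
    return ranked[:preset_number]


weights = {"B": 1.0, "C": 1.5, "D": 2.5}
selected = top_pages(weights, preset_number=2)
print(selected)
```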
S5, performing OCR (optical character recognition) on each content to-be-extracted webpage in the content to-be-extracted webpage set to obtain a first to-be-corrected text set.
It can be understood that anti-crawler mechanisms are now well developed, so the text of some web pages cannot be obtained by accessing the web page source code. The embodiment of the invention therefore recognizes web page text through OCR and further performs text correction with the trained text recognition model.
It should be explained that OCR text recognition refers to the process in which an electronic device examines the characters in text images, such as handwritten or printed text, and translates their shapes into computer characters by a character recognition method.
In the embodiment of the present invention, referring to fig. 2, performing OCR recognition on each web page in the set of web pages to be extracted to obtain the first text set to be corrected includes:
S51, scanning the set of web pages to be extracted to obtain a web page image set;
S52, performing text recognition on the web page image set to obtain text information, and calculating the text confidence of the text information;
S53, cleaning the text information according to the text confidence to obtain the first text set to be corrected.
Further, cleaning the text information according to the text confidence to obtain the first text set to be corrected includes:
setting a text confidence threshold;
and when the text confidence is lower than the text confidence threshold, removing the corresponding text, until the first text set to be corrected is obtained.
It can be appreciated that recognizing the web page image set with OCR yields the text corresponding to the set of web pages to be extracted, that is, the first text set to be corrected.
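The confidence-based cleaning in S53 can be sketched as a simple filter. The `OcrLine` record, the sample texts, and the threshold of 0.8 are hypothetical; a real OCR engine would supply the recognized text and its confidence, and the source does not name one.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """Hypothetical OCR result: recognized text plus its confidence in [0, 1]."""
    text: str
    confidence: float


def clean_by_confidence(lines, threshold):
    """Keep only the text whose confidence is not lower than the threshold."""
    return [line.text for line in lines if line.confidence >= threshold]


recognized = [
    OcrLine("Deep learning basics", 0.97),
    OcrLine("l0rem 1psum", 0.41),
    OcrLine("Training data", 0.88),
]
first_text_set = clean_by_confidence(recognized, threshold=0.8)
print(first_text_set)
```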
S6, recognizing the text of the set of web pages to be extracted with the pre-trained text recognition model to obtain a second text set to be corrected.
In the embodiment of the invention, the text recognition model is built on a BERT model. BERT is a pre-trained language representation model. Unlike earlier models, it is not pre-trained with a traditional unidirectional language model or with a shallow concatenation of two unidirectional language models. Instead it uses a masked language model (MLM), which produces deep bidirectional language representations and thereby achieves more accurate text extraction.
It should be emphasized, however, that the text recognition model built on the BERT model must be pre-trained before it can be used to recognize the text of the set of web pages to be extracted. In detail, referring to fig. 3, pre-training the text recognition model includes:
S61, receiving an original text set and an original BERT language model, and performing a masking operation on the original text set according to a preset percentage to obtain a masked text set;
S62, performing classification training on the original BERT language model according to a preset probability by using the masked text set to obtain a trained BERT language model;
S63, fine-tuning the trained BERT language model to obtain the text recognition model.
In the embodiment of the invention, the original text set, also called the training set, is web page text data downloaded and collected in advance from different web pages on the network. The masking operation replaces words in the original text set with mask symbols or other words, so that the original BERT language model can be trained to predict the masked words. In the embodiment of the invention, the preset percentage may be set to 15%: if an original text in the original text set has one hundred words, fifteen of them are randomly replaced with mask symbols or other words.
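The masking operation can be sketched as below. For simplicity this version always substitutes a mask symbol and does not implement replacement with other words; the `[MASK]` string, the function name, and the seeded random generator are assumptions made for the illustration.

```python
import random

MASK = "[MASK]"  # assumed mask symbol


def mask_tokens(tokens, percentage=0.15, seed=None):
    """Replace a fixed percentage of tokens with the mask symbol.

    Returns the masked token list and a dict mapping each masked position
    to the original token, which serves as the prediction target.
    """
    rng = random.Random(seed)
    mask_count = round(len(tokens) * percentage)
    positions = rng.sample(range(len(tokens)), mask_count)
    masked = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = masked[position]
        masked[position] = MASK
    return masked, targets


tokens = [f"w{i}" for i in range(100)]  # a hypothetical one-hundred-word text
masked, targets = mask_tokens(tokens, percentage=0.15, seed=0)
print(masked.count(MASK), "of", len(tokens), "tokens masked")
```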
In the embodiment of the present invention, performing classification training on the original BERT language model according to a preset probability is described in detail in the published BERT paper, and will not be described in detail herein.
Further, fine-tuning the trained BERT language model to obtain the text recognition model includes:
receiving a fine-tuning text set that comprises a correct sentence set and a corresponding error sentence set, inputting the fine-tuning text set into the trained BERT language model, and fine-tuning the trained BERT language model with a pre-built fine-tuning method and the correct sentence set to generate a correct-sentence correct-word fine-tuning model;
masking the error words in the error sentence set to obtain masked error words, extracting the correct words corresponding to the error words from the correct sentence set, setting those correct words as prediction targets, and fine-tuning the trained BERT language model with the prediction targets and the masked error words to obtain an error-sentence error-word fine-tuning model;
masking the correct words in the error sentence set to obtain masked correct words, setting those correct words as prediction targets, and fine-tuning the trained BERT language model with the prediction targets and the masked correct words to obtain an error-sentence correct-word fine-tuning model;
and obtaining the text recognition model based on the correct-sentence correct-word fine-tuning model, the error-sentence error-word fine-tuning model, and the error-sentence correct-word fine-tuning model.
The fine-tuning text set can be selected according to the specific application scenario, and the fine-tuning is supervised training. The fine-tuning text set comprises a correct sentence set and a corresponding error sentence set. For example, a sentence in the correct sentence set is "today is a good day", and the corresponding sentence in the error sentence set is "day of the beauty today". The pre-built fine-tuning method, that is, the original BERT fine-tuning method, refers to the existing fine-tuning method of the BERT language model.
In the embodiment of the invention, to keep the training balanced, the two fine-tuning passes on the error sentences are given the same number of input error sentences. After training on large-scale text, the original BERT model has strong language understanding ability, so only a small fine-tuning text set is needed to fine-tune the trained BERT language model. The resulting model has strong error correction ability and overcomes the shortcomings of traditional error correction models.
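The two kinds of training examples built from error sentences can be sketched as follows. This is an illustration under strong assumptions: the sentence pair is hypothetical and token-aligned (same length, differing in one word), which the source does not require, and the function name and tuple layout are invented for the example. No BERT library is called; only the data preparation is shown.

```python
MASK = "[MASK]"  # assumed mask symbol


def build_examples(correct_tokens, error_tokens, mask_errors):
    """Build (masked sentence, masked position, target word) examples.

    mask_errors=True masks the wrong words of the error sentence;
    mask_errors=False masks the words that are already correct.
    In both cases the target is the word from the correct sentence.
    """
    if len(correct_tokens) != len(error_tokens):
        raise ValueError("this sketch assumes token-aligned sentence pairs")
    examples = []
    for position, (good, seen) in enumerate(zip(correct_tokens, error_tokens)):
        is_error = good != seen
        if is_error == mask_errors:
            masked = list(error_tokens)
            masked[position] = MASK
            examples.append((masked, position, good))
    return examples


# Hypothetical aligned pair; "god" stands in for a recognition error.
correct = "today is a good day".split()
error = "today is a god day".split()
error_word_examples = build_examples(correct, error, mask_errors=True)
correct_word_examples = build_examples(correct, error, mask_errors=False)
print(error_word_examples)
```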
In detail, in the embodiment of the invention, fine-tuning adjusts the internal parameters of the trained BERT language model based on the fine-tuning text set in combination with a loss function. Further, the loss function may be a squared-error loss function.
S7, correcting the first text set to be corrected and the second text set to be corrected to obtain the web page content.
Under the dual recognition of the search web pages by OCR and the text recognition model, the first text set to be corrected and the second text set to be corrected are obtained. The differences between the two sets are then identified and sent for manual correction in a conspicuous form such as highlighting, yielding the web page content corresponding to the search web page set.
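The comparison step in S7 can be sketched with Python's standard `difflib` module, which is a tool chosen for this illustration and not named in the source. The function returns the word spans that differ between the two recognition results so that they can be highlighted for manual correction; the sample texts are hypothetical.

```python
import difflib


def find_differences(first_text, second_text):
    """Return (first, second) word spans where the two texts disagree."""
    first_words = first_text.split()
    second_words = second_text.split()
    matcher = difflib.SequenceMatcher(a=first_words, b=second_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(first_words[i1:i2]), " ".join(second_words[j1:j2]))
            )
    return differences


ocr_text = "today is a god day"      # hypothetical OCR output
model_text = "today is a good day"   # hypothetical model output
differences = find_differences(ocr_text, model_text)
print(differences)
```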
Fig. 4 is a functional block diagram of a web content extraction device based on page weighting according to an embodiment of the present invention.
The web page content extraction device 100 based on page weighting can be installed in an electronic device. Depending on the functions implemented, the web page content extraction device 100 based on page weighting may include a search web page construction module 101, a historical weight calculation module 102, a web page ranking module 103, an OCR recognition module 104, and a web page content extraction module 105. A module in the present invention, which may also be called a unit, is a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.
The search web page construction module 101 is configured to start a search engine and receive keywords, and search a search web page set related to the keywords in the search engine;
The historical weight calculation module 102 is configured to set the same initial weight for each search Web page in the search Web page set, construct a Web graph according to access links included in each search Web page in the search Web page set, sequentially calculate an update weight of each search Web page according to the Web graph, and update the initial weight according to the update weight to obtain a historical weight corresponding to each search Web page;
The web page ranking module 103 is configured to sort the search web page set according to the historical weights, and extract a preset number of search web pages according to the ranking, so as to obtain a content to-be-extracted web page set;
The OCR recognition module 104 is configured to perform OCR recognition on each content to-be-extracted web page in the content to-be-extracted web page set to obtain a first text set to be corrected;
The web page content extraction module 105 is configured to recognize the text of the content to-be-extracted web page set by using a pre-trained text recognition model to obtain a second text set to be corrected, and to perform correction on the first text set to be corrected and the second text set to be corrected, so as to obtain the web page content.
In detail, the modules in the web page content extraction device 100 based on page weighting in the embodiment of the present invention use the same technical means as the web page content extraction method based on page weighting described in Fig. 1 and produce the same technical effects, which are not repeated here.
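As a rough illustration of how the five modules could cooperate, the following Python sketch wires them together as interchangeable callables. The class name `PageWeightedExtractor`, its field names, and the function signatures are all hypothetical; the sketch shows only the data flow between modules 101 to 105, not an actual implementation of any of them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PageWeightedExtractor:
    search: Callable[[str], List[str]]              # role of module 101
    weigh: Callable[[List[str]], Dict[str, float]]  # role of module 102
    ocr: Callable[[str], str]                       # role of module 104
    recognize: Callable[[str], str]                 # text recognition model (module 105)
    correct: Callable[[str, str], str]              # correction step (module 105)
    top_n: int = 10                                 # preset number of pages to keep

    def run(self, keyword: str) -> Dict[str, str]:
        pages = self.search(keyword)
        weights = self.weigh(pages)
        # Role of module 103: sort by historical weight and keep the top pages.
        ranked = sorted(pages, key=lambda p: weights[p], reverse=True)[: self.top_n]
        first_set = {p: self.ocr(p) for p in ranked}         # first text set
        second_set = {p: self.recognize(p) for p in ranked}  # second text set
        return {p: self.correct(first_set[p], second_set[p]) for p in ranked}
```

Passing the modules in as callables keeps each stage replaceable, which matches the statement that the module division is a logical one.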
Fig. 5 is a schematic structural diagram of an electronic device for implementing a method for extracting web page content based on page weighting according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program stored in the memory 11 and executable on the processor 10, such as a web page content extraction method program 12 based on page weighting.
The memory 11 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the web page content extraction method program 12 based on page weighting, but also to temporarily store data that has been output or is to be output.
In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device: it connects the components of the entire electronic device using various interfaces and lines, runs or executes the programs or modules stored in the memory 11 (e.g., the web page content extraction method program based on page weighting), and invokes the data stored in the memory 11 to perform the various functions of the electronic device 1 and to process data.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection and communication between the memory 11, the at least one processor 10, and other components.
Fig. 5 shows only an electronic device with certain components. A person skilled in the art will understand that the structure shown in Fig. 5 does not limit the electronic device 1, which may comprise fewer or more components than shown, combine certain components, or use a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component. Preferably, the power source is logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power source may also include one or more direct current or alternating current power supplies, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and other such components. The electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which are not described here.
Further, the electronic device 1 may also comprise a network interface. Optionally, the network interface may comprise a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), and is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a display or an input unit such as a keyboard; the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used to display the information processed in the electronic device 1 and to display a visual user interface.
It should be understood that the described embodiments are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The page-weighted web content extraction method program 12 stored in the memory 11 of the electronic device 1 is a combination of a plurality of instructions, which when executed in the processor 10, may implement:
starting a search engine and receiving keywords, and searching a search webpage set related to the keywords in the search engine;
setting the same initial weight for each search webpage in the search webpage set, and constructing a Web graph according to access links included in each search webpage in the search webpage set;
According to the Web graph, sequentially calculating the update weight of each search webpage, and updating the initial weight according to the update weight to obtain the historical weight corresponding to each search webpage;
Sorting the search webpage set according to the historical weights, and extracting a preset number of search webpages according to the ranking to obtain a content to-be-extracted webpage set;
Performing OCR (optical character recognition) on each content to-be-extracted webpage in the content to-be-extracted webpage set to obtain a first text set to be corrected;
Recognizing the text of the content to-be-extracted webpage set by using the pre-trained text recognition model to obtain a second text set to be corrected;
and correcting the first text set to be corrected and the second text set to be corrected to obtain the webpage content.
Specifically, for the specific implementation of the above instructions by the processor 10, refer to the descriptions of the related steps in the embodiments corresponding to Figs. 1 to 5, which are not repeated here.
Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as separate products, they may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
starting a search engine and receiving keywords, and searching a search webpage set related to the keywords in the search engine;
setting the same initial weight for each search webpage in the search webpage set, and constructing a Web graph according to access links included in each search webpage in the search webpage set;
According to the Web graph, sequentially calculating the update weight of each search webpage, and updating the initial weight according to the update weight to obtain the historical weight corresponding to each search webpage;
Sorting the search webpage set according to the historical weights, and extracting a preset number of search webpages according to the ranking to obtain a content to-be-extracted webpage set;
Performing OCR (optical character recognition) on each content to-be-extracted webpage in the content to-be-extracted webpage set to obtain a first text set to be corrected;
Recognizing the text of the content to-be-extracted webpage set by using the pre-trained text recognition model to obtain a second text set to be corrected;
and correcting the first text set to be corrected and the second text set to be corrected to obtain the webpage content.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims may also be implemented by a single unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (8)

1. The webpage content extraction method based on the page weighting is characterized by comprising the following steps:
Starting a search engine and receiving keywords, and searching a search webpage set related to the keywords in the search engine;
The searching the search engine for the search web page set related to the keyword comprises: searching out a webpage database corresponding to the search engine, and extracting webpage labels included in the webpage database to obtain a plurality of groups of webpage label sets; calculating the text distance between the keyword and each webpage label set; screening the search web pages with the text distance smaller than a specified threshold value to obtain the search web page set;
the extracting the webpage labels included in the webpage database to obtain a plurality of sets of webpage label sets includes: extracting webpage keywords of each webpage in the webpage database in sequence to obtain a webpage keyword set; performing stop word removal processing on each webpage keyword in the webpage keyword set to obtain a core keyword set; recombining each core keyword to obtain a webpage label set corresponding to the webpage;
setting the same initial weight for each search webpage in the search webpage set, and constructing a Web graph according to access links included in each search webpage in the search webpage set;
According to the Web graph, sequentially calculating the update weight of each search webpage, and updating the initial weight according to the update weight to obtain the historical weight corresponding to each search webpage;
The step of calculating the update weight of each search webpage in turn according to the Web graph comprises: calculating the number of out-links of each search webpage in turn, and mapping the number of out-links to an out-link weight; calculating the ratio of the out-link weight to the initial weight of each search webpage to obtain the update weight corresponding to each search webpage;
taking the number of out-links as the independent variable of a preset mapping function to obtain the out-link weight, calculating the ratio of the out-link weight to the initial weight to obtain the update weight, and adding the update weight to the initial weight to obtain the historical weight corresponding to each search webpage;
Sorting the search webpage set according to the historical weights, and extracting a preset number of search webpages according to the ranking to obtain a content to-be-extracted webpage set;
Performing OCR (optical character recognition) on each content to-be-extracted webpage in the content to-be-extracted webpage set to obtain a first text set to be corrected;
Recognizing the text of the content to-be-extracted webpage set by using the pre-trained text recognition model to obtain a second text set to be corrected;
and correcting the first text set to be corrected and the second text set to be corrected to obtain the webpage content.
2. The method for extracting webpage content based on page weighting according to claim 1, wherein performing OCR recognition on each content to-be-extracted webpage in the content to-be-extracted webpage set to obtain a first text set to be corrected comprises:
Scanning the content to-be-extracted webpage set to obtain a webpage image set;
Performing text recognition on the webpage image set to obtain text information, and calculating the text confidence of the text information;
and clearing the text information according to the text confidence to obtain the first text set to be corrected.
3. The method for extracting webpage content based on page weighting according to claim 2, wherein clearing the text information according to the text confidence to obtain the first text set to be corrected comprises:
Setting a text confidence threshold;
And when the text confidence is lower than the text confidence threshold, eliminating the corresponding text, until the first text set to be corrected is obtained.
4. The method for extracting web page content based on page weighting as claimed in claim 1, wherein the pre-trained text recognition model comprises:
Receiving an original text set and an original BERT language model, and executing masking operation on the original text set according to a preset percentage to obtain a masking text set;
performing classification training on the original BERT language model according to a preset probability by using the mask text set to obtain a trained BERT language model;
and fine-tuning the trained BERT language model to obtain the text recognition model.
5. The method for extracting web page content based on page weighting as recited in claim 4, wherein said fine tuning the trained BERT language model to obtain the text recognition model comprises:
Receiving a fine-tuning text set, wherein the fine-tuning text set comprises a correct sentence set and a corresponding error sentence set, inputting the fine-tuning text set into the trained BERT language model, and fine-tuning the trained BERT language model by using a pre-built fine-tuning method and the correct sentence set to generate a correct-sentence correct-word fine-tuning model;
Masking the error words in the error sentence set to obtain masked error words, extracting the correct words corresponding to the error words from the correct sentence set, setting the correct words as prediction targets, and fine-tuning the trained BERT language model by using the prediction targets and the masked error words to obtain an error-sentence error-word fine-tuning model;
Masking the correct words in the error sentence set to obtain masked correct words, setting the correct words as prediction targets, and fine-tuning the trained BERT language model by using the prediction targets and the masked correct words to obtain an error-sentence correct-word fine-tuning model;
And obtaining the text recognition model based on the correct-sentence correct-word fine-tuning model, the error-sentence error-word fine-tuning model and the error-sentence correct-word fine-tuning model.
6. A web content extraction apparatus based on page weighting, the apparatus comprising:
The search web page construction module is used for starting a search engine and receiving keywords, and searching a search web page set related to the keywords in the search engine;
The searching the search engine for the search web page set related to the keyword comprises: searching out a webpage database corresponding to the search engine, and extracting webpage labels included in the webpage database to obtain a plurality of groups of webpage label sets; calculating the text distance between the keyword and each webpage label set; screening the search web pages with the text distance smaller than a specified threshold value to obtain the search web page set;
the extracting the webpage labels included in the webpage database to obtain a plurality of sets of webpage label sets includes: extracting webpage keywords of each webpage in the webpage database in sequence to obtain a webpage keyword set; performing stop word removal processing on each webpage keyword in the webpage keyword set to obtain a core keyword set; recombining each core keyword to obtain a webpage label set corresponding to the webpage;
the historical weight calculation module is used for setting the same initial weight for each search webpage in the search webpage set, constructing a Web graph according to access links included in each search webpage in the search webpage set, sequentially calculating the update weight of each search webpage according to the Web graph, and updating the initial weight according to the update weight to obtain the historical weight corresponding to each search webpage;
The step of calculating the update weight of each search webpage in turn according to the Web graph comprises: calculating the number of out-links of each search webpage in turn, and mapping the number of out-links to an out-link weight; calculating the ratio of the out-link weight to the initial weight of each search webpage to obtain the update weight corresponding to each search webpage;
taking the number of out-links as the independent variable of a preset mapping function to obtain the out-link weight, calculating the ratio of the out-link weight to the initial weight to obtain the update weight, and adding the update weight to the initial weight to obtain the historical weight corresponding to each search webpage;
The webpage ranking module is used for sorting the search webpage set according to the historical weights, and extracting a preset number of search webpages according to the ranking to obtain a content to-be-extracted webpage set;
the OCR recognition module is used for performing OCR recognition on each content to-be-extracted webpage in the content to-be-extracted webpage set to obtain a first text set to be corrected;
And the webpage content extraction module is used for recognizing the text of the content to-be-extracted webpage set by using the pre-trained text recognition model to obtain a second text set to be corrected, and correcting the first text set to be corrected and the second text set to be corrected to obtain the webpage content.
7. An electronic device, the electronic device comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the page weighting based web page content extraction method of any one of claims 1 to 5.
8. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the page weighting based web page content extraction method according to any one of claims 1 to 5.
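For illustration only, the search step recited in claim 1 (building webpage label sets by stop-word removal, computing a text distance to the keyword, and screening by a threshold) might be sketched as below. The claims do not specify the text distance, so Jaccard distance over lower-cased words is used here as an assumption; the stop-word list, the function names and the threshold value are likewise hypothetical.

```python
# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}


def build_label_set(page_keywords):
    """Remove stop words from a page's keywords and return the label set."""
    return {w.lower() for w in page_keywords if w.lower() not in STOP_WORDS}


def text_distance(keyword, label_set):
    """Jaccard distance between the query words and a label set (assumed metric)."""
    query = build_label_set(keyword.split())
    union = query | label_set
    if not union:
        return 1.0  # nothing to compare: treat as maximally distant
    return 1.0 - len(query & label_set) / len(union)


def screen_pages(keyword, keywords_by_url, threshold=0.8):
    """Keep the pages whose label set is closer to the keyword than the threshold."""
    selected = []
    for url, page_keywords in keywords_by_url.items():
        if text_distance(keyword, build_label_set(page_keywords)) < threshold:
            selected.append(url)
    return selected
```

With pages `{"u1": ["the", "solar", "panel", "cost"], "u2": ["cooking", "recipes"]}` and the keyword `"solar panel"`, only `"u1"` passes the screening under these assumptions.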
CN202210184453.9A 2022-02-27 2022-02-27 Webpage content extraction method and device based on page weighting and electronic equipment Active CN115525730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210184453.9A CN115525730B (en) 2022-02-27 2022-02-27 Webpage content extraction method and device based on page weighting and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210184453.9A CN115525730B (en) 2022-02-27 2022-02-27 Webpage content extraction method and device based on page weighting and electronic equipment

Publications (2)

Publication Number Publication Date
CN115525730A CN115525730A (en) 2022-12-27
CN115525730B true CN115525730B (en) 2024-04-19

Family

ID=84693449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210184453.9A Active CN115525730B (en) 2022-02-27 2022-02-27 Webpage content extraction method and device based on page weighting and electronic equipment

Country Status (1)

Country Link
CN (1) CN115525730B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807213A (en) * 2010-05-11 2010-08-18 天津大学 Method for vertical search of webpage
CN107798070A (en) * 2017-09-26 2018-03-13 平安普惠企业管理有限公司 A kind of web data acquisition methods and terminal device
CN113095067A (en) * 2021-03-03 2021-07-09 北京邮电大学 OCR error correction method, device, electronic equipment and storage medium
CN113449168A (en) * 2021-07-14 2021-09-28 北京锐安科技有限公司 Method, device and equipment for capturing theme webpage data and storage medium
CN113850251A (en) * 2021-09-16 2021-12-28 多益网络有限公司 Text correction method, device and equipment based on OCR technology and storage medium

Also Published As

Publication number Publication date
CN115525730A (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN112364170B (en) Data emotion analysis method and device, electronic equipment and medium
CN113378970B (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN116701574A (en) Text semantic similarity calculation method, device, equipment and storage medium
CN113344125B (en) Long text matching recognition method and device, electronic equipment and storage medium
CN115238115A (en) Image retrieval method, device and equipment based on Chinese data and storage medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN112632264A (en) Intelligent question and answer method and device, electronic equipment and storage medium
CN112579781A (en) Text classification method and device, electronic equipment and medium
CN116578696A (en) Text abstract generation method, device, equipment and storage medium
CN115525730B (en) Webpage content extraction method and device based on page weighting and electronic equipment
WO2023178798A1 (en) Image classification method and apparatus, and device and medium
CN113515591B (en) Text defect information identification method and device, electronic equipment and storage medium
CN112529743B (en) Contract element extraction method, device, electronic equipment and medium
CN114943306A (en) Intention classification method, device, equipment and storage medium
CN114676307A (en) Ranking model training method, device, equipment and medium based on user retrieval
CN114385815A (en) News screening method, device, equipment and storage medium based on business requirements
CN112632260A (en) Intelligent question and answer method and device, electronic equipment and computer readable storage medium
CN114462411B (en) Named entity recognition method, device, equipment and storage medium
CN115146596B (en) Recall text generation method and device, electronic equipment and storage medium
CN115146627B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN113157865B (en) Cross-language word vector generation method and device, electronic equipment and storage medium
CN115525731B (en) Webpage weight calculation method and device based on improved pagerank algorithm and electronic equipment
CN116662513A (en) Text searching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20240326

Address after: 276002, 19th Floor, Building 10, Evergrande Huafu, 100 meters north of the intersection of Xiaohe Road and Chengdu Road, Lanshan District, Linyi City, Shandong Province

Applicant after: Shandong Vision Digital Technology Co.,Ltd.

Country or region after: China

Address before: 315048 Building A2-3, East Zone, New Materials (International) Innovation Center, No. 2660, Yongjiang Avenue, High tech Zone, Ningbo, Zhejiang

Applicant before: Bocai Hui (Ningbo) Information Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant
GR01 Patent grant