CN111931113A - Data cleaning method and related equipment - Google Patents

Info

Publication number
CN111931113A
Authority
CN
China
Prior art keywords
text
link
data
webpage
html
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010971243.5A
Other languages
Chinese (zh)
Other versions
CN111931113B (en)
Inventor
李超
徐国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010971243.5A
Publication of CN111931113A
Application granted
Publication of CN111931113B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the application discloses a data cleaning method and related equipment, which are applied to the technical field of data processing of medical systems, and the method comprises the following steps: crawling webpage data and extracting HTML texts from the webpage data; deleting a designated code block in the HTML text to obtain a first text; replacing the line feed label in the first text with a line feed character to obtain a second text; deleting the part of the second text except the target attribute value in the content corresponding to the link label to reserve the link in the second text and obtain a third text comprising the link; and if the link in the third text is detected to be complete, deleting the HTML tag and the continuous line feed character in the third text to obtain the target data after the webpage data are cleaned. The method can enable the webpage data after data cleaning to still keep the paragraph format and the link position of the text content in the original webpage, and is convenient for the medical system to quickly locate the required content or link from the extracted target content in the follow-up process.

Description

Data cleaning method and related equipment
Technical Field
The application relates to the technical field of digital medical treatment, is applied to the technical field of data processing, and particularly relates to a data cleaning method and related equipment.
Background
Data cleaning is the process of re-examining and verifying data in order to delete duplicate information, correct existing errors, and ensure data consistency. The data in a data warehouse is a collection of data oriented to a certain subject; it is extracted from multiple business systems and contains historical data. It is therefore unavoidable that some of the data are wrong and some conflict with each other. Such wrong or conflicting data are obviously unwanted and are called "dirty data". Dirty data must be "washed" according to certain rules, and this is data cleaning.
When a medical system crawls public data through a web crawler, Hypertext Markup Language (HTML) pages in the data to be searched may need to be cleaned, and the paragraph format, paragraphs, picture links, hyperlinks, and the like must be accurately extracted from the HTML pages.
Disclosure of Invention
The embodiment of the application provides a data cleaning method and related equipment, which enable the webpage data after data cleaning to still keep the paragraph format of the text content in the original webpage and also keep the position information of each link, so that the position of the link in the cleaned webpage data is consistent with that in the original webpage. This makes it convenient for the medical system to subsequently locate the required content or link quickly in the extracted target content.
In a first aspect, an embodiment of the present application provides a data cleansing method, where the method includes:
crawling webpage data and extracting HTML texts from the webpage data;
deleting the designated code block in the HTML text to obtain a first text;
replacing the line feed label in the first text with a line feed character to obtain a second text;
deleting the part of the second text except the target attribute value in the content corresponding to the link label to reserve the link in the second text, so as to obtain a third text comprising the link;
and if the link in the third text is detected to be complete, deleting the HTML tag and the continuous line feed character in the third text to obtain the target data after the webpage data are cleaned.
In an optional embodiment, the method further comprises: judging whether a field corresponding to the first position of the link in the third text is a first preset link field; if yes, judging that the link in the third text is complete; if not, judging that the link in the third text is incomplete.
In an optional embodiment, after the determining that the link in the third text is incomplete, the method further includes: acquiring a webpage link of a webpage corresponding to the webpage data, and judging whether a field corresponding to a second position in the webpage link is a second preset link field; if so, extracting a network protocol from the webpage link, and performing completion processing on the link based on the network protocol.
In an optional embodiment, after the determining that the link in the third text is incomplete, the method further includes: if the field corresponding to the second position in the webpage link is judged not to be the second preset link field, judging whether the field corresponding to the third position in the webpage link is a third preset link field or not; if so, counting the occurrence frequency N of the third preset link field in the link of the third text, wherein N is an integer greater than 1; deleting appointed partial fields before and after the last N third preset link fields from the webpage link to obtain a basic link; and performing completion processing on the link based on the basic link.
In an optional embodiment, the deleting a specified code block in the HTML text to obtain a first text includes: filtering characters matched with a pre-established filtering character list in the HTML text; and deleting the specified code block from the HTML text after the character filtering to obtain a first text.
In an alternative embodiment, extracting HTML text from the web page data includes: and screening target content data from the webpage data through a regular expression, and extracting an HTML text from the target content data.
In an optional embodiment, after deleting the HTML tag and the consecutive line breaks in the third text to obtain the target data after the web page data is cleaned, the method further includes: and based on a field processing rule, carrying out standardization processing on the field of the target type in the target data.
In a second aspect, the present application provides a data cleansing apparatus including means for performing the method of the first aspect.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a network interface, and a memory, where the processor, the network interface, and the memory are connected to each other, where the network interface is controlled by the processor to send and receive messages, and the memory is used to store a computer program that supports the electronic device to execute the above method, where the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.
In the embodiment of the application, after the webpage data to be cleaned is crawled, the specified code blocks are deleted from the HTML text corresponding to the webpage data; that is, unnecessary data such as advertisements and picture introductions at the end of the document are cleaned out. The labels with line feed meaning in the HTML text of the cleaned webpage data are replaced with line feed characters, so that the webpage data after data cleaning still keeps the paragraph format of the text content in the original webpage. In addition, for the HTML text of the cleaned webpage data, the part other than the target attribute value in the content corresponding to the link tag is deleted so as to retain the links, so that the position of each link in the cleaned webpage data is consistent with that in the original webpage. This makes it convenient for the medical system to subsequently locate the required content or link quickly in the extracted target content.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a data cleansing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram of another data cleansing method provided in the embodiments of the present application;
FIG. 3 is a flowchart illustrating a detailed process of the data cleansing method shown in FIG. 2 after determining that the link in the third text is incomplete;
FIG. 4 is a schematic flow chart illustrating another data cleansing method according to an embodiment of the present disclosure;
FIG. 5 is a schematic block diagram of a data cleansing apparatus according to an embodiment of the present application;
fig. 6 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Data cleaning is the process of re-examining and verifying data in order to delete duplicate information, correct existing errors, and ensure data consistency. The data in a data warehouse is a collection of data oriented to a certain subject; it is extracted from multiple business systems and contains historical data. It is therefore unavoidable that some of the data are wrong and some conflict with each other. Such wrong or conflicting data are obviously unwanted and are called "dirty data". Dirty data must be "washed" according to certain rules, and this is data cleaning.
In conventional data cleaning, when public data is crawled by a web crawler, Hypertext Markup Language (HTML) pages in the required data may need to be cleaned: the paragraph format, paragraphs, picture links, hyperlinks, and the like are accurately extracted from the HTML pages, and other unnecessary HTML code blocks are deleted.
Referring to fig. 1, fig. 1 is a schematic flow chart of a data cleansing method provided in an embodiment of the present application, and as shown in the drawing, the data cleansing method may include:
101. crawling web page data and extracting HTML text from the web page data.
In the embodiment of the application, crawling the webpage data means using crawler software to crawl the public data of the webpage on which data cleaning is to be performed. Extracting HTML text from the webpage data means searching for the HTML text in the public data based on the XML Path Language (XPath).
XPath is a language for finding information in XML text. It was originally intended for searching XML text, but it is equally applicable to searching HTML text.
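The XPath lookup described above can be sketched with Python's standard library alone. This is only an illustration under stated assumptions: `xml.etree.ElementTree` supports a limited XPath subset and requires well-formed markup, and the sample page and the `content` id are hypothetical. Real crawled HTML would normally need a more tolerant HTML parser, which the text does not name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real pages are rarely this tidy.
page = (
    "<html><body>"
    "<div id='nav'>Menu</div>"
    "<div id='content'><p>Body text</p></div>"
    "</body></html>"
)

root = ET.fromstring(page)

# XPath-style query: the first <div> anywhere in the tree whose id is "content".
node = root.find(".//div[@id='content']")

# Serialize the selected subtree back to markup; this plays the role of the HTML text.
html_text = ET.tostring(node, encoding="unicode")
print(html_text)
```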
102. And deleting the specified code block in the HTML text to obtain the first text.
In the embodiment of the present application, the specified HTML code blocks may include unneeded HTML code blocks such as Cascading Style Sheets (CSS) code blocks, JavaScript (JS) code blocks, comment code blocks, and the like.
It can be understood that after the unnecessary specified code blocks in the HTML text are deleted, the HTML text is subjected to data cleaning, and a first text is obtained, and the first text is also the HTML text.
In the embodiment of the present application, the specified HTML code blocks can be deleted by regular expressions. For example, a regular expression (given as an image in the original publication and not reproduced here) is used to delete CSS code blocks such as "<style type="text/css">.div{margin:auto}...</style>". Other HTML code blocks are deleted in a similar way, which is not described here again.
It can be understood that the designated HTML code block is deleted in a regular expression mode, so that the method can be flexibly applied to any scene, and the universality is high. The specified HTML code block may further include a custom HTML code block. And in combination with the actual application scene, the user can delete the self-defined HTML code blocks such as advertisements and picture introduction in the HTML text through the self-defined regular expression list.
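As a minimal sketch of regular-expression-based block deletion, the following uses Python's `re` module. The patterns, the names `BLOCK_PATTERNS` and `delete_code_blocks`, and the advertisement pattern in the usage lines are all assumptions made for illustration; they are not the expression used in the application.

```python
import re

# Hypothetical patterns for the block types named in the text.
BLOCK_PATTERNS = [
    re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL),    # CSS blocks
    re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL),  # JS blocks
    re.compile(r"<!--.*?-->", re.DOTALL),                                       # comment blocks
]


def delete_code_blocks(html_text, extra_patterns=()):
    """Remove every match of the built-in and user-supplied patterns."""
    for pattern in list(BLOCK_PATTERNS) + list(extra_patterns):
        html_text = pattern.sub("", html_text)
    return html_text


sample = (
    '<style type="text/css">.div{margin:auto}</style>'
    "<p>Body</p><!-- ad --><script>var a=1;</script>"
)
first_text = delete_code_blocks(sample)
print(first_text)

# A user-defined list can target custom blocks such as advertisements.
ad_pattern = re.compile(r'<div class="ad">.*?</div>', re.DOTALL)
print(delete_code_blocks('<div class="ad">x</div><p>y</p>', [ad_pattern]))
```

Non-greedy matching with `re.DOTALL` lets one pattern span multi-line blocks without swallowing the text between two separate blocks.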
103. And replacing the line feed label in the first text with a line feed character to obtain a second text.
In the embodiment of the present application, each "line feed label" in the first text, that is, each label that implies a line break, is replaced with a line feed character such as "\n". In this way, the paragraph format of the text content in the original webpage is retained.
Specifically, the "line feed labels" include "p", "li", "tr", "br", and the like. For example, for "<p>This is a paragraph of text……</p>", replacing the "p" labels yields "\nThis is a paragraph of text……\n".
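The replacement can be sketched as a single regular-expression substitution. The tag set comes from the text; the pattern itself and the names `LINE_FEED_RE` and `replace_line_feed_tags` are illustrative assumptions.

```python
import re

LINE_FEED_TAGS = ("p", "li", "tr", "br")  # labels named in the text; the set can be extended

# Matches opening, closing, and self-closing forms, e.g. <p>, </p>, <br/>, <p class="x">.
# The \b keeps <p> from matching <pre> and <li> from matching <link>.
LINE_FEED_RE = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(LINE_FEED_TAGS), re.IGNORECASE
)


def replace_line_feed_tags(first_text):
    """Replace every line feed label with a single line feed character."""
    return LINE_FEED_RE.sub("\n", first_text)


second_text = replace_line_feed_tags("<p>This is a paragraph of text</p>")
print(repr(second_text))
```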
104. And deleting the part of the second text except the target attribute value in the content corresponding to the link label to reserve the link in the second text, so as to obtain a third text comprising the link. Wherein, the link label refers to the <a> tag, and the target attribute value is the value of the href attribute of the <a> tag.
In the embodiment of the application, the target attribute value in the content corresponding to the link tag is extracted from the second text obtained after the line feed labels are replaced; that is, everything in the content corresponding to the <a> tag except the href attribute value is deleted, so that the hyperlink is retained. For example, for an <a> tag whose href attribute value is http://baidu.com/, the rest of the content corresponding to the <a> tag is deleted and only the href attribute value http://baidu.com/ is kept at the original position in the text. Picture links are retained in the same way, which is not described here again. In this way, the original position information of the link is preserved.
The href attribute of the <a> tag specifies the Uniform Resource Locator (URL) of the hyperlink target. The href attribute value may be a relative or absolute URL of any valid document, and may include a fragment identifier or a JS code segment.
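A minimal sketch of this step replaces each whole `<a>…</a>` element with its href value, so the link stays at the same position in the text. The pattern and the name `keep_link_targets` are assumptions for illustration, and only quoted attribute values are handled.

```python
import re

# Assumes a double- or single-quoted href; unquoted attribute values are not handled.
A_TAG_RE = re.compile(
    r"<a\b[^>]*?\bhref\s*=\s*([\"'])(.*?)\1[^>]*>.*?</a\s*>",
    re.IGNORECASE | re.DOTALL,
)


def keep_link_targets(second_text):
    """Replace each <a ...>...</a> element with the value of its href attribute."""
    return A_TAG_RE.sub(lambda match: match.group(2), second_text)


third_text = keep_link_targets(
    'See <a href="http://baidu.com/" target="_blank">Baidu</a> now'
)
print(third_text)
```

An `<a>` tag without an href attribute does not match and is left for the later tag-deletion step.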
105. And if the link in the third text is detected to be complete, deleting the HTML tag and the continuous line feed character in the third text to obtain the target data after the webpage data are cleaned.
In this embodiment of the application, the third text is the HTML text after the line break has been replaced and the target attribute value in the content corresponding to the link tag is extracted. And deleting the HTML tags and the continuous line feed characters in the third text to obtain the target data after the webpage data are cleaned, namely the webpage public data cleaned by the data cleaning method.
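The final step can be sketched as two substitutions: one that drops any remaining tag and one that collapses runs of line feed characters. The tag pattern is deliberately simplified, and trimming line feeds at the very start and end is an added assumption, not something the text states.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")      # any remaining HTML tag (simplified)
NEWLINES_RE = re.compile(r"\n{2,}")  # two or more consecutive line feed characters


def finish_cleaning(third_text):
    """Delete remaining tags, then collapse consecutive line feeds into one."""
    text = TAG_RE.sub("", third_text)
    text = NEWLINES_RE.sub("\n", text)
    return text.strip("\n")  # assumption: leading and trailing line feeds are not wanted


target_data = finish_cleaning("\n<div>Title</div>\n\n\nSee http://baidu.com/ now\n")
print(target_data)
```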
In the embodiment of the present application, the execution order of steps 103 and 104 is not limited, and step 103 may be executed first and then step 104 is executed, or step 104 may be executed first and then step 103 is executed, or step 103 and step 104 are executed simultaneously.
It can be understood that, in the data cleaning method disclosed in the embodiment of the present application, after the HTML text corresponding to the public data of the webpage to be cleaned is acquired, unnecessary specified code blocks, such as advertisements and picture introductions at the end of the document, are deleted to obtain the first text. At this point, the first text does not preserve the paragraph format or the position information of the links, which hinders subsequent use. Therefore, the line feed labels in the first text are replaced so as to keep the paragraph format of the body content in the original webpage, and the target attribute value in the content corresponding to the link label is then extracted so as to retain the original position information of the link. As a result, the webpage data cleaned by the data cleaning method still keeps the paragraph format of the text content in the original webpage and also keeps the position information of the links, so that the medical system can subsequently locate the required content or link quickly in the extracted target content.
In an alternative embodiment, extracting HTML text from web page data includes: and screening target content data from the webpage data through the regular expression, and extracting an HTML text from the target content data.
In the embodiment of the application, a user can individually screen out the part needing data cleaning from the webpage data through the regular expression to clean, and other parts not needing data cleaning in the webpage data are continuously reserved.
In an alternative embodiment, deleting a specified code block in the HTML text to obtain the first text includes: filtering characters matched with a pre-established filtering character list in the HTML text; and deleting the specified code block from the HTML text after the character filtering to obtain a first text.
In the embodiment of the present application, characters may be filtered from the HTML text extracted from the webpage data, and the subsequent steps 103, 104, and 105 are performed on the public data after character filtering. The filtered characters may include HTML characters and specified characters, such as characters beginning with "\u" or "\x". The characters may be filtered as follows: first, HTML characters are replaced with the corresponding UTF-8 characters or empty strings according to an HTML filtering character correspondence table; then, common characters beginning with "\u" or "\x", such as "\u2000" and "\x7f", are collected from actual projects to build a filtering character list, namely the pre-established filtering character list; finally, after the public data is crawled, the characters in the public data that match the filtering character list are filtered out.
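The two filtering stages can be sketched as follows. Python's `html.unescape` stands in for the HTML character correspondence table, which is an assumption, and `FILTER_CHARS` is a hypothetical list containing only the two characters named in the text.

```python
import html

# Hypothetical pre-established filtering character list.
FILTER_CHARS = ["\u2000", "\x7f"]

# Translation table that maps every listed character to None, i.e. deletes it.
_FILTER_TABLE = dict.fromkeys(map(ord, FILTER_CHARS))


def filter_characters(html_text):
    """Decode HTML character references, then drop the listed characters."""
    decoded = html.unescape(html_text)  # e.g. "&amp;" becomes "&"
    return decoded.translate(_FILTER_TABLE)


filtered = filter_characters("A&amp;B\u2000C\x7f")
print(filtered)
```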
Referring to fig. 2, fig. 2 is a schematic flow chart of another data cleansing method provided in an embodiment of the present application, and as shown in the figure, the data cleansing method may further include, in addition to the steps shown in fig. 1:
205. judging whether a field corresponding to the first position of the link in the third text is a first preset link field; if yes, judging that the link in the third text is complete; if not, judging that the link in the third text is incomplete.
It can be understood that in the embodiment of the present application, it is necessary to determine whether the link in the third text is complete, and if it is detected that the link in the third text is complete, the HTML tag and the continuous line break in the third text are deleted, so as to obtain the target data after the web page data is cleaned; if the link in the third text is detected to be incomplete, a corresponding completion step should be executed to complete the link.
In the embodiment of the present application, a specific implementation manner of detecting whether the link is complete is as follows:
judging whether a field corresponding to the first position of the link in the third text is a first preset link field; if yes, judging that the link in the third text is complete; if not, judging that the link in the third text is incomplete.
In the embodiment of the application, the first preset link field may be a field such as "http://" or "https://", and the first position refers to the beginning characters of the link in the third text. For example, it is judged whether the link begins with "http://" or "https://"; if so, the link is judged to be complete; if not, the link is completed so that it becomes complete.
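The completeness check amounts to a prefix test. In this sketch the names `COMPLETE_PREFIXES` and `is_complete` are illustrative, and the case-insensitive comparison is an added assumption.

```python
COMPLETE_PREFIXES = ("http://", "https://")  # first preset link fields from the text


def is_complete(link):
    """Return True when the link begins with one of the preset link fields."""
    return link.lower().startswith(COMPLETE_PREFIXES)


print(is_complete("https://www.baidu.com/image/5678.jpg"))
print(is_complete("//www.baidu.com/image/5678.jpg"))
```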
Referring to fig. 3, fig. 3 is a schematic diagram of a specific flow of the data cleansing method shown in fig. 2 after the link in the third text is determined to be incomplete, including:
301. and judging that the link in the third text is incomplete.
302. And acquiring a webpage link of a webpage corresponding to the webpage data, and judging whether a field corresponding to a second position in the webpage link is a second preset link field.
In this embodiment of the present application, the second preset link field may be a field such as "://" or "//", and the second position likewise refers to the beginning characters of the link in the third text. It is judged whether the page link begins with the second preset link field; if so, the page link is missing a field such as "http:" or "https:".
303. If so, extracting the network protocol from the webpage link, and performing completion processing on the link based on the network protocol so as to complete the link.
In the embodiment of the present application, if the page link is judged to begin with the second preset link field, the page link is missing a field such as "http:" or "https:". A network protocol, such as http or https, can then be extracted from the webpage link to complete the incomplete link.
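A minimal sketch of protocol-based completion, assuming the incomplete link begins with "//" and the protocol is taken from the URL of the page on which the link was found. The function name `complete_with_protocol` is hypothetical.

```python
from urllib.parse import urlsplit


def complete_with_protocol(link, page_url):
    """Prefix a link that starts with "//" with the page URL's protocol."""
    if not link.startswith("//"):
        return link  # not a protocol-relative link; leave it for other rules
    scheme = urlsplit(page_url).scheme  # e.g. "https"
    return f"{scheme}:{link}" if scheme else link


completed = complete_with_protocol(
    "//www.baidu.com/image/5678.jpg",
    "https://www.baidu.com/tech/news/1234.html",
)
print(completed)
```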
304. If not, judging whether a field corresponding to a third position in the webpage link is a third preset link field.
In this embodiment of the application, the third preset link field may be a field such as "/", and the third position refers to the last character of a link in the third text. For example, if the page link does not begin with the second preset link field, it is judged whether the page link ends with "/".
305. And if so, counting the occurrence frequency N of the third preset link field in the link of the third text, wherein N is an integer greater than 1.
For example, if the current page link ends with "/", the "/" at the end is deleted. The number of times "../" appears in the incomplete link is counted as n, the parts before and after the last n "/" of the current page link are deleted, and the remainder is spliced to obtain the basic link. The "../" and "./" in the incomplete link are replaced with empty strings, and the basic link and the incomplete link are spliced to obtain the complete link. For example, the current page link is "https://www.baidu.com/tech/news/1234.html" and the incomplete link is "../../image/5678.jpg", in which "../" appears 2 times. The parts before and after the last 2 "/" of the current page link are deleted, giving the basic link "https://www.baidu.com/". Replacing "../" in the incomplete link with empty strings gives "image/5678.jpg", which is spliced with the basic link to give the final link "https://www.baidu.com/image/5678.jpg". Note in particular that when "../" appears 0 times in the incomplete link, the basic link is obtained by deleting the part after the last "/" of the current page link.
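The basic-link rule can be sketched as below, assuming the counted field is the parent-directory marker "../", which is what the final link in the worked example implies. The function name is hypothetical, and the standard library's `urljoin` appears only as a cross-check, not as the method of the application.

```python
from urllib.parse import urljoin


def complete_relative_link(link, page_url):
    """Complete a relative link against the page URL, following the text literally."""
    if page_url.endswith("/"):
        page_url = page_url[:-1]  # drop the trailing "/" as described
    n = link.count("../")
    # Keep everything before the (n + 1)-th "/" from the end, then restore one "/".
    # No guard is included for n larger than the number of path segments.
    base = page_url.rsplit("/", n + 1)[0] + "/"
    remainder = link.replace("../", "").replace("./", "")
    return base + remainder


page = "https://www.baidu.com/tech/news/1234.html"
final_link = complete_relative_link("../../image/5678.jpg", page)
print(final_link)
print(urljoin(page, "../../image/5678.jpg"))  # standard resolution, for comparison
```

For page links that end with "/", this literal reading drops one more path segment than standard URL resolution would, so `urljoin` may be the safer choice where that matters.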
Referring to fig. 4, fig. 4 is a schematic flow chart of another data cleansing method provided in an embodiment of the present application, and as shown in the drawing, the data cleansing method may further include, in addition to the steps shown in fig. 1:
406. and based on the field processing rule, carrying out standardization processing on the field of the target type in the target data.
In this embodiment of the present application, the field of the target type may include a time field, and the field of the target type in the target data is normalized, that is, the time field is normalized.
It can be understood that, in an information release business scenario, the business requirement is to screen out information with a specific release time. If release times appear in many different formats, the screening cannot be carried out; once the time format is unified, the screening can be completed quickly with a regular expression.
Specifically, the format of the time field can be unified into "YYYY-MM-dd HH:MM:SS", which is convenient for subsequent direct use. Here "YYYY" represents the year, such as "2020" for the year 2020; the first "MM" represents the month, such as "07" for July; "dd" represents the day, such as "25" for the 25th; "HH" represents the hour, such as "15" for 15 o'clock; the second "MM" represents the minute, such as "16" for 16 minutes; and "SS" represents the second, such as "09" for 9 seconds. Thus "2020-07-25 15:16:09" represents the current time 15:16:09 on July 25, 2020.
For example, the specific way of normalizing the time field may be:
1. Processing formats such as today, yesterday, and some time before. For example, "today 20:20" is converted into "2020-07-25 17:20".
2. Converting x seconds ago, x minutes ago, x hours ago, and the like into the corresponding time. For example, "2 hours ago" is converted into "2020-07-25 13:16:09".
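The two rules above can be sketched with `datetime`. The English phrases matched here ("today", "yesterday", "x hours ago") are assumed renderings of whatever expressions the crawled pages actually use, and the function name `normalize_time` is hypothetical. The current time is passed in explicitly so the result is reproducible.

```python
import re
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # unified output format
UNITS = {"second": "seconds", "minute": "minutes", "hour": "hours"}


def normalize_time(raw, now):
    """Convert a relative time expression to the unified format; else return it unchanged."""
    text = raw.strip().lower()
    match = re.fullmatch(r"(\d+)\s*(second|minute|hour)s?\s+ago", text)
    if match:
        delta = timedelta(**{UNITS[match.group(2)]: int(match.group(1))})
        return (now - delta).strftime(FMT)
    match = re.fullmatch(r"(today|yesterday)\s+(\d{1,2}):(\d{2})", text)
    if match:
        day = now.date() - timedelta(days=0 if match.group(1) == "today" else 1)
        stamp = datetime(day.year, day.month, day.day,
                         int(match.group(2)), int(match.group(3)))
        return stamp.strftime(FMT)
    return raw


now = datetime(2020, 7, 25, 15, 16, 9)
print(normalize_time("2 hours ago", now))
print(normalize_time("today 17:20", now))
```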
The embodiment of the application also provides a data cleaning apparatus. The apparatus includes modules for performing the methods of fig. 1 to 4 described above. Specifically, referring to fig. 5, a schematic block diagram of a data cleaning apparatus provided in an embodiment of the present application is shown. The data cleaning apparatus of the present embodiment includes:
the acquiring module 501 is configured to crawl web page data and extract an HTML text from the web page data;
the processing module 502 is configured to delete a specified code block in the HTML text to obtain a first text;
the processing module 502 is further configured to replace the line feed label in the first text with a line feed character to obtain a second text;
the processing module 502 is further configured to delete a portion of the second text other than the target attribute value in the content corresponding to the link tag, so as to retain the link in the second text, and obtain a third text including the link;
the processing module 502 is further configured to delete the HTML tag and the continuous line break in the third text if it is detected that the link in the third text is complete, so as to obtain the target data after the web page data is cleaned.
In an optional embodiment, the processing module 502 is further configured to determine whether a field corresponding to a first position of a link in the third text is a first preset link field; if yes, judging that the link in the third text is complete; if not, judging that the link in the third text is incomplete.
In an optional embodiment, the processing module 502 is further configured to obtain a web page link of a web page corresponding to the web page data, and determine whether a field corresponding to a second position in the web page link is a second preset link field; the processing module 502 is further configured to, if the field corresponding to the second location in the web page link is a second preset link field, extract a network protocol from the web page link, and perform completion processing on the link based on the network protocol.
In an optional embodiment, the processing module 502 is further configured to determine whether a third location corresponding field in the web page link is a third preset link field if it is determined that the second location corresponding field in the web page link is not the second preset link field; the processing module 502 is further configured to count a number N of times that a third preset link field appears in a link of a third text if a field corresponding to a third position in the web link is the third preset link field, where N is an integer greater than 1; the processing module 502 is further configured to delete the designated partial fields before and after the last N third preset link fields from the web page link to obtain a basic link; the processing module 502 is further configured to perform a completion process on the link based on the basic link.
In an optional embodiment, the processing module 502 is specifically configured to filter out of the HTML text the characters that match a pre-established filter character list, and to delete the specified code block from the character-filtered HTML text, so as to obtain a first text.
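A minimal sketch of this two-stage step is given below. The contents of the filter character list and the choice of which code blocks count as "specified" are not given in this passage, so `FILTER_CHARS` and `CODE_BLOCK_PATTERN` are hypothetical values chosen only for illustration (full-width spaces, non-breaking spaces, carriage returns and tabs; script blocks, style blocks and HTML comments).

```python
import re

# Hypothetical filter list; the passage only says the list is pre-established.
FILTER_CHARS = ["\u3000", "\xa0", "\r", "\t"]

# Assumed "specified code blocks": scripts, styles and HTML comments.
CODE_BLOCK_PATTERN = re.compile(
    r"<script\b.*?</script\s*>|<style\b.*?</style\s*>|<!--.*?-->",
    re.IGNORECASE | re.DOTALL,
)


def to_first_text(html_text: str) -> str:
    """Filter listed characters, then delete the assumed code blocks."""
    for ch in FILTER_CHARS:
        html_text = html_text.replace(ch, "")
    return CODE_BLOCK_PATTERN.sub("", html_text)
```

Filtering characters first keeps stray whitespace characters from interfering with the later tag-level steps.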
In an optional embodiment, the processing module 502 is specifically configured to screen target content data from the web page data through a regular expression, and extract an HTML text from the target content data.
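The regular-expression screening could look roughly like the sketch below. The passage does not say which element holds the target content, so the pattern assumes, purely for illustration, a `div` whose `id` is `content`; `TARGET_PATTERN` and `extract_html_text` are hypothetical names.

```python
import re
from typing import Optional

# Assumed container: a <div id="content"> element (not stated in the source).
TARGET_PATTERN = re.compile(
    r'<div[^>]*\bid="content"[^>]*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)


def extract_html_text(page_data: str) -> Optional[str]:
    """Return the inner HTML of the assumed target container, or None."""
    match = TARGET_PATTERN.search(page_data)
    return match.group(1) if match else None
```

Because the pattern is non-greedy, it stops at the first closing `</div>`; a container with nested `div` elements would need a pattern tailored to the site or an HTML parser.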
In an optional embodiment, the processing module 502 is further configured to perform normalization processing on the field of the target type in the target data based on a field processing rule.
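As an illustration of such normalization, the sketch below assumes the target-type field is a date and the field processing rule rewrites every date as `YYYY-MM-DD`. Both the field type and the rule are assumptions; the passage names neither. `DATE_PATTERN` and `normalize_dates` are hypothetical names.

```python
import re

# Assumed rule: dates written with 年/月/日, "/", "." or "-" become YYYY-MM-DD.
DATE_PATTERN = re.compile(r"(\d{4})[年/.-](\d{1,2})[月/.-](\d{1,2})日?")


def normalize_dates(text: str) -> str:
    """Rewrite every recognized date in the text as zero-padded YYYY-MM-DD."""
    def to_iso(match: re.Match) -> str:
        year, month, day = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"

    return DATE_PATTERN.sub(to_iso, text)
```

Other field types, such as amounts or phone numbers, would each need their own rule in the same style.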
It should be noted that the functions of the functional modules of the data cleaning apparatus described in the embodiment of the present application may be specifically implemented according to the method in the method embodiment in fig. 1 to 4, and the specific implementation process may refer to the related description of the method embodiment in fig. 1 to 4, which is not described herein again.
Referring to fig. 6, fig. 6 is a schematic block diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 6, the electronic device includes a processor 601, a storage device 602, and a communication interface 603. The processor 601, the storage device 602, and the communication interface 603 may be connected by a bus or by other means; fig. 6 takes connection by a bus as an example. The communication interface 603 is controlled by the processor 601 to send and receive messages, the storage device 602 is configured to store a computer program comprising program instructions, and the processor 601 is configured to execute the program instructions stored in the storage device 602. Specifically, the processor 601 is configured to invoke the program instructions to perform:
crawling web page data and extracting HTML text from the web page data;
deleting a designated code block in the HTML text to obtain a first text;
replacing each line-break tag in the first text with a line-break character to obtain a second text;
deleting, from the content corresponding to each link tag in the second text, everything except the target attribute value, so that the link in the second text is retained and a third text including the link is obtained;
and if the link in the third text is detected to be complete, deleting the HTML tags and the consecutive line-break characters in the third text to obtain the target data, that is, the cleaned web page data.
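The five steps above can be sketched end to end as follows. This is a rough illustration under several assumptions that the steps themselves do not fix: the designated code blocks are taken to be script blocks, style blocks and comments; the line-break tags are `<br>` and `</p>`; the link tag is `<a>` with `href` as the target attribute; and a link counts as complete when it starts with `http://` or `https://`. The function name `clean_html` is hypothetical.

```python
import re


def clean_html(html_text: str) -> str:
    """Clean HTML text into plain text that keeps paragraph breaks and links."""
    # Step 1: delete the assumed code blocks to obtain the first text.
    first = re.sub(
        r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
        "",
        html_text,
        flags=re.IGNORECASE | re.DOTALL,
    )

    # Step 2: replace assumed line-break tags with "\n" to obtain the second text.
    second = re.sub(r"<br\s*/?>|</p\s*>", "\n", first, flags=re.IGNORECASE)

    # Step 3: reduce each opening <a ...> tag to its href value, so the link
    # stays at its original position, to obtain the third text.
    links = []

    def keep_href(match: re.Match) -> str:
        links.append(match.group(1))
        return match.group(1) + " "

    third = re.sub(
        r'<a\b[^>]*?\bhref\s*=\s*"([^"]*)"[^>]*>',
        keep_href,
        second,
        flags=re.IGNORECASE,
    )

    # Step 4: continue only if every retained link is complete.
    if not all(link.startswith(("http://", "https://")) for link in links):
        raise ValueError("incomplete link found; completion processing is needed first")

    # Step 5: delete remaining HTML tags and collapse consecutive newlines.
    no_tags = re.sub(r"<[^>]+>", "", third)
    return re.sub(r"\n{2,}", "\n", no_tags).strip()
```

With the input `<div><p>First<br>line</p><script>x()</script><p>See <a class="c" href="https://example.com/a.html">doc</a></p></div>`, the function returns three lines: `First`, `line`, and `See https://example.com/a.html doc`. Regular expressions are used here only to mirror the tag-level description; an HTML parser would be more robust on malformed pages.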
In an optional embodiment, the processor 601 is further configured to determine whether a field corresponding to the first position of the link in the third text is a first preset link field; if yes, judging that the link in the third text is complete; if not, judging that the link in the third text is incomplete.
In an optional embodiment, the processor 601 is further configured to obtain a web page link of a web page corresponding to the web page data, and determine whether a field corresponding to a second position in the web page link is a second preset link field; if so, extracting a network protocol from the webpage link, and performing completion processing on the link based on the network protocol.
In an optional embodiment, the processor 601 is further configured to: if it is determined that the field corresponding to the second position in the web page link is not the second preset link field, determine whether the field corresponding to a third position in the web page link is a third preset link field; if so, count the number N of times that the third preset link field appears in the link of the third text, where N is an integer greater than 1; delete the designated partial fields before and after the last N third preset link fields from the web page link to obtain a basic link; and perform completion processing on the link based on the basic link.
In an optional embodiment, the processor 601 is specifically configured to filter characters in the HTML text that match a pre-established filtered character list; and deleting the specified code block from the HTML text after the character filtering to obtain a first text.
In an optional embodiment, the processor 601 is specifically configured to screen target content data from the web page data through a regular expression, and extract an HTML text from the target content data.
In an optional embodiment, the processor 601 is further configured to perform a normalization process on a field of a target type in the target data based on a field processing rule.
It should be understood that in the embodiments of the present application, the processor 601 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The storage device 602 may include both read-only memory and random access memory, and provides instructions and data to the processor 601. A portion of the storage device 602 may also include non-volatile random access memory. For example, the storage device 602 may also store information about the device type.
In a specific implementation, the processor 601, the storage device 602, and the communication interface 603 described in this embodiment of the present application may carry out the implementations described in the method embodiments of fig. 1 to fig. 4, and may also carry out the implementation of the data cleaning apparatus described in this embodiment of the present application, which is not repeated here.
In the embodiments of the present application, after the web page data to be cleaned is crawled, the designated code blocks are deleted from the HTML text corresponding to the web page data; that is, unnecessary data such as advertisements and picture introductions at the end of a document is removed. Tags that denote a line break in the HTML text are replaced with line-break characters, so that the cleaned web page data still keeps the paragraph format of the text content in the original web page. In addition, for each link tag in the HTML text, everything except the target attribute value is deleted so that the link is retained, and the position of the link in the cleaned web page data is consistent with its position in the original web page. This makes it convenient for the medical system to quickly locate the required content or link in the extracted target content.
In another embodiment of the present application, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, implement:
crawling web page data and extracting HTML text from the web page data;
deleting a designated code block in the HTML text to obtain a first text;
replacing each line-break tag in the first text with a line-break character to obtain a second text;
deleting, from the content corresponding to each link tag in the second text, everything except the target attribute value, so that the link in the second text is retained and a third text including the link is obtained;
and if the link in the third text is detected to be complete, deleting the HTML tags and the consecutive line-break characters in the third text to obtain the target data, that is, the cleaned web page data.
The computer readable storage medium may be an internal storage unit of the electronic device of any of the foregoing embodiments, for example, a hard disk or a memory of the electronic device. The computer readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk provided on the electronic device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing a computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
It will be understood by those skilled in the art that all or part of the processes of the method embodiments described above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for data cleansing, comprising:
crawling web page data and extracting HTML text from the web page data;
deleting the designated code block in the HTML text to obtain a first text;
replacing each line-break tag in the first text with a line-break character to obtain a second text;
deleting, from the content corresponding to each link tag in the second text, everything except the target attribute value, so that the link in the second text is retained and a third text including the link is obtained;
and if the link in the third text is detected to be complete, deleting the HTML tags and the consecutive line-break characters in the third text to obtain the target data, that is, the cleaned web page data.
2. The method of claim 1, further comprising:
judging whether a field corresponding to the first position of the link in the third text is a first preset link field;
if yes, judging that the link in the third text is complete; if not, judging that the link in the third text is incomplete.
3. The method of claim 2, wherein after determining that the link in the third text is incomplete, the method further comprises:
acquiring a webpage link of a webpage corresponding to the webpage data, and judging whether a field corresponding to a second position in the webpage link is a second preset link field;
if so, extracting a network protocol from the webpage link, and performing completion processing on the link based on the network protocol.
4. The method of claim 3, wherein after determining that the link in the third text is incomplete, the method further comprises:
if the field corresponding to the second position in the webpage link is judged not to be the second preset link field, judging whether the field corresponding to the third position in the webpage link is a third preset link field or not;
if so, counting the occurrence frequency N of the third preset link field in the link of the third text, wherein N is an integer greater than 1;
deleting appointed partial fields before and after the last N third preset link fields from the webpage link to obtain a basic link;
and performing completion processing on the link based on the basic link.
5. The method of claim 1, wherein deleting the specified code block in the HTML text to obtain a first text comprises:
filtering characters matched with a pre-established filtering character list in the HTML text;
and deleting the specified code block from the HTML text after the character filtering to obtain a first text.
6. The data cleansing method according to claim 1, wherein extracting HTML text from the web page data comprises:
and screening target content data from the webpage data through a regular expression, and extracting an HTML text from the target content data.
7. The method of claim 1, wherein after deleting the HTML tag and the consecutive line breaks in the third text to obtain the target data after the web page data is cleaned, the method further comprises:
and based on a field processing rule, carrying out standardization processing on the field of the target type in the target data.
8. A data cleansing apparatus, comprising:
an acquisition module, configured to crawl web page data and extract HTML text from the web page data;
a processing module, configured to delete the specified code block in the HTML text to obtain a first text;
the processing module being further configured to replace each line-break tag in the first text with a line-break character to obtain a second text, and to delete, from the content corresponding to each link tag in the second text, everything except the target attribute value, so that the link in the second text is retained and a third text including the link is obtained;
and the processing module being further configured to delete the HTML tags and the consecutive line breaks in the third text if it is detected that the link in the third text is complete, so as to obtain the target data, that is, the cleaned web page data.
9. An electronic device, comprising a processor and a storage device, the processor and the storage device being interconnected, wherein the storage device is configured to store a computer program, the computer program comprising program instructions, and wherein the processor is configured to invoke the program instructions to perform the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN202010971243.5A 2020-09-16 2020-09-16 Data cleaning method and related equipment Active CN111931113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010971243.5A CN111931113B (en) 2020-09-16 2020-09-16 Data cleaning method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010971243.5A CN111931113B (en) 2020-09-16 2020-09-16 Data cleaning method and related equipment

Publications (2)

Publication Number Publication Date
CN111931113A true CN111931113A (en) 2020-11-13
CN111931113B CN111931113B (en) 2021-01-05

Family

ID=73334967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010971243.5A Active CN111931113B (en) 2020-09-16 2020-09-16 Data cleaning method and related equipment

Country Status (1)

Country Link
CN (1) CN111931113B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786947A (en) * 2004-12-07 2006-06-14 国际商业机器公司 System, method and program for extracting web page core content based on web page layout
CN1920815A (en) * 2006-05-09 2007-02-28 上海态格文化传播有限公司 Web page cleaning method based on web page content
CN101452485A (en) * 2008-12-31 2009-06-10 中国建设银行股份有限公司 Method and device for generating multidimensional cubic based on relational database
US20100005112A1 (en) * 2008-07-01 2010-01-07 Sap Ag Html file conversion
CN102591992A (en) * 2012-02-15 2012-07-18 苏州亚新丰信息技术有限公司 Webpage classification identifying system and method based on vertical search and focused crawler technology
CN103425765A (en) * 2013-08-06 2013-12-04 优视科技有限公司 Method and device for extracting webpage text and method and system for webpage preview
CN103440315A (en) * 2013-08-27 2013-12-11 北京工业大学 Web page cleaning method based on theme
CN103927370A (en) * 2014-04-23 2014-07-16 焦点科技股份有限公司 Network information batch acquisition method of combined text and picture information
CN104978325A (en) * 2014-04-03 2015-10-14 腾讯科技(深圳)有限公司 Webpage processing method and device, and user terminal
CN106055722A (en) * 2016-07-26 2016-10-26 重庆兆光科技股份有限公司 Web crawler capturing method and system
CN106453689A (en) * 2016-11-11 2017-02-22 四川长虹电器股份有限公司 Method for extracting and verifying URL (Uniform Resource Locator)
CN109543126A (en) * 2018-11-19 2019-03-29 四川长虹电器股份有限公司 Web page text information extracting method based on block text accounting
CN109657114A (en) * 2018-08-21 2019-04-19 国家计算机网络与信息安全管理中心 A method of extracting webpage semi-structured data
CN110020054A (en) * 2017-12-21 2019-07-16 腾讯科技(深圳)有限公司 Web page contents crawling method, device, computer equipment and storage medium
CN111597292A (en) * 2020-04-20 2020-08-28 安徽慧医信息科技有限公司 Text formatting cleaning method based on webpage label position
CN111639480A (en) * 2020-05-28 2020-09-08 深圳壹账通智能科技有限公司 Text labeling method based on artificial intelligence, electronic device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
昊天SEO: "python 爬虫 过滤全部htm标签 提取正文内容", 《HTTPS://WWW.168SEO.CN/PYTHON/24873.HTML》 *

Also Published As

Publication number Publication date
CN111931113B (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN109508191B (en) Code generation method and system
CN112118232B (en) Message protocol analysis method and device
CN107391675B (en) Method and apparatus for generating structured information
CN110333863B (en) Method and device for generating and displaying applet page
CN109376291B (en) Website fingerprint information scanning method and device based on web crawler
CN110851681B (en) Crawler processing method, crawler processing device, server and computer readable storage medium
CN105205080B (en) Redundant file method for cleaning, device and system
US10755091B2 (en) Method and apparatus for retrieving image-text block from web page
CN109241391A (en) A kind of anti-crawler method climbed of solution font
CN112232075A (en) Article release time identification method based on time format and webpage element characteristics
CN111984262A (en) WeChat cascading style sheet file processing method, device, equipment and storage medium
CN107368500B (en) Data extraction method and system
CN109542501B (en) Browser table compatibility method and device, computer equipment and storage medium
CN113360106B (en) Webpage printing method and device
CN111125485A (en) Website URL crawling method based on Scapy
CN111931113B (en) Data cleaning method and related equipment
CN108073589B (en) Method and device for acquiring webpage elements
WO2019071899A1 (en) Electronic device, vehicle data import method and storage medium
CN112328246A (en) Page component generation method and device, computer equipment and storage medium
CN106897287B (en) Webpage release time extraction method and device for webpage release time extraction
CN109492146B (en) Method and device for preventing WEB crawler
CN111126058A (en) Text information automatic extraction method and device, readable storage medium and electronic equipment
CN111833219A (en) Method and device for providing intellectual property service commodity data
CN111444456B (en) Style editing method and device and electronic equipment
CN111966930B (en) Webpage list analyzing method and system based on XPath sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant