CN109492201A - Document format conversion method applied to magnitude comparison

Publication number: CN109492201A
Application number: CN201811327145.7A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 童友俊
Current assignee: Dalian Han Wen Information Co Ltd
Original assignee: Dalian Han Wen Information Co Ltd
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2019-03-19
Legal status: Pending
Prior art keywords: data, label, format conversion, file, node

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • G06F40/154: Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention provides a document format conversion method applied to magnitude comparison. The method comprises: parsing web page files, extracting the key first data from them, and grouping that data; cleaning the grouped first data to extract more concise second data; storing the extracted second data in a file that occupies less storage than the original and in which the data can be viewed directly; and storing the files in batches for use in magnitude comparison. The entire flow of the proposed method is fast and efficient, and it solves the technical problem in the prior art that files in the original web page format cannot be presented directly in CSV format. It achieves the technical effect of rapid format conversion for large numbers of data files.

Description

Document format conversion method applied to magnitude comparison
Technical field
The present invention relates to the technical field of web page file processing, and in particular to a document format conversion method applied to magnitude comparison.
Background art
The Internet is building a completely new development and operating model for global industry, driving comprehensive innovation in industrial organization, service, and business models and accelerating industrial transformation and upgrading. The HyperText Markup Language (HTML) is the common language of the web: a simple, general-purpose markup language. It allows web page authors to build complex pages that combine text and pictures, and these pages can be viewed by anyone else online regardless of the computer or browser used. However, the data files downloaded from existing websites are HTML files, which are inconvenient for post-processing.
Summary of the invention
In view of the technical problem set forth above, a document format conversion method applied to magnitude comparison is provided.
The technical means adopted by the present invention are as follows:
A document format conversion method applied to magnitude comparison includes the following steps:
S1: read the basic information of the web page file;
S2: parse the web page file and extract the required first data from it;
S3: perform data cleansing on the first data to extract more concise second data;
S4: store the extracted second data in a file that occupies less storage than the original and in which the data can be viewed directly;
S5: store the files in batches for use in magnitude comparison.
Further, in step S1, the web page file is HTML code made up of a character string.
Further, parsing the web page file specifically comprises:
creating a tag-locating variable for extracting the preset data from the HTML;
determining whether the raw data to be extracted is nested within tags; if it is nested, matching layer by layer until the text tag is matched; if it is not nested, matching the text tag directly;
returning the successfully matched text tag in a tuple container consisting of a root node and an ID dictionary, where each dictionary key is the value of a defined ID attribute and each dictionary value is a reference to the element with that ID;
extracting the raw data according to the keys and values of the dictionary to obtain the first data.
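The tuple of root node and ID dictionary described above resembles what `xml.etree.ElementTree.XMLID` in Python's standard library returns. The source does not name a library, so the following is only a minimal sketch under that assumption; the markup, the IDs `price` and `unit`, and the helper name `extract_first_data` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment. Real-world HTML is often not valid XML
# and would need a more tolerant parser.
PAGE = """
<table>
  <tr><td id="price">12.50</td><td id="unit">kg</td></tr>
  <tr><td>no id, so not in the dictionary</td></tr>
</table>
"""


def extract_first_data(markup):
    """Return {id: text} for every element that carries an id attribute."""
    # XMLID returns a (root element, {id value: element}) tuple.
    root, by_id = ET.XMLID(markup)
    return {key: (elem.text or "").strip() for key, elem in by_id.items()}


first_data = extract_first_data(PAGE)
print(first_data)
```

Looking elements up by ID avoids walking the tree again for each value, which is presumably why the dictionary is returned alongside the root node.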
Further, the data cleansing specifically comprises: parsing the HTML document from a string constant into a tree of nodes, in which the hierarchy is expressed through nodes, parent nodes, and child nodes and each node carries its corresponding tag; locating the tag elements by string matching, extracting the tags and text, and assigning them to a variable; deleting from the tree or subtree all elements with the given tag names, together with their attributes; merging the text content and child elements of the deleted elements into their parent element; serializing the element to an encoded string representation of its XML tree, where the encoding defaults to ASCII and changing it to an encoding that is not UTF-8 compatible enables the XML declaration by default; and then placing the extracted elements in a container, from which they are extracted by dictionary ID and cleaned up.
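The step of deleting a tag while merging its text and child elements into the parent can be sketched with the standard `xml.etree.ElementTree` module. This is an illustrative assumption rather than the authors' implementation; `strip_tag` and `_append_text` are hypothetical helper names, and the sample markup is made up.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach text just before position index inside parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def strip_tag(parent, tag):
    """Remove every <tag> below parent, keeping its text and children."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag == tag:
            grandchildren = list(child)
            del parent[i]
            # Text inside the removed element moves up into the parent.
            _append_text(parent, i, child.text)
            for offset, grandchild in enumerate(grandchildren):
                parent.insert(i + offset, grandchild)
            _append_text(parent, i + len(grandchildren), child.tail)
            # i is not advanced: the spliced-in children are examined next.
        else:
            strip_tag(child, tag)
            i += 1


cell = ET.fromstring("<td>Price: <b>12</b> <span>USD</span></td>")
strip_tag(cell, "b")
cleaned = ET.tostring(cell, encoding="unicode")
print(cleaned)
```

Only the markup of the removed tag disappears; the text it wrapped stays in reading order, so the cleaned cell still carries the value.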
Further, for tag nodes with a nesting relationship, the tag markup is removed during the layer-by-layer processing.
The present invention also provides a storage medium on which a program is stored; when the program runs, it executes the above method.
The present invention also provides a processor; when the program runs, the processor executes the above method.
Compared with the prior art, the present invention has the following advantages:
The entire flow of the proposed method is fast and efficient, and it solves the technical problem in the prior art that files in the original web page format cannot be presented directly in CSV format. It achieves the technical effect of rapid format conversion for large numbers of data files. The invention also cleans the data, which makes the data easier to inspect. During format conversion the number of bytes occupied by the data decreases, so the overall size shrinks, which makes the files easier to store and helps save space. The method can process files in batches, which simplifies file management.
For the above reasons, the present invention can be widely applied in the field of web page file processing.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the document format conversion method applied to magnitude comparison according to the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The document format conversion method applied to magnitude comparison can process data quickly and effectively and improves the accuracy of the data. Before a magnitude comparison is made, converting and processing the file format is essential: it determines how quickly and efficiently the data can be processed and how accurate the data will be.
As shown in Fig. 1, the present invention provides a document format conversion method applied to magnitude comparison, including the following steps:
Before a web page is parsed, it is hard to obtain data in an intuitive form, which strongly affects subsequent processing and makes processing quality difficult to guarantee. Slow, inefficient data processing also wastes a great deal of time and labor.
Therefore, the basic information of the web page file is read first (S1);
the web page file is parsed and the required first data is extracted from it (S2);
If the data is not cleaned and processed accurately, the subsequent results are strongly affected and become less accurate. The first data is therefore cleaned to extract more concise second data (S3);
the extracted second data is stored in a file that occupies less storage than the original and in which the data can be viewed directly (S4);
the files are stored in batches for use in magnitude comparison (S5).
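The storage and batch steps can be sketched with Python's standard `csv` module, since the description names CSV as the target format. The function names `write_csv` and `batch_store`, the file names, and the sample rows are hypothetical; the rows stand in for the cleaned second data.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, rows):
    """Write one cleaned table to a CSV file and return its size in bytes."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return Path(path).stat().st_size


def batch_store(tables, out_dir):
    """Store every {name: rows} entry as out_dir/name.csv; return the sizes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    return {name: write_csv(out_dir / f"{name}.csv", rows)
            for name, rows in tables.items()}


# Hypothetical cleaned data (the "second data") for two source pages.
tables = {
    "page_a": [["item", "value"], ["mass", "12.50"]],
    "page_b": [["item", "value"], ["mass", "12.48"]],
}
with tempfile.TemporaryDirectory() as tmp:
    sizes = batch_store(tables, tmp)
    first_file = (Path(tmp) / "page_a.csv").read_text(encoding="utf-8")
print(sizes)
```

Returning the byte size of each file makes it easy to check the claim that the converted file is smaller than the HTML it came from.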
In step S1, the web page file is HTML code made up of a character string. The source file is a code file written in HTML, and HTML code consists of a large number of tags. Because tags are nested in layers and classified by attributes, a tag can be located by creating an expression that finds it. Data with similar tags is then placed in a container, the data in the container is cleaned and organized, and the result is extracted and written to a file in the target format.
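As an illustration of locating tags with an expression and collecting data with similar tags in one container, the sketch below uses the limited XPath support of the standard `xml.etree.ElementTree` module. The choice of library, the `class` attribute as the grouping key, and the name `group_by_class` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed fragment with two kinds of span.
PAGE = """
<div>
  <span class="name">mass</span><span class="value">12.50</span>
  <span class="name">length</span><span class="value">3.20</span>
</div>
"""


def group_by_class(markup, tag="span"):
    """Collect the text of every <tag>, grouped by its class attribute."""
    root = ET.fromstring(markup)
    groups = defaultdict(list)
    # ".//span[@class]" is the locating expression: any depth, class required.
    for elem in root.findall(f".//{tag}[@class]"):
        groups[elem.get("class")].append((elem.text or "").strip())
    return dict(groups)


groups = group_by_class(PAGE)
print(groups)
```

Each list in the resulting dictionary is one container of similar data, ready to be cleaned and written out as a column.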
Parsing the web page file specifically comprises:
creating a tag-locating variable for extracting the preset data from the HTML;
determining whether the raw data to be extracted is nested within tags; if it is nested, matching layer by layer until the text tag is matched; if it is not nested, matching the text tag directly;
returning the successfully matched text tag in a tuple container (root node, ID dictionary), where each dictionary key is the value of a defined ID attribute and each dictionary value is a reference to the element with that ID;
Through its parser keyword argument, the call that produces this tuple can be given a different parser object.
extracting the raw data according to the keys and values of the dictionary to obtain the first data.
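The layer-by-layer matching for nested tags, and the direct match for tags that are not nested, can be sketched with the tolerant `html.parser` module from the standard library. The class name `LayeredTextExtractor`, the helper `extract`, and the tag paths are hypothetical, and the patent does not state that this parser was used.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


class LayeredTextExtractor(HTMLParser):
    """Collect text that sits at a given nesting path, e.g. table > tr > td."""

    def __init__(self, path):
        super().__init__()
        self.path = list(path)   # tags to match, outermost first
        self.stack = []          # currently open tags
        self.matches = []        # text found at the end of the path

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._matches_path():
            self.matches.append(text)

    def _matches_path(self):
        # The innermost open tag must be the last path entry, and the path
        # must appear in order (layer by layer) within the open-tag stack.
        if not self.stack or self.stack[-1] != self.path[-1]:
            return False
        remaining = iter(self.stack)
        return all(tag in remaining for tag in self.path)


def extract(markup, path):
    parser = LayeredTextExtractor(path)
    parser.feed(markup)
    parser.close()
    return parser.matches


PAGE = ("<html><body><table><tr><td>mass</td><td>12.50</td></tr></table>"
        "<p>note</p></body></html>")
nested_cells = extract(PAGE, ["table", "tr", "td"])  # nested: three layers
flat_text = extract(PAGE, ["p"])                      # not nested: one tag
print(nested_cells, flat_text)
```

A single-entry path covers the non-nested case, so the same routine serves both branches of the decision described above.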
The data cleansing specifically comprises: parsing the HTML document from a string constant into a tree of nodes, in which the hierarchy is expressed through nodes, parent nodes, and child nodes and each node carries its corresponding tag; locating the tag elements by string matching, extracting the tags and text, and assigning them to a variable; deleting from the tree or subtree all elements with the given tag names, together with their attributes; merging the text content and child elements of the deleted elements into their parent element; serializing the element to an encoded string representation of its XML tree, where the encoding defaults to ASCII and changing it to an encoding that is not UTF-8 compatible enables the XML declaration by default; and then placing the extracted elements in a container, from which they are extracted by dictionary ID and cleaned up.
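The serialization behaviour described here (ASCII by default, with a declaration appearing once the encoding is neither ASCII nor UTF-8) can be observed with `xml.etree.ElementTree.tostring` from Python's standard library. Whether the authors used this module is not stated in the source, and the sample cell is hypothetical.

```python
import xml.etree.ElementTree as ET

# "\u4ef7\u683c" is the Chinese word for "price".
cell = ET.fromstring("<td>\u4ef7\u683c 12.50</td>")

# Default: ASCII bytes, non-ASCII characters become numeric character
# references, and no XML declaration is written.
ascii_bytes = ET.tostring(cell)

# UTF-8 output also carries no declaration by default.
utf8_bytes = ET.tostring(cell, encoding="utf-8")

# An encoding that is neither ASCII nor UTF-8 gets a declaration by default.
gbk_bytes = ET.tostring(cell, encoding="gbk")

for label, data in (("ascii", ascii_bytes), ("utf-8", utf8_bytes),
                    ("gbk", gbk_bytes)):
    print(label, data)
```

The declaration matters for the non-UTF-8 case because a reader of the file would otherwise have no way to know which encoding the bytes use.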
For tag nodes with a nesting relationship, the tag markup is removed during the layer-by-layer processing; otherwise it would affect the data and make it disordered.
This embodiment provides a storage medium that includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the above method.
This embodiment provides a processor that runs a program, wherein, when the program runs, the device on which the processor resides is controlled to execute the above method.
The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
The functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, and some or all of their technical features can be replaced by equivalents, without causing the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A document format conversion method applied to magnitude comparison, characterized by comprising the following steps:
S1: reading the basic information of the web page file;
S2: parsing the web page file and extracting the required first data from it;
S3: performing data cleansing on the first data to extract more concise second data;
S4: storing the extracted second data in a file that occupies less storage than the original and in which the data can be viewed directly;
S5: storing the files in batches for use in magnitude comparison.
2. The document format conversion method applied to magnitude comparison according to claim 1, characterized in that, in step S1, the web page file is HTML code made up of a character string.
3. The document format conversion method applied to magnitude comparison according to claim 2, characterized in that parsing the web page file specifically comprises:
creating a tag-locating variable for extracting the preset data from the HTML;
determining whether the raw data to be extracted is nested within tags; if it is nested, matching layer by layer until the text tag is matched; if it is not nested, matching the text tag directly;
returning the successfully matched text tag in a tuple container consisting of a root node and an ID dictionary, where each dictionary key is the value of a defined ID attribute and each dictionary value is a reference to the element with that ID;
extracting the raw data according to the keys and values of the dictionary to obtain the first data.
4. The document format conversion method applied to magnitude comparison according to claim 3, characterized in that the data cleansing specifically comprises: parsing the HTML document from a string constant into a tree of nodes, in which the hierarchy is expressed through nodes, parent nodes, and child nodes and each node carries its corresponding tag; locating the tag elements by string matching, extracting the tags and text, and assigning them to a variable; deleting from the tree or subtree all elements with the given tag names, together with their attributes; merging the text content and child elements of the deleted elements into their parent element; serializing the element to an encoded string representation of its XML tree, where the encoding defaults to ASCII and changing it to an encoding that is not UTF-8 compatible enables the XML declaration by default; and then placing the extracted elements in a container, from which they are extracted by dictionary ID and cleaned up.
5. The document format conversion method applied to magnitude comparison according to claim 3, characterized in that, for tag nodes with a nesting relationship, the tag markup is removed during the layer-by-layer processing.
6. A storage medium, characterized in that a program is stored on the storage medium, and when the program runs it executes the method of any one of claims 1 to 5.
7. A processor, characterized in that, when the program runs, the processor executes the method of any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811327145.7A CN109492201A (en) 2018-11-08 2018-11-08 Document format conversion method applied to magnitude comparison

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811327145.7A CN109492201A (en) 2018-11-08 2018-11-08 Document format conversion method applied to magnitude comparison

Publications (1)

Publication Number Publication Date
CN109492201A true CN109492201A (en) 2019-03-19

Family

ID=65694104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811327145.7A Pending CN109492201A (en) 2018-11-08 2018-11-08 Document format conversion method applied to magnitude comparison

Country Status (1)

Country Link
CN (1) CN109492201A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008102773A (en) * 2006-10-19 2008-05-01 R & W:Kk Method for converting data into common format
CN103870487A (en) * 2012-12-13 2014-06-18 腾讯科技(深圳)有限公司 Webpage file processing method and mobile terminal
CN108446136A (en) * 2018-03-22 2018-08-24 北京焦点新干线信息技术有限公司 A kind of extracting method and system of element code

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125024A (en) * 2019-11-29 2020-05-08 浪潮电子信息产业股份有限公司 Method, device, equipment and storage medium for deleting distributed system files
CN111125024B (en) * 2019-11-29 2022-05-24 浪潮电子信息产业股份有限公司 Method, device, equipment and storage medium for deleting distributed system files
US12001397B2 (en) 2019-11-29 2024-06-04 Inspur Electronic Information Industry Co., Ltd. Method, apparatus and device for deleting distributed system file, and storage medium
CN116127079A (en) * 2023-04-20 2023-05-16 中电科大数据研究院有限公司 Text classification method
CN116127079B (en) * 2023-04-20 2023-06-20 中电科大数据研究院有限公司 Text classification method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190319