CN109492201A

CN109492201A - Document format conversion method applied to magnitude comparison

Info

Publication number: CN109492201A
Application number: CN201811327145.7A
Authority: CN
Inventors: 童友俊
Original assignee: Dalian Han Wen Information Co Ltd
Current assignee: Dalian Han Wen Information Co Ltd
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2019-03-19

Abstract

The present invention provides a document format conversion method applied to magnitude comparison. The method comprises: parsing a web page file, extracting the key first data from it, and grouping that data; cleaning the grouped first data and extracting more concise second data; storing the extracted second data in a file that occupies less storage than the original and in which the data can be inspected directly; and batch-storing such files for use in magnitude comparison. The whole workflow is fast and efficient, and it solves the technical problem in the prior art that a file in its original web page form cannot be presented directly in CSV format. It achieves the technical effect of rapid format conversion for large volumes of data files.

Description

Document format conversion method applied to magnitude comparison

Technical field

The present invention relates to the technical field of web page file processing and, more particularly, to a document format conversion method applied to magnitude comparison.

Background art

The Internet is building entirely new development and operating models for industry worldwide, driving comprehensive innovation in industrial organization, service models and business models, and accelerating industrial transformation and upgrading. HyperText Markup Language (HTML) is the common language of the web: a simple, general-purpose markup language. It allows web page authors to build complex pages that combine text and images, and these pages can be viewed by anyone else online regardless of the computer or browser used. However, the data files downloaded from existing websites are HTML files, which are inconvenient for subsequent processing.

Summary of the invention

In view of the technical problem set forth above, a document format conversion method applied to magnitude comparison is provided.

The technical means adopted by the present invention are as follows:

A document format conversion method applied to magnitude comparison, comprising the following steps:

reading the basic information of a web page file;

parsing the web page file and extracting the required first data from it;

cleaning the first data and extracting more concise second data;

storing the extracted second data in a file that occupies less storage than the original and in which the data can be inspected directly;

batch-storing the files for use in magnitude comparison.

Further, in step S1, the web page file is HTML code consisting of a string of characters.

Further, the parsing of the web page file specifically comprises:

creating a tag-locating variable for extracting the preset data from the HTML;

determining whether the raw data to be extracted contains nested tags; if it does, performing layered matching, layer by layer, until a text tag is matched; if it does not, matching the text tag directly;

returning the successfully matched text tags as a tuple container, the tuple container consisting of a root node and an ID dictionary, in which each dictionary key is the value of a defined ID attribute and the element referenced by that ID is stored as the dictionary value;

extracting the raw data according to the keys and values of the dictionary to obtain the first data.
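The layered matching described in the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: it assumes the input is well-formed markup that the standard-library `xml.etree.ElementTree` parser accepts, and the helper name `match_text_nodes` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def match_text_nodes(element):
    """Descend layer by layer until elements that directly hold text are reached.

    Hypothetical helper: returns the text of every leaf element below `element`.
    """
    children = list(element)
    if not children:
        # No nesting: match the text tag directly.
        text = (element.text or "").strip()
        return [text] if text else []
    # Nesting present: match this layer, then continue into each child.
    matched = []
    for child in children:
        matched.extend(match_text_nodes(child))
    return matched


# Sample markup (an assumption for illustration only).
SAMPLE = "<tr><td><span><b>10.5</b></span></td><td>mm</td></tr>"
row = ET.fromstring(SAMPLE)
values = match_text_nodes(row)
print(values)
```

The sketch only collects text held by leaf elements; text that sits beside child elements inside the same parent is ignored here and would need separate handling.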

Further, the data cleaning specifically comprises: parsing the HTML document from a string constant into tree-structured nodes, whose hierarchy is expressed as nodes, parent nodes and child nodes, each node containing its corresponding tag; locating the tag element by string matching, extracting the tag and its text, and assigning them to a variable; then deleting from the tree or subtree all elements with the given tag names together with their attributes, and merging the text content and child elements of each deleted element into its parent element; serializing the element to an encoded string representation of its XML tree, which defaults to ASCII encoding, where changing the encoding to one that is not UTF-8 compatible enables the declaration by default; and then placing the extracted elements in a container, retrieving them by dictionary ID, and cleaning them up.

Further, for tag nodes with a nesting relationship, the tag markup is removed during layer-by-layer processing.
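Removing the markup of nested tags while merging their text and child elements into the parent can be sketched as below. The standard-library `xml.etree.ElementTree` module has no built-in function for this, so `strip_tags` and `_append_text` are hypothetical helpers written for this illustration, and well-formed input is assumed.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach text to whatever precedes position `index` inside `parent`."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def strip_tags(element, *tag_names):
    """Remove elements named in `tag_names`, keeping their text and children."""
    index = 0
    while index < len(element):
        child = element[index]
        if child.tag not in tag_names:
            strip_tags(child, *tag_names)  # clean the next layer down
            index += 1
            continue
        # Text at the start of the removed tag joins the preceding text.
        _append_text(element, index, child.text)
        grandchildren = list(child)
        if grandchildren:
            # Text after the removed tag now follows its last child.
            last = grandchildren[-1]
            last.tail = (last.tail or "") + (child.tail or "")
        else:
            _append_text(element, index, child.tail)
        # Replace the tag by its own children, then re-check this position.
        element.remove(child)
        for offset, grandchild in enumerate(grandchildren):
            element.insert(index + offset, grandchild)
    return element


doc = ET.fromstring("<td>a<span>1<b>2</b>3</span>z</td>")
strip_tags(doc, "span", "b")
cleaned = ET.tostring(doc, encoding="unicode")
print(cleaned)
```

Because a removed tag is replaced by its children and the same position is checked again, tags nested several layers deep are removed in one pass.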

The present invention also provides a storage medium on which a program is stored, the program executing the above method when run.

The present invention also provides a processor that executes the above method when the program is run.

Compared with the prior art, the invention has the following advantages that

The whole workflow of the method proposed by the present invention is fast and efficient, and it solves the technical problem in the prior art that a file in its original web page form cannot be presented directly in CSV format. It achieves the technical effect of rapid format conversion for large volumes of data files. The invention also cleans the data, which makes the data easier to inspect. During format conversion the number of bytes occupied by the data decreases, so the overall storage footprint shrinks, storage becomes easier, and space is saved. The method can process files in batches, which simplifies file management.
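As a rough illustration of the size reduction described above, the following Python sketch compares the byte count of a small HTML table with that of the equivalent CSV text. The sample table is an assumption made for this example, the input is assumed to be well-formed, and the simple comma join assumes that no cell value itself contains a comma.

```python
import xml.etree.ElementTree as ET

# Sample table (an assumption for illustration only).
HTML_TABLE = (
    "<table>"
    "<tr><td>length</td><td>10.5</td></tr>"
    "<tr><td>width</td><td>4.2</td></tr>"
    "</table>"
)

root = ET.fromstring(HTML_TABLE)
# One list of cell texts per table row.
rows = [[(td.text or "").strip() for td in tr.iter("td")] for tr in root.iter("tr")]
csv_text = "\n".join(",".join(row) for row in rows) + "\n"

html_size = len(HTML_TABLE.encode("utf-8"))
csv_size = len(csv_text.encode("utf-8"))
print(f"HTML: {html_size} bytes, CSV: {csv_size} bytes")
```

The saving comes from dropping the tag markup; how large it is for real pages depends on how much markup surrounds each value.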

For the above reasons, the present invention can be widely applied in the field of web page file processing.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below evidently show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of the document format conversion method applied to magnitude comparison according to the present invention.

Specific embodiment

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so termed are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have", and any variations of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.

The document format conversion method applied to magnitude comparison according to the present invention can process data quickly and effectively and improves data accuracy. Before a magnitude comparison is performed, converting and processing the file format is essential: it determines how quickly and efficiently the data can be processed, and it determines the accuracy of the data.

As shown in Fig. 1, the present invention provides a document format conversion method applied to magnitude comparison, comprising the following steps:

Before a web page is parsed, it is hard to obtain the data in a directly readable form, which strongly affects subsequent operations and makes processing quality difficult to guarantee. Data processing is then slow and inefficient, wasting considerable time and labor.

Therefore, the basic information of the web page file is read;

If the data cannot be cleaned and processed accurately, the subsequent results are strongly affected and their accuracy is reduced. Therefore, the first data is cleaned and more concise second data is extracted;

The files are batch-stored for use in magnitude comparison.
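The batch storage step can be sketched as a loop over a directory of web page files. The code below is a minimal Python illustration under stated assumptions: the function names `convert_file` and `convert_directory`, the directory layout and the sample files are hypothetical, each input file is assumed to be well-formed markup containing a table, and only the standard library is used.

```python
import csv
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def convert_file(html_path, csv_path):
    """Convert one well-formed HTML table file into a CSV file."""
    root = ET.parse(html_path).getroot()
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for tr in root.iter("tr"):
            writer.writerow([(td.text or "").strip() for td in tr.iter("td")])


def convert_directory(source_dir, target_dir):
    """Convert every *.html file in source_dir and return the CSV paths written."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(Path(source_dir).glob("*.html")):
        csv_path = target / (html_path.stem + ".csv")
        convert_file(html_path, csv_path)
        written.append(csv_path)
    return written


# Demonstration on two small sample files created in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "html"
    source.mkdir()
    for name, value in [("a", "1.0"), ("b", "2.5")]:
        (source / f"{name}.html").write_text(
            f"<html><body><table><tr><td>{name}</td><td>{value}</td></tr>"
            "</table></body></html>",
            encoding="utf-8",
        )
    outputs = convert_directory(source, Path(workdir) / "csv")
    produced = {path.name: path.read_text(encoding="utf-8") for path in outputs}

print(sorted(produced))
```

Keeping one CSV file per source file, with the same base name, makes it straightforward to trace each converted file back to the page it came from.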

In step S1, the web page file is HTML code consisting of a string of characters. The source file is a code file written in HTML, and HTML code consists of a large number of tags. Because tags can be nested in layers and classified by attribute, an expression is created to locate the desired tag; the data with similar tags is then placed in a container, the data in this container is cleaned, sorted and extracted, and the result is written to a file in the target format.
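The locate, collect, clean and write sequence described above might look like the following Python sketch. The sample page, the class names and the locating expression `VALUE_PATH` are assumptions made for this example; the expression uses the limited XPath subset supported by the standard-library `xml.etree.ElementTree` module, and well-formed input is assumed.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample page (an assumption for illustration only).
PAGE = """<html><body>
<div class="record"><span class="label">length</span><span class="value"> 10.5 mm </span></div>
<div class="record"><span class="label">width</span><span class="value"> 4.2 mm </span></div>
<div class="footer"><span class="value">ignored</span></div>
</body></html>"""

# Hypothetical locating expression: value spans inside record divs only.
VALUE_PATH = ".//div[@class='record']/span[@class='value']"

root = ET.fromstring(PAGE)
container = root.findall(VALUE_PATH)  # elements that share similar tags

# Cleaning: collapse runs of whitespace and trim both ends.
cleaned = [" ".join((element.text or "").split()) for element in container]

# Write the cleaned values in the target format (CSV, in memory here).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["value"])
writer.writerows([value] for value in cleaned)
csv_text = buffer.getvalue()
print(csv_text)
```

The footer value is skipped because the expression constrains both the enclosing tag and its attribute, which is the kind of nesting and attribute classification the paragraph relies on.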

The parsing of the web page file specifically comprises:

creating a tag-locating variable for extracting the preset data from the HTML;

returning the successfully matched text tags as a tuple container (that is, root node and ID dictionary); the tuple container consists of a root node and an ID dictionary, in which each dictionary key is the value of a defined ID attribute and the element referenced by that ID is stored as the dictionary value;

By means of its parser keyword argument, the call that returns this tuple can use a different parser object to parse the file.
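A tuple of root node and ID dictionary, as described above, is what the standard-library function `xml.etree.ElementTree.XMLID` returns, so it can serve as a small Python illustration. Whether the invention uses this function is not stated in the source; the sample page and the name `first_data` are assumptions, and well-formed input is assumed.

```python
import xml.etree.ElementTree as ET

# Sample page (an assumption for illustration only).
PAGE = (
    "<html><body>"
    "<p id='length'>10.5</p>"
    "<p id='width'>4.2</p>"
    "<p>no id attribute, so not in the dictionary</p>"
    "</body></html>"
)

# XMLID parses a string and returns (root element, {id value: element}).
root, id_map = ET.XMLID(PAGE)

# Extract the first data from the keys and values of the dictionary.
first_data = {key: element.text for key, element in id_map.items()}
print(root.tag, first_data)
```

`XMLID` also accepts an optional `parser` argument, for example `ET.XMLID(PAGE, parser=ET.XMLParser())`, when a parser other than the default one should be used.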

The data cleaning specifically comprises: parsing the HTML document from a string constant into tree-structured nodes, whose hierarchy is expressed as nodes, parent nodes and child nodes, each node containing its corresponding tag; locating the tag element by string matching, extracting the tag and its text, and assigning them to a variable; then deleting from the tree or subtree all elements with the given tag names together with their attributes, and merging the text content and child elements of each deleted element into its parent element; serializing the element to an encoded string representation of its XML tree, which defaults to ASCII encoding, where changing the encoding to one that is not UTF-8 compatible enables the declaration by default; and then placing the extracted elements in a container, retrieving them by dictionary ID, and cleaning them up.
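The serialization behaviour mentioned above (ASCII by default, with a declaration appearing once a non-UTF-8-compatible encoding is chosen) can be observed with the standard-library function `xml.etree.ElementTree.tostring`. The sketch below is only an illustration of that behaviour; the sample cell and the choice of `iso-8859-1` as the alternative encoding are assumptions.

```python
import xml.etree.ElementTree as ET

# A cell whose text contains one non-ASCII character (e with acute accent).
cell = ET.fromstring("<td>caf\u00e9</td>")

# Default serialization: US-ASCII bytes, no XML declaration;
# the non-ASCII character is written as a numeric character reference.
ascii_bytes = ET.tostring(cell)

# An encoding that is neither US-ASCII nor UTF-8: a declaration is added.
latin1_bytes = ET.tostring(cell, encoding="iso-8859-1")

print(ascii_bytes)
print(latin1_bytes)
```

When the serialized text is destined for a CSV file, the declaration line and the character references are usually unwanted, so serializing with `encoding="unicode"` or reading the element text directly may be the simpler choice.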

For tag nodes with a nesting relationship, the tag markup is removed during layer-by-layer processing; otherwise it would affect the data and make the data disordered.

In this embodiment, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute the above method.

In this embodiment, a processor is provided. The processor is configured to run a program, wherein, when the program runs, the device in which the processor is located is controlled to execute the above method.

The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

The functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A document format conversion method applied to magnitude comparison, characterized by comprising the following steps:

reading the basic information of a web page file;

batch-storing the file for use in magnitude comparison.

2. The document format conversion method applied to magnitude comparison according to claim 1, characterized in that, in step S1, the web page file is HTML code consisting of a string of characters.

3. The document format conversion method applied to magnitude comparison according to claim 2, characterized in that the parsing of the web page file specifically comprises:

creating a tag-locating variable for extracting the preset data from the HTML;

determining whether the raw data to be extracted contains nested tags; if it does, performing layered matching, layer by layer, until a text tag is matched; if it does not, matching the text tag directly;

returning the successfully matched text tags as a tuple container, the tuple container consisting of a root node and an ID dictionary, in which each dictionary key is the value of a defined ID attribute and the element referenced by that ID is stored as the dictionary value;

4. The document format conversion method applied to magnitude comparison according to claim 3, characterized in that the data cleaning specifically comprises: parsing the HTML document from a string constant into tree-structured nodes, whose hierarchy is expressed as nodes, parent nodes and child nodes, each node containing its corresponding tag; locating the tag element by string matching, extracting the tag and its text, and assigning them to a variable; then deleting from the tree or subtree all elements with the given tag names together with their attributes, and merging the text content and child elements of each deleted element into its parent element; serializing the element to an encoded string representation of its XML tree, which defaults to ASCII encoding, where changing the encoding to one that is not UTF-8 compatible enables the declaration by default; and then placing the extracted elements in a container, retrieving them by dictionary ID, and cleaning them up.

5. The document format conversion method applied to magnitude comparison according to claim 3, characterized in that, for tag nodes with a nesting relationship, the tag markup is removed during layer-by-layer processing.

6. A storage medium, characterized in that a program is stored on the storage medium, and the program, when run, executes the method according to any one of claims 1 to 5.

7. A processor, characterized in that the method according to any one of claims 1 to 5 is executed when the program is run.