CN109492201A - Document format conversion method applied to magnitude comparison - Google Patents
- Publication number: CN109492201A
- Application number: CN201811327145.7A
- Authority: CN (China)
- Prior art keywords: data, label, format conversion, file, node
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/154—Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
The present invention provides a document format conversion method applied to magnitude comparison. The method comprises: parsing a web page file, extracting the key first data from it, and grouping that data; cleaning the grouped first data to extract more concise second data; storing the extracted second data in a file that takes up less space than the original and in which the data can be inspected directly; and storing such files in batches for use in magnitude comparison. The whole process is fast and efficient, and it solves the technical problem in the prior art that a file in its original web page form cannot be presented directly in CSV format. It achieves the technical effect of a format conversion that quickly processes large numbers of data files.
Description
Technical field
The present invention relates to the technical field of web page file processing, and in particular to a document format conversion method applied to magnitude comparison.
Background art
The Internet is building a new model of development and operation for industry worldwide, driving comprehensive innovation in industrial organization, service models and business models, and accelerating industrial transformation and upgrading. HyperText Markup Language (HTML) is the common language of the web: a simple, general-purpose markup language. It lets web page authors build complex pages that combine text and pictures, and these pages can be viewed by anyone else online, whatever computer or browser they use. However, the data files downloaded from existing websites are HTML files, which are inconvenient for subsequent processing.
Summary of the invention
In view of the technical problem set out above, a document format conversion method applied to magnitude comparison is provided. The technical means adopted by the present invention are as follows:
A document format conversion method applied to magnitude comparison, comprising the following steps:
S1: reading the basic information of a web page file;
S2: parsing the web page file and extracting the required first data from it;
S3: performing data cleaning on the first data to extract more concise second data;
S4: storing the extracted second data in a file that takes up less space than the original and in which the data can be inspected directly;
S5: performing batch storage processing on the files for use in magnitude comparison.
Further, in step S1, the web page file is HTML code consisting of a string of characters.
Further, parsing the web page file specifically comprises:
creating a tag-locating variable for extracting the preset data from the HTML;
determining whether the raw data to be extracted is wrapped in nested tags; if it is, matching layer by layer until the text tag is reached; if it is not, matching the text tag directly;
returning the successfully matched text tag as a tuple container made up of the root node and an ID dictionary, in which each dictionary key is the value of a defined ID attribute and each dictionary value is a reference to the element carrying that ID;
extracting the raw data according to the keys and values of the dictionary to obtain the first data.
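The tuple of root node and ID dictionary described above can be illustrated with a short sketch. The patent does not name a library; the example assumes Python and the standard library function `xml.etree.ElementTree.XMLID`, which happens to return exactly such a pair. The sample markup, the `row` ID prefix and the helper `first_data` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real web pages are often not
# well-formed XML and may need an HTML-tolerant parser instead.
PAGE = """<html><body>
<table id="results">
  <tr id="row1"><td>Lab A</td><td>10.002</td></tr>
  <tr id="row2"><td>Lab B</td><td>10.005</td></tr>
</table>
</body></html>"""

# XMLID returns a tuple: (root element, {id attribute value: element}).
root, id_map = ET.XMLID(PAGE)


def first_data(id_map, prefix="row"):
    """Collect the cell text of every element whose id starts with `prefix`."""
    rows = {}
    for key, element in id_map.items():
        if key.startswith(prefix):
            rows[key] = [(cell.text or "").strip() for cell in element.iter("td")]
    return rows


rows = first_data(id_map)
```

Here `rows` plays the role of the first data: it is keyed by ID and holds only the text that was wrapped in the matched tags.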
Further, the data cleaning specifically comprises: parsing the HTML document from a string constant into a tree of nodes, in which the hierarchy is expressed through nodes, parent nodes and child nodes and each node carries its corresponding tag; locating the tag element by string matching, extracting the tag and its text and assigning them to a variable; then deleting from the tree or subtree all elements with the given tag names together with their attributes, and merging the text content and child elements of each deleted element into its parent element; serializing the element to an encoded string representation of its XML tree, which defaults to ASCII encoding, and changing the encoding to a non-UTF-8-compatible encoding, which enables the declaration by default; and finally placing the extracted elements into a container, from which they are extracted by dictionary ID and cleaned up.
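The step of deleting tagged elements while merging their text and child elements into the parent can be sketched as follows. This is an illustrative implementation on Python's standard `xml.etree.ElementTree`, not the patent's own code; the function names `strip_tags`, `_unwrap` and `_append_text` and the sample cell are assumptions.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach `text` immediately before child position `index` of `parent`."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def _unwrap(parent, index):
    """Remove parent[index] but keep its text, children and tail in place."""
    child = parent[index]
    grandchildren = list(child)
    del parent[index]
    _append_text(parent, index, child.text)
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)
    _append_text(parent, index + len(grandchildren), child.tail)


def strip_tags(parent, *tag_names):
    """Unwrap every descendant of `parent` whose tag is in `tag_names`.

    The element passed in is itself never removed.
    """
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag in tag_names:
            _unwrap(parent, i)   # spliced children are re-examined at index i
        else:
            strip_tags(child, *tag_names)
            i += 1


# Hypothetical cell with two layers of presentational markup.
cell = ET.fromstring('<td><span class="v"><b>10.002</b> mm</span></td>')
strip_tags(cell, "span", "b")
cleaned = ET.tostring(cell, encoding="unicode")
```

After the call, the `span` and `b` elements and their attributes are gone and the cell holds only the merged text, which is the form that is later written out.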
Further, for tag nodes in a nesting relationship, the tag markup is removed during the layer-by-layer processing.
The present invention also provides a storage medium on which a program is stored, wherein the above method is executed when the program runs.
The present invention also provides a processor, wherein the above method is executed when the program runs.
Compared with the prior art, the present invention has the following advantages:
The whole process of the proposed method is fast and efficient, and it solves the technical problem in the prior art that a file in its original web page form cannot be presented directly in CSV format. It achieves the technical effect of a format conversion that quickly processes large numbers of data files. The invention also cleans the data, which makes it easier to inspect directly. During format conversion the number of bytes occupied by the data decreases, so the overall footprint shrinks, storage becomes easier and space is saved. The method can also process files in batches, which simplifies file management.
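The claim that the converted data occupies fewer bytes can be checked on a small example. The sketch below assumes Python's standard `csv` module; the two-row table and the helper `csv_bytes` are hypothetical and only compare the byte counts of the same values in the two formats.

```python
import csv
import io

# The same two records, once as HTML markup and once as plain rows.
HTML_TABLE = (
    "<table>"
    "<tr><td>Lab A</td><td>10.002</td></tr>"
    "<tr><td>Lab B</td><td>10.005</td></tr>"
    "</table>"
)
ROWS = [["Lab A", "10.002"], ["Lab B", "10.005"]]


def csv_bytes(rows):
    """Serialize rows as CSV and return the UTF-8 encoded bytes."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8")


html_size = len(HTML_TABLE.encode("utf-8"))
csv_size = len(csv_bytes(ROWS))
saving = html_size - csv_size  # bytes saved by dropping the markup
```

The saving comes entirely from the removed tags, so it grows with the amount of markup wrapped around each value.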
For the above reasons, the present invention can be widely applied in the field of web page file processing.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the document format conversion method applied to magnitude comparison according to the present invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. Data so described are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
The present invention, a document format conversion method applied to magnitude comparison, can process data quickly and effectively and improves the accuracy of the data. Before a magnitude comparison is performed, converting and processing the file format is essential: it determines how quickly and efficiently the data can be processed and how accurate the data is.
As shown in Fig. 1, the present invention provides a document format conversion method applied to magnitude comparison, comprising the following steps:
Before a web page is parsed it is hard to obtain the data in an intuitive form, which strongly affects subsequent operations and makes the quality of processing hard to guarantee. Processing the data slowly and inefficiently wastes a great deal of time and labor. Therefore:
S1: the basic information of the web page file is read;
S2: the web page file is parsed and the required first data is extracted from it;
If the data is not cleaned and processed accurately, the subsequent results are strongly affected and their accuracy drops. Therefore:
S3: data cleaning is performed on the first data to extract more concise second data;
S4: the extracted second data is stored in a file that takes up less space than the original and in which the data can be inspected directly;
S5: batch storage processing is performed on the files for use in magnitude comparison.
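The storage and batch steps at the end of this list can be sketched as a loop over a directory. This is a minimal illustration in Python, assuming the web page files are well-formed enough for `xml.etree.ElementTree` and that the data of interest sits in table cells; `extract_rows`, `convert_directory` and the `*.html` naming are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_rows(html_text):
    """Stand-in for reading, parsing and cleaning: cell text per table row."""
    root = ET.fromstring(html_text)
    return [
        ["".join(cell.itertext()).strip() for cell in row if cell.tag in ("td", "th")]
        for row in root.iter("tr")
    ]


def convert_directory(src_dir, dst_dir):
    """Write one CSV per HTML file found in src_dir; return the paths written."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(src.glob("*.html")):
        rows = extract_rows(html_path.read_text(encoding="utf-8"))
        csv_path = dst / (html_path.stem + ".csv")
        # newline="" lets the csv module control line endings itself.
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(csv_path)
    return written
```

Calling `convert_directory("pages", "converted")` on a folder of downloaded pages would leave one CSV per page, each holding only the cleaned values, ready to be loaded side by side for comparison.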
In step S1, the web page file is HTML code consisting of a string of characters. The source file is a code file written in HTML, and HTML code is made up of a large number of tags. Because tags are nested in layers and classified by attributes, a tag is located by creating an expression that finds it; data with similar tags is then placed in a container, the data in the container is cleaned and organized, and the extracted data is written to a file in the target format.
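The flow in this paragraph (locate tags, collect similar data in a container, clean it, write it in the target format) can be sketched with Python's standard `html.parser`, which tolerates markup that is not well-formed XML. The class `CellCollector`, the function `html_to_csv` and the choice of table cells as the located tags are assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Locate <tr>/<td>/<th> tags and collect cell text, one list per row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []          # the container: one list of cell strings per row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the open cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def html_to_csv(html_text):
    """Return the table cells of html_text as CSV text."""
    collector = CellCollector()
    collector.feed(html_text)
    collector.close()
    # Cleaning: collapse runs of whitespace and drop rows without cells.
    cleaned = [[" ".join(cell.split()) for cell in row] for row in collector.rows if row]
    out = io.StringIO(newline="")
    csv.writer(out).writerows(cleaned)
    return out.getvalue()
```

A real page would probably need a narrower locating rule, for example matching on a class or ID attribute passed in through `attrs`, so that only the table holding the measured values is collected.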
Parsing the web page file specifically comprises:
creating a tag-locating variable for extracting the preset data from the HTML;
determining whether the raw data to be extracted is wrapped in nested tags; if it is, matching layer by layer until the text tag is reached; if it is not, matching the text tag directly;
returning the successfully matched text tag as a tuple container (that is, root node and ID dictionary), in which each dictionary key is the value of a defined ID attribute and each dictionary value is a reference to the element carrying that ID; by means of the parser keyword argument, the file can be parsed with a different parser object;
extracting the raw data according to the keys and values of the dictionary to obtain the first data.
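The nesting check in the second sub-step can be read as a recursive descent that stops at the innermost tag holding text. The sketch below is one possible reading in Python using the standard `xml.etree.ElementTree`; the function `match_text_tags` and the sample cells are hypothetical.

```python
import xml.etree.ElementTree as ET


def match_text_tags(element, path=()):
    """Yield (tag path, text) for every innermost element under `element`."""
    path = path + (element.tag,)
    children = list(element)
    if not children:
        # No nesting at this level: this element is the text tag.
        yield path, (element.text or "").strip()
        return
    for child in children:
        # Nesting found: match one layer deeper.
        yield from match_text_tags(child, path)


nested = ET.fromstring("<td><div><span>10.002</span></div></td>")
flat = ET.fromstring("<td>10.005</td>")

nested_matches = list(match_text_tags(nested))
flat_matches = list(match_text_tags(flat))
```

The recorded tag path shows how many layers were matched before the text was reached, which can help when checking that the locating rule picked the intended element.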
The data cleaning specifically comprises: parsing the HTML document from a string constant into a tree of nodes, in which the hierarchy is expressed through nodes, parent nodes and child nodes and each node carries its corresponding tag; locating the tag element by string matching, extracting the tag and its text and assigning them to a variable; then deleting from the tree or subtree all elements with the given tag names together with their attributes, and merging the text content and child elements of each deleted element into its parent element; serializing the element to an encoded string representation of its XML tree, which defaults to ASCII encoding, and changing the encoding to a non-UTF-8-compatible encoding, which enables the declaration by default; and finally placing the extracted elements into a container, from which they are extracted by dictionary ID and cleaned up.
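The serialization behavior mentioned here (ASCII by default, a declaration appearing once a different encoding is chosen) can be observed with Python's standard `xml.etree.ElementTree.tostring`. Whether the patent relies on this particular library is an assumption; the sample cell and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# "\u00b5" is the micro sign, a character outside the ASCII range.
cell = ET.fromstring("<td>10.002 \u00b5m</td>")

# Default: ASCII bytes, with non-ASCII characters written as character
# references and no XML declaration.
ascii_bytes = ET.tostring(cell)

# UTF-8 output also carries no declaration by default.
utf8_bytes = ET.tostring(cell, encoding="utf-8")

# Any other byte encoding makes the serializer prepend an XML declaration
# naming that encoding, so a reader can decode the bytes correctly.
latin1_bytes = ET.tostring(cell, encoding="iso-8859-1")
```

When the cleaned text is later written to CSV, the declaration line is not needed; the point of the sketch is only that the chosen encoding changes what the serialized bytes look like.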
For tag nodes in a nesting relationship, the tag markup is removed during the layer-by-layer processing; otherwise it affects the data and leaves it disordered.
In this embodiment, a storage medium is provided. The storage medium contains a stored program, and when the program runs, the device in which the storage medium is located is controlled to execute the above method.
In this embodiment, a processor is also provided. The processor runs a program, and when the program runs, the device in which the processor is located is controlled to execute the above method.
The serial numbers of the above embodiments are for description only and do not indicate that one embodiment is better or worse than another.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, see the related description of the other embodiments.
The functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied as a software product. The software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (7)
1. A document format conversion method applied to magnitude comparison, characterized by comprising the following steps:
S1: reading the basic information of a web page file;
S2: parsing the web page file and extracting the required first data from it;
S3: performing data cleaning on the first data to extract more concise second data;
S4: storing the extracted second data in a file that takes up less space than the original and in which the data can be inspected directly;
S5: performing batch storage processing on the files for use in magnitude comparison.
2. The document format conversion method applied to magnitude comparison according to claim 1, characterized in that, in step S1, the web page file is HTML code consisting of a string of characters.
3. The document format conversion method applied to magnitude comparison according to claim 2, characterized in that parsing the web page file specifically comprises:
creating a tag-locating variable for extracting the preset data from the HTML;
determining whether the raw data to be extracted is wrapped in nested tags; if it is, matching layer by layer until the text tag is reached; if it is not, matching the text tag directly;
returning the successfully matched text tag as a tuple container made up of the root node and an ID dictionary, in which each dictionary key is the value of a defined ID attribute and each dictionary value is a reference to the element carrying that ID;
extracting the raw data according to the keys and values of the dictionary to obtain the first data.
4. The document format conversion method applied to magnitude comparison according to claim 3, characterized in that the data cleaning specifically comprises: parsing the HTML document from a string constant into a tree of nodes, in which the hierarchy is expressed through nodes, parent nodes and child nodes and each node carries its corresponding tag; locating the tag element by string matching, extracting the tag and its text and assigning them to a variable; then deleting from the tree or subtree all elements with the given tag names together with their attributes, and merging the text content and child elements of each deleted element into its parent element; serializing the element to an encoded string representation of its XML tree, which defaults to ASCII encoding, and changing the encoding to a non-UTF-8-compatible encoding, which enables the declaration by default; and finally placing the extracted elements into a container, from which they are extracted by dictionary ID and cleaned up.
5. The document format conversion method applied to magnitude comparison according to claim 3, characterized in that, for tag nodes in a nesting relationship, the tag markup is removed during the layer-by-layer processing.
6. A storage medium, characterized in that a program is stored on the storage medium, and the method according to any one of claims 1 to 5 is executed when the program runs.
7. A processor, characterized in that the method according to any one of claims 1 to 5 is executed when the program runs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811327145.7A CN109492201A (en) | 2018-11-08 | 2018-11-08 | Document format conversion method applied to magnitude comparison |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109492201A true CN109492201A (en) | 2019-03-19 |
Family
ID=65694104
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811327145.7A Pending CN109492201A (en) | 2018-11-08 | 2018-11-08 | Document format conversion method applied to magnitude comparison |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109492201A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008102773A (en) * | 2006-10-19 | 2008-05-01 | R & W:Kk | Method for converting data into common format |
CN103870487A (en) * | 2012-12-13 | 2014-06-18 | 腾讯科技(深圳)有限公司 | Webpage file processing method and mobile terminal |
CN108446136A (en) * | 2018-03-22 | 2018-08-24 | 北京焦点新干线信息技术有限公司 | A kind of extracting method and system of element code |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111125024A (en) * | 2019-11-29 | 2020-05-08 | 浪潮电子信息产业股份有限公司 | Method, device, equipment and storage medium for deleting distributed system files |
CN111125024B (en) * | 2019-11-29 | 2022-05-24 | 浪潮电子信息产业股份有限公司 | Method, device, equipment and storage medium for deleting distributed system files |
US12001397B2 (en) | 2019-11-29 | 2024-06-04 | Inspur Electronic Information Industry Co., Ltd. | Method, apparatus and device for deleting distributed system file, and storage medium |
CN116127079A (en) * | 2023-04-20 | 2023-05-16 | 中电科大数据研究院有限公司 | Text classification method |
CN116127079B (en) * | 2023-04-20 | 2023-06-20 | 中电科大数据研究院有限公司 | Text classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190319 |