CN110489628A - Data processing method, device and electronic equipment - Google Patents

Info

Publication number
CN110489628A
Authority
CN
China
Prior art keywords
data
template
field
page
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910777066.4A
Other languages
Chinese (zh)
Inventor
贾艾婧
张丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN201910777066.4A
Publication of CN110489628A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a data processing method, a device and electronic equipment. The data processing method determines the task type of a website to be collected, determines the meta template corresponding to the task type from a preset meta template database, configures template data according to the preset extraction fields in the meta template, and then generates a template file from the meta template and the template data. The method can thus customize, for websites to be collected of different task types, a template file matched to each website. It avoids both prior-art approaches, in which either websites of every task type share one identical template file or a separate template file must be drawn up for every website, and so reduces manual involvement. Delivering the template file generated by the method to an acquisition system enables automated collection and processing of raw data and improves work efficiency.

Description

Data processing method, device and electronic equipment
Technical field
The present invention relates to the technical field of information processing, and in particular to a data processing method, a device and electronic equipment.
Background
With the arrival of the internet big-data era, the amount of data on the network is growing exponentially, and obtaining the vast data resources of the internet depends on data collection.
Collecting public data from the internet by various methods has become an important step in application scenarios such as scientific research and business activities, but collected raw page data only has practical value after it has undergone data extraction processing.
Common methods of extracting raw page data fall into two categories: one extracts according to a fixed format; the other draws up a separate format for each case.
However, a fixed format is too rigid and a separately drawn-up format is too flexible, so neither meets the need to process large amounts of raw data. A data processing method for collecting data from raw web pages is therefore needed.
Summary of the invention
The present invention provides a data processing method, a device and electronic equipment, to solve the prior-art problem that data collection formats are either too rigid or too flexible to handle large amounts of raw data.
In a first aspect, the present invention provides a data processing method, comprising:
determining the task type of a website to be collected, and determining, according to a preset meta template database, the meta template corresponding to the task type, the task type being used to characterize attributes of the website to be collected;
configuring template data according to the preset extraction fields in the meta template, the template data being the data in the website to be collected that matches the preset extraction fields;
generating a template file according to the meta template and the template data.
In one possible design, before the configuring of template data according to the preset extraction fields in the meta template, the method comprises:
obtaining page hierarchy data of the meta template;
obtaining configuration data of each page according to the page hierarchy data, the configuration data comprising field structure information, field name information and field attribute information;
determining the preset extraction fields in the meta template according to the configuration data.
In one possible design, the obtaining of the page hierarchy data of the meta template comprises:
obtaining a first JSON file of the meta template;
parsing a first multi-way tree structure in the first JSON file;
traversing the first multi-way tree structure to determine the page hierarchy data, and establishing a correspondence between the page hierarchy data and each page in the website to be collected.
In one possible design, the obtaining of the configuration data of each page according to the page hierarchy data comprises:
obtaining the configuration data of each page according to the first multi-way tree structure.
In one possible design, the determining of the task type of the website to be collected comprises:
determining the task type according to at least one of the page structure information, the page field information and the page field attributes of the website to be collected.
In one possible design, the page structure information comprises the number of branches of height 1 in the first multi-way tree structure, where each branch of height 1 in the first multi-way tree structure corresponds to one page;
the page field information comprises the number of branches of height 2+n in the first multi-way tree structure, where each branch of height 2+n in the first multi-way tree structure corresponds to one page field and n is a natural number;
the page field attributes are the attribute information corresponding to each page field, the attribute information comprising a plurality of attribute fields.
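The branch counts described in this design can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: "height" is read here as the level of a node counted from the root (root at level 0), n is taken to start at 0, and the nested-dictionary schema with `name` and `children` keys is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterator, Tuple


def iter_levels(node: dict, level: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (level, name) for every node of a multi-way tree, depth-first."""
    yield level, node["name"]
    for child in node.get("children", []):
        yield from iter_levels(child, level + 1)


def branch_counts(tree: dict) -> Dict[str, int]:
    """Count pages (level 1) and page fields (level 2+n, n = 0, 1, 2, ...)."""
    per_level = Counter(level for level, _ in iter_levels(tree))
    return {
        "pages": per_level[1],
        "page_fields": sum(c for level, c in per_level.items() if level >= 2),
    }


# Hypothetical meta template tree for a news site: a list page and a detail page.
news_tree = {
    "name": "news",
    "children": [
        {"name": "list_page", "children": [{"name": "url"}, {"name": "title"}]},
        {"name": "detail_page", "children": [
            {"name": "title"}, {"name": "text"}, {"name": "time"},
        ]},
    ],
}

counts = branch_counts(news_tree)
```

Here `counts["pages"]` plays the role of the page structure information and `counts["page_fields"]` that of the page field information.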
In one possible design, the configuring of template data according to the preset extraction fields in the meta template comprises:
configuring the preset extraction fields according to a preset algorithm, to generate field expressions;
determining, according to the field attribute information, whether each field expression is valid, the result of the determination being yes.
In one possible design, the generating of a template file according to the meta template and the template data comprises:
synthesizing a second JSON file according to the page hierarchy data and the template data;
parsing a second multi-way tree structure in the second JSON file;
generating an XML file according to the second multi-way tree structure, the template file being the XML file.
In a second aspect, the present invention provides a data processing device, comprising:
a determining module, configured to determine the task type of a website to be collected and to determine, according to a preset meta template database, the meta template corresponding to the task type, the task type being used to characterize attributes of the website to be collected;
a processing module, configured to configure template data according to the preset extraction fields in the meta template, the template data being the data in the website to be collected that matches the preset extraction fields;
a generation module, configured to generate a template file according to the meta template and the template data.
In a third aspect, the present invention provides electronic equipment, comprising:
a processor; and
a memory, configured to store instructions executable by the processor;
wherein the processor is configured to perform the above data processing method by executing the executable instructions.
The data processing method, device and electronic equipment provided by the present invention determine the task type of a website to be collected, determine the meta template corresponding to the task type from a preset meta template database, configure template data according to the preset extraction fields in the meta template, and then generate a template file from the meta template and the template data. The template file enables automated collection and processing of raw data, reduces manual involvement, and overcomes the prior-art problem that data collection formats are either too rigid or too flexible to handle large amounts of raw data.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data processing method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart, provided by an embodiment of the present invention, of configuring template data according to the preset extraction fields in a meta template;
Fig. 3 is a schematic flowchart, provided by an embodiment of the present invention, of generating a template file according to a meta template and template data;
Fig. 4 is a schematic structural diagram of a data processing device provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of electronic equipment provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention. The terms "first", "second", "third", "fourth" and the like (if present) in the description, the claims and the above drawings are used to distinguish similar objects and do not describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can, for example, be carried out in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
With the arrival of the internet big-data era, the amount of data on the network is growing exponentially. Faced with such vast data resources, raw page data only has practical value after it has undergone data extraction processing. In the prior art, one kind of extraction method handles only one kind of data with a fixed format and cannot handle data of a different kind; for example, a format fixed for processing news data cannot be used to process forum data. The other kind uses data processing formats that are too flexible to be grouped into one class of pattern; for example, news data from different websites is processed with different formats that cannot be unified, whereas it would be preferable to use the same data processing format for news data regardless of the website. Prior-art data processing methods therefore cannot meet the current need to collect and process large amounts of raw page data.
To address the above problems in the prior art, the data processing method, device and electronic equipment provided by the present invention determine the task type of a website to be collected, determine the meta template corresponding to the task type from a preset meta template database, configure template data according to the preset extraction fields in the meta template, and then generate a template file from the meta template and the template data. The template file is delivered to an acquisition system for data collection, which closes the data collection loop, enables automated collection and processing of raw data, reduces manual involvement, and overcomes the prior-art problem that data processing formats are either too rigid or too flexible to handle large amounts of raw data.
The database involved in the embodiments of the present invention may be a MySQL database.
The technical solution of the present invention is described in detail below with specific embodiments. These specific embodiments may be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present invention are described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a data processing method provided by an embodiment of the present invention. The method of this embodiment is executed by electronic equipment, which may be a desktop computer, a notebook, a tablet computer, a mobile phone or the like; specifically, a processor in the electronic equipment may be configured to execute the method. As shown in Fig. 1, the data processing method provided by this embodiment comprises the following steps:
S101: determine the task type of the website to be collected, and determine, according to the preset meta template database, the meta template corresponding to the task type.
The task type is used to characterize attributes of the website to be collected.
The website to be collected may be a website of any type, for example a news website, a forum website or a blog.
The preset meta template database may contain meta templates that those skilled in the art have set up from experience for different types of website. For example, the page structure of a news website usually consists of a title and a text; the title generally names a person and an event, and the text begins with time and place information. The page structure information can therefore be set to title and text, and the page field information to, for instance, a title made up of a person and an event, and a template of this kind is preset as the news website meta template. The page structure information, page field information and page field attributes of other types of website, such as forum websites and blogs, can be set in a similar way. In short, the page structure information, page field information and page field attributes of different types of website can be established from experience to form the meta template database.
Optionally, the page structure information comprises the number of branches of height 1 in the first multi-way tree structure, where each branch of height 1 in the first multi-way tree structure corresponds to one page.
Optionally, the page field information comprises the number of branches of height 2+n in the first multi-way tree structure, where each branch of height 2+n in the first multi-way tree structure corresponds to one page field and n is a natural number.
Optionally, the page field attributes are the attribute information corresponding to each page field, where the attribute information comprises a plurality of attribute fields. For example, the attribute fields include: A: whether the field is required; B: whether the field is displayed during testing; C: the field type, such as string (String, hereinafter str), node (Node, hereinafter node) or uniform resource locator (Uniform Resource Locator, hereinafter url); D: the landing field name, that is, the name of the landing field to which the current field corresponds; E: whether the field may be empty; F: the limit on the number of field expressions; G: the default value of the field expression.
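The attribute fields A to G can be pictured as a small record type. The following Python sketch is illustrative only: the class name, member names and the validation rule are assumptions, chosen to mirror the seven attributes listed above.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_TYPES = {"str", "node", "url"}  # field types named in attribute C


@dataclass(frozen=True)
class FieldAttributes:
    """Hypothetical container for the attribute fields A-G of one page field."""
    required: bool                            # A: whether the field is required
    show_in_test: bool                        # B: whether it is displayed during testing
    field_type: str                           # C: "str", "node" or "url"
    landing_name: str                         # D: name of the corresponding landing field
    allow_empty: bool                         # E: whether the field may be empty
    max_expressions: Optional[int] = None     # F: limit on the number of expressions
    default_expression: Optional[str] = None  # G: default value of the expression

    def __post_init__(self) -> None:
        # Reject field types other than the three mentioned in the text.
        if self.field_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported field type: {self.field_type!r}")


title_attrs = FieldAttributes(
    required=True,
    show_in_test=True,
    field_type="str",
    landing_name="article_title",  # hypothetical landing field name
    allow_empty=False,
    max_expressions=1,
)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch: attribute information comes from the meta template database and is only read during configuration.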
In one possible design, the task type of the website to be collected may be determined according to at least one of the page structure information, the page field information and the page field attributes of the website to be collected.
Specifically, on receiving the task for a website to be collected, those skilled in the art can identify from experience at least one of the page structure information, page field information and page field attributes of the website, and the task type of the website can be determined from the identified attributes. For example, if each page of the website to be collected is presented as a title plus a text, the title carrying person and event information and the text beginning with time and place information, the task type of the website can be judged to be a news website. As another example, if the pages of the website to be collected carry different labels, such as emotion, workplace, family or finance, and clicking a label opens a page of articles classified under that label, the task type of the website can be judged to be a forum website.
The meta template corresponding to the task type is then determined according to the preset meta template database. Once the task type of the website to be collected has been determined, the meta template corresponding to that task type can be selected from the preset meta template database, which determines the meta template for the website to be collected. For example, if the task type of the website to be collected has been determined to be a news website, the news website meta template is selected from the preset meta template database as the meta template of the website.
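Step S101 can be sketched as a rule check followed by a lookup. The text describes the identification as experience-based, so the rules below are a toy stand-in that only mirrors the two examples given; the in-memory dictionary is a hypothetical substitute for the preset meta template database, and all names are assumptions.

```python
from typing import Dict, List, Set

# Hypothetical in-memory stand-in for the preset meta template database.
META_TEMPLATE_DB: Dict[str, Dict[str, List[str]]] = {
    "news": {"pages": ["list", "detail"],
             "fields": ["title", "text", "time", "place"]},
    "forum": {"pages": ["label", "thread"],
              "fields": ["label", "title", "text"]},
}


def determine_task_type(page_fields: Set[str]) -> str:
    """Toy rules mirroring the news and forum examples; real rules would be richer."""
    if "label" in page_fields:
        return "forum"  # pages are organized by labels such as emotion or workplace
    if {"title", "text"} <= page_fields:
        return "news"   # each page is presented as a title plus a text
    raise ValueError("no known task type matches these page fields")


def select_meta_template(task_type: str) -> Dict[str, List[str]]:
    """Look up the meta template for a task type in the preset database."""
    return META_TEMPLATE_DB[task_type]


task_type = determine_task_type({"title", "text", "time"})
meta_template = select_meta_template(task_type)
```

The label check comes first because a forum page may also carry titles and texts; the order of the rules is part of the assumption.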
S102: configure template data according to the preset extraction fields in the meta template.
The template data is the data in the website to be collected that matches the preset extraction fields.
After the meta template corresponding to the task type of the website to be collected has been determined, template data is configured according to the preset extraction fields. The configuration may proceed by formulating a specific rule or preset algorithm and passing the preset extraction fields through it to obtain a corresponding output. For example, the output may be an expression corresponding to the rule or algorithm, the expression carrying the various field attribute information of the preset extraction field. It is then determined whether the output satisfies the field attribute information in the preset meta template database. If it does, the configuration of template data according to the preset extraction fields in the meta template is complete; if not, the process is repeated until the output for the preset extraction field satisfies the field attribute information in the preset meta template database.
S103: generate a template file according to the meta template and the template data.
After the meta template corresponding to the website to be collected has been determined from the preset meta template database and template data has been configured according to the preset extraction fields in that meta template, a template file is generated from the meta template and the template data. The template file is a file in the standard format that is handed to the acquisition system for collection once the data processing method has finished, for example an XML file. The format of the generated template file is determined by the standard format the acquisition system requires, and different file formats have different generation methods; this embodiment does not limit this.
Delivering the generated template file to the acquisition system for collection closes the data collection loop and enables automated data collection. The workflow of the acquisition system is prior art and is not described further here.
In the data processing method provided by this embodiment, the task type of the website to be collected is determined first, the meta template corresponding to the task type is determined from the preset meta template database, template data is configured according to the preset extraction fields in the meta template, and a template file is generated from the meta template and the template data. Websites to be collected of different task types can thus each receive a customized template file matched to the website, without being bound to the prior-art approaches in which websites of every task type share one identical template file or a separate template file must be drawn up for every website. This reduces manual involvement. Delivering the template file generated by the method to the acquisition system enables automated collection and processing of raw data and improves work efficiency.
On the basis of the embodiment shown in Fig. 1, optionally, before step S102 the data processing method provided by the embodiment of the present invention may comprise the following steps:
S201: obtain the page hierarchy data of the meta template.
After the meta template of the website to be collected has been determined, the page hierarchy data of the current meta template is read. For example, the page hierarchy data is read to judge whether the hierarchy is title-text, title-title-text or some other hierarchy.
In one possible design, obtaining the page hierarchy data of the meta template comprises the following steps:
S2011: obtain the first JSON file of the meta template.
After the meta template of the website to be collected has been determined, the JSON file of the current template is loaded, which yields the first JSON file of the meta template.
S2012: parse the first multi-way tree structure in the first JSON file.
Once the first JSON file of the meta template has been obtained, the first multi-way tree structure it contains is parsed.
S2013: traverse the first multi-way tree structure to determine the page hierarchy data, and establish a correspondence between the page hierarchy data and each page in the website to be collected.
The first multi-way tree structure is traversed to determine the page hierarchy data of the current meta template, for example whether the page hierarchy of the current meta template is title-text or title-title-text. The page hierarchy data obtained is then mapped onto the relevant pages of the website to be collected, that is, a correspondence is established between the page hierarchy data and each page in the website to be collected.
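Steps S2011 to S2013 can be sketched as follows. The patent does not disclose the JSON schema of the meta template, so the `name`/`children` layout, the inline JSON string standing in for the first JSON file, and the example URLs are all assumptions made for illustration.

```python
import json
from typing import Dict, List

# Hypothetical content of the first JSON file of a news meta template.
FIRST_JSON = """
{"name": "news", "children": [
  {"name": "title", "children": [
    {"name": "text", "children": []}
  ]}
]}
"""


def page_hierarchy(root: dict) -> List[str]:
    """Pre-order traversal returning the page levels below the root, top level first."""
    names: List[str] = []
    stack = list(reversed(root.get("children", [])))
    while stack:
        current = stack.pop()
        names.append(current["name"])
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(current.get("children", [])))
    return names


def map_pages(hierarchy: List[str], site_pages: Dict[str, str]) -> Dict[str, str]:
    """Associate each level of the hierarchy with a page of the website."""
    return {name: site_pages[name] for name in hierarchy}


tree = json.loads(FIRST_JSON)      # S2011 and S2012: load and parse
hierarchy = page_hierarchy(tree)   # S2013: traverse the multi-way tree
mapping = map_pages(hierarchy, {
    "title": "https://news.example.com/list",       # hypothetical list page
    "text": "https://news.example.com/article/1",   # hypothetical article page
})
```

Joining the result with hyphens gives the "title-text" hierarchy label used in the description.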
S202: obtain the configuration data of each page according to the page hierarchy data.
The configuration data comprises field structure information, field name information and field attribute information.
After the page hierarchy data of the meta template has been obtained, the configuration data of each page of the current meta template, that is, the field structure information, field name information and field attribute information of each page, is obtained according to the page hierarchy data. For example, for a page at the title level of the hierarchy, the configuration data may be the names and attribute information of fields such as collection region, each data item, URL and title. For a page at the text level of the hierarchy, the configuration data may be the names and attribute information of fields such as collection region, title, text and time.
Optionally, following the structure of the first multi-way tree, all field structure information of each page is read from the first JSON file, together with the field name information and all field attribute information within each field structure.
S203: determine the preset extraction fields in the meta template according to the configuration data.
Once the meta template corresponding to the website to be collected has been determined, all information on the pages of the website is in effect laid out item by item in the format of the preset meta template database. After the configuration data of each page of the current meta template has been obtained, it can be determined whether preset extraction fields exist in the current meta template. If they exist, the preset extraction fields in the meta template are determined; if not, the website to be collected is not a website from which the data targeted by the preset extraction fields is to be collected.
In this embodiment, before template data is configured according to the preset extraction fields in the meta template, the page hierarchy data of the current meta template and the configuration data of each page in the page hierarchy, that is, the field structure information, field name information and field attribute information, are obtained, and the preset extraction fields in the meta template are determined according to the configuration data. This shows whether the website to be collected is a website from which the data targeted by the preset extraction fields is to be collected. While a meta template is customized for the website to be collected, websites of no use are discarded, which improves work efficiency.
On the basis of the above embodiments, one possible implementation of step S102 is shown in Fig. 2, which is a schematic flowchart, provided by an embodiment of the present invention, of configuring template data according to the preset extraction fields in a meta template. The implementation comprises the following steps:
S1021: configure the preset extraction fields using a preset algorithm, to generate field expressions.
The preset algorithm is the set of rules that the configuration of the preset extraction fields must follow, and the field expressions are generated accordingly.
Optionally, the preset algorithm may be XPath and/or regular expressions.
XPath is a path language, that is, a language for locating a particular part of a document. A regular expression, also called a regex, is used to search for and replace text that matches a pattern or rule. The preset extraction fields are configured with XPath and/or regular expressions to generate the field expressions: the data for a preset extraction field is selected in the website to be collected and configured with XPath and/or a regular expression, which produces the expression corresponding to that preset extraction field.
S1022: determine, according to the field attribute information, whether the field expression is valid, the result of the determination being yes.
The field attribute information is the field attribute information of each page and includes the attribute fields described above, such as whether the field is required, whether the field is displayed during testing, the field type (str, node or url) and the landing field name. It is determined whether the field expression generated for the preset extraction field by the XPath and/or regular expression configuration satisfies the field attribute information in the preset meta template database. If it does, the expression is valid, the configuration of template data according to the preset extraction fields in the meta template is complete, and S103 is executed; the template data is the data in the website to be collected that matches the preset extraction fields. If it does not, the expression is invalid, and S1021 and S1022 are repeated until the expression of the preset extraction field satisfies the field attribute information.
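The configure-and-check loop of S1021 and S1022 can be sketched for regular expressions as below. This is a simplified illustration under assumptions: only three of the attribute fields are checked (required, may be empty, expression count limit), the function and parameter names are hypothetical, and XPath syntax checking is left out because the Python standard library offers no general XPath validator.

```python
import re
from typing import Iterable, List, Optional


def is_valid(expressions: List[str], *, required: bool, allow_empty: bool,
             max_expressions: Optional[int] = None) -> bool:
    """Check candidate field expressions against a subset of the field attributes."""
    if not expressions:
        # An empty configuration is acceptable only for optional, nullable fields.
        return allow_empty and not required
    if max_expressions is not None and len(expressions) > max_expressions:
        return False
    for expression in expressions:
        try:
            re.compile(expression)  # syntax check of the regular expression
        except re.error:
            return False
    return True


def configure_field(candidates: Iterable[List[str]],
                    **attributes) -> Optional[List[str]]:
    """Repeat S1021 and S1022: return the first candidate that passes the check."""
    for expressions in candidates:
        if is_valid(expressions, **attributes):
            return expressions
    return None


time_rule = dict(required=True, allow_empty=False, max_expressions=1)
chosen = configure_field(
    [["(\\d{4}"],                  # unbalanced parenthesis: rejected
     [r"\d{4}-\d{2}-\d{2}"]],      # well-formed date pattern: accepted
    **time_rule,
)
```

In practice the candidates would come from an operator or a generator rather than a fixed list; the loop simply stops at the first expression set that satisfies the attribute constraints.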
It is provided in this embodiment to extract field configuration template data according to default in meta template, using preset algorithm to default It extracts field to be configured, for example, preset algorithm is x-path and/or regular expression, and generates corresponding field expression, Judge whether field expression is legal according to field attribute information, if legal, then completes template data configuration;If it is illegal, then Repeat preset algorithm to the default configuration for extracting field, until legal.The template data configured be website to be collected with The default data for extracting field exact matching, so that complying fully with default extractor according to the template file that the template data generates The demand of section carries out data acquisition for subsequent acquisition system and provides sufficient condition, improves work efficiency.
On the basis of the various embodiments described above, a kind of possible implementation of step 103 is as shown in figure 3, Fig. 3 is this hair A kind of flow diagram that template file is generated according to meta template and template data that bright embodiment provides, implementation packet Include following steps:
S1031: Synthesize a second JSON file according to the page hierarchical structure data and the template data.
The page hierarchical structure data of the meta template is read, and the second JSON file is synthesized from the page hierarchical structure data and the template data configured according to the preset extraction fields in the meta template.
S1032: Parse the second multiway tree structure in the second JSON file.
After the second JSON file has been synthesized from the page hierarchical structure data and the template data, the second multiway tree structure in the second JSON file is parsed.
S1033: Generate an XML file according to the second multiway tree structure.
The template file is the XML file.
The method of generating an XML file from a multiway tree structure is prior art and is not described again in this embodiment.
In the generation of a template file according to the meta template and the template data provided by this embodiment, the second JSON file is synthesized according to the page hierarchical structure data and the template data, the second multiway tree structure in the second JSON file is parsed, and the XML file is generated according to the second multiway tree structure. This produces, for websites to be collected of different task types, a customized template file matched to each website. It overcomes the inflexibility of the prior art, in which websites of any task type all use the same template file, as well as the excessive flexibility of formulating a separate template file for every website, and it reduces manual participation. In addition, delivering the template file to the acquisition system enables automated collection and processing of the raw data, which improves work efficiency.
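Steps S1031 to S1033 can be sketched as follows. This is only an illustration under stated assumptions: the embodiment does not specify the JSON schema or the XML layout, so the node keys `name`, `expression` and `children`, the XML element name `node`, and the function names `merge`, `tree_to_xml` and `generate_template` are all hypothetical. Field names are assumed to be unique across pages for simplicity.

```python
import json
import xml.etree.ElementTree as ET


def merge(hierarchy, template_data):
    """S1031: attach each configured expression to its node and serialize.

    hierarchy is a nested dict (page hierarchy of the meta template);
    template_data maps a field name to its field expression.
    """
    def walk(node):
        merged = {"name": node["name"]}
        if node["name"] in template_data:
            merged["expression"] = template_data[node["name"]]
        merged["children"] = [walk(c) for c in node.get("children", [])]
        return merged

    return json.dumps(walk(hierarchy))


def tree_to_xml(node):
    """S1033: convert one tree node (and its subtree) to an XML element."""
    elem = ET.Element("node", name=node["name"])
    if "expression" in node:
        elem.set("expression", node["expression"])
    for child in node["children"]:
        elem.append(tree_to_xml(child))
    return elem


def generate_template(hierarchy, template_data):
    second_json = merge(hierarchy, template_data)  # S1031
    tree = json.loads(second_json)                 # S1032: parse the tree
    return ET.tostring(tree_to_xml(tree), encoding="unicode")  # S1033


hierarchy = {
    "name": "site",
    "children": [{"name": "list_page", "children": [{"name": "title"}]}],
}
xml_text = generate_template(hierarchy, {"title": ".//h1"})
```

The intermediate JSON string stands in for the second JSON file; in practice it would presumably be written to disk before being parsed again.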
Fig. 4 is a schematic structural diagram of a data processing device provided by an embodiment of the present invention. The data processing device provided by this embodiment can be used to execute the data processing method provided by the above embodiments. As shown in Fig. 4, the data processing device 40 includes a determining module 41, a processing module 42 and a generation module 43.
The determining module 41 is used to determine the task type of the website to be collected and to determine the meta template corresponding to the task type according to the preset meta template database, wherein the task type is used to characterize the attributes of the website to be collected. The processing module 42 is used to configure template data according to the preset extraction fields in the meta template, wherein the template data is the data in the website to be collected that matches the preset extraction fields. The generation module 43 is used to generate a template file according to the meta template and the template data.
The implementation principle of this embodiment is similar to that of the method embodiments above and is not repeated here.
In this embodiment, the determining module determines the task type of the website to be collected and determines the meta template corresponding to the task type according to the preset meta template database, the processing module configures template data according to the preset extraction fields in the meta template, and the generation module generates a template file according to the meta template and the template data. A template file matched to each website can thus be customized for websites to be collected of different task types, without being limited to the prior-art approaches in which websites of any task type all use the same template file or every website requires its own separately formulated template file. This reduces manual participation. Delivering the template file generated by the data processing method for the website to be collected to the acquisition system enables automated collection and processing of the raw data, which improves work efficiency.
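The division of work among the three modules can be sketched as a small Python class. This is a hypothetical arrangement, not the structure of the device in Fig. 4: the class name `DataProcessingDevice`, its attribute names, and the use of a dictionary as the meta template database are assumptions, and each module is passed in as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DataProcessingDevice:
    # Assumed representation of the preset meta template database:
    # task type -> meta template.
    meta_template_db: Dict[str, dict]
    # Determining module: website -> task type.
    classify: Callable[[str], str]
    # Processing module: (meta template, website) -> template data.
    configure: Callable[[dict, str], dict]
    # Generation module: (meta template, template data) -> template file text.
    generate: Callable[[dict, dict], str]

    def run(self, site: str) -> str:
        task_type = self.classify(site)
        meta_template = self.meta_template_db[task_type]
        template_data = self.configure(meta_template, site)
        return self.generate(meta_template, template_data)


# Toy stand-ins for the three modules, for illustration only.
device = DataProcessingDevice(
    meta_template_db={"news": {"fields": ["title"]}},
    classify=lambda site: "news",
    configure=lambda mt, site: {f: ".//h1" for f in mt["fields"]},
    generate=lambda mt, td: repr(sorted(td.items())),
)
result = device.run("example.com")
```

Passing the modules as callables keeps the sketch short; a real device would likely implement each module as its own component.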
Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. The electronic device provided by this embodiment can be used to execute the data processing method provided by the above embodiments, and may be a desktop computer, a notebook computer, a tablet computer, a mobile phone, or the like. As shown in Fig. 5, the electronic device 50 provided by this embodiment includes:
a processor 51 and a memory 52, wherein the processor 51 is configured to execute the data processing method in the above embodiments by executing executable instructions, and the memory 52 is used to store the executable instructions of the processor 51.
In an exemplary embodiment, a storage medium is also provided. For example, the readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. A computer program is stored on the storage medium, and the data processing method of the above embodiments is implemented when the program is executed by a processor.
Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses or adaptations of the invention that follow its general principles and include common knowledge or conventional techniques in the art not disclosed herein. The specification and embodiments are to be regarded as illustrative only, and the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

1. A data processing method, characterized by comprising:
determining the task type of a website to be collected, and determining a meta template corresponding to the task type according to a preset meta template database, the task type being used to characterize the attributes of the website to be collected;
configuring template data according to preset extraction fields in the meta template, the template data being the data in the website to be collected that matches the preset extraction fields;
generating a template file according to the meta template and the template data.
2. The data processing method according to claim 1, characterized in that, before the configuring of template data according to the preset extraction fields in the meta template, the method comprises:
obtaining page hierarchical structure data of the meta template;
obtaining configuration data of each page according to the page hierarchical structure data, the configuration data including field structure information, field name information and field attribute information;
determining the preset extraction fields in the meta template according to the configuration data.
3. The data processing method according to claim 2, characterized in that the obtaining of the page hierarchical structure data of the meta template comprises:
obtaining a first JSON file of the meta template;
parsing a first multiway tree structure in the first JSON file;
traversing the first multiway tree structure to determine the page hierarchical structure data, and establishing a correspondence between the page hierarchical structure data and each page in the website to be collected.
4. The data processing method according to claim 3, characterized in that the obtaining of the configuration data of each page according to the page hierarchical structure data comprises:
obtaining the configuration data of each page according to the first multiway tree structure.
5. The data processing method according to claim 3, characterized in that the determining of the task type of the website to be collected comprises:
determining the task type according to at least one of the page structure information, page field information and page field attributes of the website to be collected.
6. The data processing method according to claim 5, characterized in that the page structure information includes quantity information of the branches of height 1 in the first multiway tree structure, wherein each branch of height 1 in the first multiway tree structure corresponds to one page;
the page field information includes quantity information of the branches of height 2+n in the first multiway tree structure, each branch of height 2+n in the first multiway tree structure corresponding to one page field, wherein n is a natural number;
the page field attribute is the attribute information corresponding to each page field, the attribute information including a plurality of attribute fields.
7. The data processing method according to any one of claims 2-6, characterized in that the configuring of template data according to the preset extraction fields in the meta template comprises:
configuring the preset extraction fields according to a preset algorithm to generate a field expression;
judging whether the field expression is legal according to the field attribute information, the judgment result being yes.
8. The data processing method according to claim 7, characterized in that the generating of a template file according to the meta template and the template data comprises:
synthesizing a second JSON file according to the page hierarchical structure data and the template data;
parsing a second multiway tree structure in the second JSON file;
generating an XML file according to the second multiway tree structure, the template file being the XML file.
9. A data processing device, characterized by comprising:
a determining module, for determining the task type of a website to be collected and determining a meta template corresponding to the task type according to a preset meta template database, the task type being used to characterize the attributes of the website to be collected;
a processing module, for configuring template data according to preset extraction fields in the meta template, the template data being the data in the website to be collected that matches the preset extraction fields;
a generation module, for generating a template file according to the meta template and the template data.
10. An electronic device, characterized by comprising:
a processor; and
a memory, for storing executable instructions of the processor;
wherein the processor is configured to perform the data processing method according to any one of claims 1 to 8 by executing the executable instructions.
CN201910777066.4A 2019-08-22 2019-08-22 Data processing method, device and electronic equipment Pending CN110489628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910777066.4A CN110489628A (en) 2019-08-22 2019-08-22 Data processing method, device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910777066.4A CN110489628A (en) 2019-08-22 2019-08-22 Data processing method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN110489628A (en) 2019-11-22

Family

ID=68552741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910777066.4A Pending CN110489628A (en) 2019-08-22 2019-08-22 Data processing method, device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110489628A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541107A (en) * 2020-12-25 2021-03-23 天津浪淘科技股份有限公司 Page data learning and automatic acquisition method
CN112597420A (en) * 2020-12-25 2021-04-02 第四范式(北京)技术有限公司 Method and device for realizing unified data management

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090063500A1 (en) * 2007-08-31 2009-03-05 Microsoft Corporation Extracting data content items using template matching
CN101944094A (en) * 2009-07-06 2011-01-12 富士通株式会社 Webpage information extraction method and device thereof
CN102681994A (en) * 2011-03-07 2012-09-19 北京百度网讯科技有限公司 Webpage information extracting method and system
CN109343851A (en) * 2018-09-26 2019-02-15 中国平安人寿保险股份有限公司 Page generation method, device, computer equipment and storage medium
CN109542901A (en) * 2018-11-12 2019-03-29 北京懿医云科技有限公司 Data processing method, device, computer readable storage medium and electronic equipment
CN109753596A (en) * 2018-12-29 2019-05-14 中国科学院计算技术研究所 Information source management and configuration method and system for the acquisition of large scale network data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘纪平: "《网络地理信息获取融合与分析挖掘》", 31 March 2018, 测绘出版社 *

Similar Documents

Publication Publication Date Title
CN107766371B (en) Text information classification method and device
CN108305180B (en) Friend recommendation method and device
JP2003330948A (en) Device and method for evaluating web page
CN109740159B (en) Processing method and device for named entity recognition
CN103778200B (en) A kind of message information source abstracting method and its system
CN104915426B (en) Information sorting method, the method and device for generating information sorting model
CN108334489A (en) Text core word recognition method and device
CN106874397B (en) Automatic semantic annotation method for Internet of things equipment
EP3961426A2 (en) Method and apparatus for recommending document, electronic device and medium
CN111737443B (en) Answer text processing method and device and key text determining method
CN114238573A (en) Information pushing method and device based on text countermeasure sample
CN110489628A (en) Data processing method, device and electronic equipment
CN104537080B (en) Information recommends method and system
Fuad et al. Analysis and classification of mobile apps using topic modeling: A case study on Google Play Arabic apps
CN105183843B (en) list page identification system and method
JP6868576B2 (en) Event presentation system and event presentation device
CN112307318A (en) Content publishing method, system and device
CN106383857A (en) Information processing method and electronic equipment
CN107506407B (en) File classification and calling method and device
CN111475607B (en) Web data clustering method based on Mashup service function feature representation and density peak detection
CN109033078B (en) The recognition methods of sentence classification and device, storage medium, processor
JP2001209655A (en) Information providing device, information updating method, recording medium having information providing program recorded thereon and information providing system
CN108875014B (en) Precise project recommendation method based on big data and artificial intelligence and robot system
CN107590163B (en) The methods, devices and systems of text feature selection
CN105677827B (en) A kind of acquisition methods and device of list

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191122