CN107085610A - An intelligent unstructured data processing method - Google Patents

An intelligent unstructured data processing method

Info

Publication number
CN107085610A
Authority
CN
China
Prior art keywords
data processing
template
data
template name
regular expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710283018.0A
Other languages
Chinese (zh)
Inventor
王振宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou Lucheng District New Research Institute Of Advanced Technology
Original Assignee
Wenzhou Lucheng District New Research Institute Of Advanced Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou Lucheng District New Research Institute Of Advanced Technology
Priority to CN201710283018.0A
Publication of CN107085610A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent unstructured data processing method comprising the following steps: one or more web-page strings are retrieved from the structured data to be processed; the retrieved web-page strings undergo unstructured data processing, in which regular-expression processing configures the web-page strings into regular-expression templates; before the regular-expression templates are configured, the one or more network strings retrieved from the structured data are stored as a backup; the structure names in the regular-expression templates are processed into template names; the configured template names are integrated with the data and output in structured form; and the template names are integrated with the data through intelligent allocation to generate a corresponding chart. The method is faster than conventional methods; data processing can be restricted to a specified range; list data is supported; multiple data items can be matched in a single pass; one field supports multiple regular-expression inputs; and structured data is ultimately generated.

Description

An intelligent unstructured data processing method
Technical field
The present invention relates to the technical field of data processing, and specifically to an intelligent unstructured data processing method.
Background
With the rapid development of the Internet and the rapid spread of applications such as web pages, blogs, social networks, and instant messaging software, a large amount of content data has been generated. Data such as user registration information and access records exhibit structured characteristics, whereas content data such as web pages, blogs, and forums have no fixed data structure, are huge in volume, and exhibit unstructured characteristics. How to effectively store, manage, and retrieve such large-scale structured and unstructured data has become a focus of industry research. Traditional structured data processing methods are neither fast enough nor efficient enough.
Summary of the invention
The object of the present invention is to provide an intelligent unstructured data processing method that solves the problems raised in the background section above.
To achieve the above object, the present invention provides the following technical solution: an intelligent unstructured data processing method comprising the following steps:
(1) retrieving one or more web-page strings from the structured data to be processed;
(2) performing unstructured data processing on the retrieved web-page strings, in which regular-expression processing configures the web-page strings into regular-expression templates;
(3) processing the structure names in the regular-expression templates into template names;
(4) integrating the configured template names with the data, then producing structured output;
(5) integrating the template names with the data through intelligent allocation, then generating a corresponding chart.
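Steps (1) to (4) can be illustrated with a minimal Python sketch. It is not the patented implementation: the row layout with an `html` key, the sample pattern, and the function names are all assumptions made for illustration. The sketch treats a compiled regular expression as the template and its named groups as the template names.

```python
import re

# Hypothetical regex template: each named group acts as a template name.
TEMPLATE = re.compile(
    r'<h1>(?P<title>[^<]*)</h1>\s*<span class="price">(?P<price>[\d.]+)</span>'
)

def retrieve_page_strings(rows):
    """Step (1): pull web-page strings out of the rows to be processed.
    The 'html' key is an assumed field name."""
    return [row["html"] for row in rows if isinstance(row.get("html"), str)]

def extract_records(page):
    """Steps (2)-(4): apply the template and pair each template name with its data."""
    return [match.groupdict() for match in TEMPLATE.finditer(page)]

rows = [
    {"id": 1, "html": '<h1>Widget</h1> <span class="price">9.50</span>'
                      '<h1>Gadget</h1><span class="price">12</span>'},
    {"id": 2, "html": None},  # rows without a page string are skipped
]
pages = retrieve_page_strings(rows)
records = [record for page in pages for record in extract_records(page)]
```

Each record is a plain dictionary keyed by template name, which is one simple form the structured output of step (4) could take.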
Preferably, in step (2), before the regular-expression templates are configured, the one or more network strings retrieved from the structured data are stored as a backup.
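One possible way to keep such a backup is to write the retrieved strings to a file before any template is applied. The JSON format, the file name, and the function names below are assumptions; the text does not say where or how the backup is stored.

```python
import json
import tempfile
from pathlib import Path

def backup_strings(strings, backup_dir):
    """Write the retrieved strings to a JSON file and return its path."""
    path = Path(backup_dir) / "retrieved_strings.json"  # assumed file name
    path.write_text(json.dumps(strings, ensure_ascii=False), encoding="utf-8")
    return path

def restore_strings(path):
    """Read the backed-up strings back, unchanged."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

original = ["<p>first page</p>", "<p>second page</p>"]
with tempfile.TemporaryDirectory() as backup_dir:
    backup_path = backup_strings(original, backup_dir)
    restored = restore_strings(backup_path)
```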
Preferably, the template names configured in step (3) are arranged in order according to the rules in the database.
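The text does not describe what the ordering rules look like. As a sketch, assume the database holds a table that assigns each template name a position; the table name `template_order` and its columns are hypothetical. Names without a rule are placed last, in alphabetical order.

```python
import sqlite3

def order_template_names(names, conn):
    """Sort names by the position stored in the (assumed) template_order table."""
    rank = dict(conn.execute("SELECT name, position FROM template_order"))
    # Unknown names get a rank past the end, then fall back to alphabetical order.
    return sorted(names, key=lambda name: (rank.get(name, len(rank)), name))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE template_order (name TEXT PRIMARY KEY, position INTEGER)")
conn.executemany(
    "INSERT INTO template_order VALUES (?, ?)",
    [("title", 0), ("price", 1), ("author", 2)],
)

ordered = order_template_names(["price", "extra", "title"], conn)
```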
Preferably, in step (3), after the template names have been arranged in order, they are intelligently abbreviated according to the length of the names and of the information, and step (4) is then carried out.
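The abbreviation rule itself is not specified. The sketch below substitutes the simplest length-based rule, truncation with a trailing marker; the 12-character limit and the function name are assumptions, and a real system might use a smarter scheme.

```python
def abbreviate(text, max_len=12):
    """Shorten text to at most max_len characters, marking the cut with '...'."""
    if max_len < 4:
        raise ValueError("max_len must leave room for the '...' marker")
    if len(text) <= max_len:
        return text  # short names are kept as they are
    return text[: max_len - 3] + "..."

short_name = abbreviate("title")
long_name = abbreviate("publication_date_of_record", 12)
```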
Preferably, in step (5), after the corresponding chart has been generated, the chart format is adjusted according to the length of the template names and of the information.
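A length-based format adjustment can be sketched for a plain-text table, where each column is made as wide as its longest template name or value. The table layout and the function name are assumptions chosen for illustration; the text does not say what kind of chart is produced.

```python
def render_table(names, rows):
    """Render rows as a text table whose column widths follow the content length."""
    widths = [
        max([len(name)] + [len(str(row.get(name, ""))) for row in rows])
        for name in names
    ]

    def line(cells):
        padded = (str(cell).ljust(width) for cell, width in zip(cells, widths))
        return " | ".join(padded).rstrip()

    out = [line(names), "-+-".join("-" * width for width in widths)]
    out += [line([row.get(name, "") for name in names]) for row in rows]
    return "\n".join(out)

table = render_table(
    ["title", "price"],
    [
        {"title": "Widget", "price": "9.50"},
        {"title": "Long gadget name", "price": "12"},
    ],
)
lines = table.splitlines()
```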
Compared with the prior art, the beneficial effects of the present invention are as follows: the method is faster than conventional methods; data processing can be restricted to a specified range; list data is supported; multiple data items can be matched in a single pass; one field supports multiple regular-expression inputs; and structured data is ultimately generated.
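Three of these properties can be shown together in a short sketch: several patterns feeding one field, list-valued results, and matching limited to a slice of the input. The field names, patterns, and the `start`/`end` parameters are illustrative assumptions, not details taken from the patent.

```python
import re

# Each field may have several patterns; all of their matches are collected.
FIELD_PATTERNS = {
    "price": [
        re.compile(r"\$(\d+(?:\.\d+)?)"),
        re.compile(r"(\d+(?:\.\d+)?)\s*USD"),
    ],
    "tags": [re.compile(r"#(\w+)")],
}

def extract_fields(text, start=0, end=None):
    """Match every field's patterns, but only inside text[start:end]."""
    scope = text[start:end]
    result = {}
    for field, patterns in FIELD_PATTERNS.items():
        values = []
        for pattern in patterns:
            values.extend(pattern.findall(scope))  # list data: keep all matches
        result[field] = values
    return result

text = "header $1 ignored | body: $9.50 or 12 USD #new #sale"
fields = extract_fields(text, start=text.index("body"))
```

Values are grouped by pattern rather than by position in the text, which is a simplification of this sketch.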
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The present invention provides the following technical solution: an intelligent unstructured data processing method comprising the following steps:
(1) retrieving one or more web-page strings from the structured data to be processed;
(2) performing unstructured data processing on the retrieved web-page strings, in which regular-expression processing configures the web-page strings into regular-expression templates;
(3) processing the structure names in the regular-expression templates into template names;
(4) integrating the configured template names with the data, then producing structured output;
(5) integrating the template names with the data through intelligent allocation, then generating a corresponding chart.
Embodiment one:
First, one or more web-page strings are retrieved from the structured data to be processed. The retrieved web-page strings then undergo unstructured data processing, in which regular-expression processing configures them into regular-expression templates. The structure names in the regular-expression templates are then processed into template names. The configured template names are integrated with the data and output in structured form. Finally, the template names are integrated with the data through intelligent allocation, and a corresponding chart is generated.
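The final step of this embodiment, allocating data to template names and generating a chart, might look like the following sketch. The column-wise grouping and the text bar chart are assumptions; the text does not define "intelligent allocation" or the chart type.

```python
from collections import Counter

def allocate(records, names):
    """Group record values under their template name (one column per name)."""
    return {name: [r[name] for r in records if name in r] for name in names}

def bar_chart(values):
    """Draw one text bar per distinct value, with '#' repeated once per occurrence."""
    counts = Counter(values)
    if not counts:
        return ""
    width = max(len(key) for key in counts)
    return "\n".join(
        f"{key.ljust(width)} {'#' * count}" for key, count in sorted(counts.items())
    )

records = [
    {"category": "book", "price": "5"},
    {"category": "toy", "price": "7"},
    {"category": "book", "price": "9"},
]
columns = allocate(records, ["category", "price"])
chart = bar_chart(columns["category"])
```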
Embodiment two:
Embodiment one with the following addition:
In step (2), before the regular-expression templates are configured, the one or more network strings retrieved from the structured data are stored as a backup.
First, one or more web-page strings are retrieved from the structured data to be processed. Before the regular-expression templates are configured, the one or more network strings retrieved from the structured data are stored as a backup. The retrieved web-page strings then undergo unstructured data processing, in which regular-expression processing configures them into regular-expression templates. The structure names in the regular-expression templates are then processed into template names. The configured template names are integrated with the data and output in structured form. Finally, the template names are integrated with the data through intelligent allocation, and a corresponding chart is generated.
Embodiment three:
Embodiment two with the following addition:
The template names configured in step (3) are arranged in order according to the rules in the database.
First, one or more web-page strings are retrieved from the structured data to be processed. Before the regular-expression templates are configured, the one or more network strings retrieved from the structured data are stored as a backup. The retrieved web-page strings then undergo unstructured data processing, in which regular-expression processing configures them into regular-expression templates. The structure names in the regular-expression templates are then processed into template names, and the configured template names are arranged in order according to the rules in the database. The configured template names are then integrated with the data and output in structured form. Finally, the template names are integrated with the data through intelligent allocation, and a corresponding chart is generated.
Embodiment four:
Embodiment three with the following addition:
In step (3), after the template names have been arranged in order, they are intelligently abbreviated according to the length of the names and of the information, and step (4) is then carried out.
First, one or more web-page strings are retrieved from the structured data to be processed. Before the regular-expression templates are configured, the one or more network strings retrieved from the structured data are stored as a backup. The retrieved web-page strings then undergo unstructured data processing, in which regular-expression processing configures them into regular-expression templates. The structure names in the regular-expression templates are then processed into template names, and the configured template names are arranged in order according to the rules in the database. After the template names have been arranged in order, they are intelligently abbreviated according to the length of the names and of the information. The configured template names are then integrated with the data and output in structured form. Finally, the template names are integrated with the data through intelligent allocation, and a corresponding chart is generated.
Embodiment five:
Embodiment four with the following addition:
In step (5), after the corresponding chart has been generated, the chart format is adjusted according to the length of the template names and of the information.
First, one or more web-page strings are retrieved from the structured data to be processed. Before the regular-expression templates are configured, the one or more network strings retrieved from the structured data are stored as a backup. The retrieved web-page strings then undergo unstructured data processing, in which regular-expression processing configures them into regular-expression templates. The structure names in the regular-expression templates are then processed into template names, and the configured template names are arranged in order according to the rules in the database. After the template names have been arranged in order, they are intelligently abbreviated according to the length of the names and of the information. The configured template names are then integrated with the data and output in structured form. Finally, the template names are integrated with the data through intelligent allocation, and a corresponding chart is generated. After the chart has been generated, its format is adjusted according to the length of the template names and of the information.
The present invention is faster than conventional methods; data processing can be restricted to a specified range; list data is supported; multiple data items can be matched in a single pass; one field supports multiple regular-expression inputs; and structured data is ultimately generated.
Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, replacements, and variations can be made to these embodiments without departing from the principles and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims (5)

1. An intelligent unstructured data processing method, characterized by comprising the following steps:
(1) retrieving one or more web-page strings from the structured data to be processed;
(2) performing unstructured data processing on the retrieved web-page strings, in which regular-expression processing configures the web-page strings into regular-expression templates;
(3) processing the structure names in the regular-expression templates into template names;
(4) integrating the configured template names with the data, then producing structured output;
(5) integrating the template names with the data through intelligent allocation, then generating a corresponding chart.
2. The intelligent unstructured data processing method according to claim 1, characterized in that: in step (2), before the regular-expression templates are configured, the one or more network strings retrieved from the structured data are stored as a backup.
3. The intelligent unstructured data processing method according to claim 1, characterized in that: the template names configured in step (3) are arranged in order according to the rules in the database.
4. The intelligent unstructured data processing method according to claim 1 or 3, characterized in that: in step (3), after the template names have been arranged in order, they are intelligently abbreviated according to the length of the names and of the information, and step (4) is then carried out.
5. The intelligent unstructured data processing method according to claim 1, characterized in that: in step (5), after the corresponding chart has been generated, the chart format is adjusted according to the length of the template names and of the information.
CN201710283018.0A 2017-04-26 2017-04-26 An intelligent unstructured data processing method Pending CN107085610A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710283018.0A CN107085610A (en) 2017-04-26 2017-04-26 A kind of intelligent unstructured data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710283018.0A CN107085610A (en) 2017-04-26 2017-04-26 A kind of intelligent unstructured data processing method

Publications (1)

Publication Number Publication Date
CN107085610A (en) 2017-08-22

Family

ID=59613025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710283018.0A Pending CN107085610A (en) 2017-04-26 2017-04-26 An intelligent unstructured data processing method

Country Status (1)

Country Link
CN (1) CN107085610A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461668A (en) * 2020-04-08 2020-07-28 国网天津市电力公司 Digital auditing system and method based on process automation technology

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561802A (en) * 2008-04-18 2009-10-21 上海复旦光华信息科技股份有限公司 Web page structural data extraction method and system
CN104268283A (en) * 2014-10-21 2015-01-07 浪潮集团有限公司 Method for automatically analyzing Internet web page
CN105138575A (en) * 2015-07-29 2015-12-09 百度在线网络技术(北京)有限公司 Analysis method and device of voice text string

Similar Documents

Publication Publication Date Title
CN103390038B (en) A kind of method of structure based on HBase and retrieval increment index
CN104021198B (en) The relational database information search method and device indexed based on Ontology
CN103699525A (en) Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text
CN106446072B (en) The treating method and apparatus of web page contents
CN104102737A (en) Historical data storage method and system
CN109657072B (en) Intelligent search WEB system and method applied to government aid decision
CN103631909A (en) System and method for combined processing of large-scale structured and unstructured data
CN101799827A (en) Video database management method based on layering structure
CN107391502A (en) The data query method, apparatus and index structuring method of time interval, device
CN106528898A (en) Method and device for converting data of non-relational database into relational database
CN106649578A (en) Public opinion analysis method and system based on social network platform
CN106919697A (en) A kind of method that data are imported multiple Hadoop components simultaneously
CN102867049A (en) Chinese PINYIN quick word segmentation method based on word search tree
CN104408128B (en) A kind of reading optimization method indexed based on B+ trees asynchronous refresh
CN109922131A (en) Date storage method, device, equipment and storage medium based on block chain
CN102270238A (en) Method and device for establishing continuation of Chinese knowledge points
CN107085610A (en) A kind of intelligent unstructured data processing method
CN107861965A (en) Data intelligence recognition methods and system
CN106599305B (en) Crowdsourcing-based heterogeneous media semantic fusion method
KR101830504B1 (en) In-Memory DB Connection Support Type Scheduling Method and System for Real-Time Big Data Analysis in Distributed Computing Environment
CN201955781U (en) Web information issuing and management system
CN103279506A (en) Method for extracting journal paper unstructured data based on electric power technology
CN106484660A (en) Title treating method and apparatus
CN113177478B (en) Short video semantic annotation method based on transfer learning
Yang et al. Mass flow logs analysis system based on Hadoop

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170822