CN107085610A - An intelligent unstructured data processing method - Google Patents
- Publication number: CN107085610A
- Application number: CN201710283018.0A
- Authority: CN (China)
- Prior art keywords: data processing, template, data, template name, regular expression
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an intelligent unstructured data processing method comprising the following steps: one or more web page strings are retrieved from the structural data to be processed; unstructured data processing is performed on the retrieved web page strings, which are configured into regular-expression templates through regular-expression-based unstructured data processing; before the regular-expression templates are configured, the one or more network strings retrieved from the structural data are stored as a backup; the structure names in the regular-expression templates are processed and configured as template names; the configured template names are integrated with the data, and structured output is produced; the template names and the data are then integrated through intelligent allocation, and a corresponding chart is generated. The method is faster than conventional methods; data processing can be restricted to a specified range; list data is supported; multiple data items can be matched at once; one field supports multiple regular-expression inputs; and structured data is ultimately generated.
Description
Technical field
The present invention relates to the technical field of data processing, and specifically to an intelligent unstructured data processing method.
Background
With the rapid development of the Internet and the rapid spread of applications such as web pages, blogs, social networks, and instant messaging software, a large amount of content data has been produced. Data such as user registration information and access records exhibit structured characteristics, whereas content data such as web pages, blogs, and forums have no fixed data structure, are huge in volume, and exhibit unstructured characteristics. How to effectively store, manage, and retrieve such large-scale structured and unstructured data has become a focus of industry research. Traditional structured data processing methods suffer from insufficient speed and low efficiency.
Summary of the invention
The object of the present invention is to provide an intelligent unstructured data processing method that solves the problems raised in the background above.
To achieve this object, the present invention provides the following technical solution: an intelligent unstructured data processing method comprising the following steps:
(1) retrieving one or more web page strings from the structural data to be processed;
(2) performing unstructured data processing on the retrieved web page strings, and configuring the web page strings into regular-expression templates through regular-expression-based unstructured data processing;
(3) processing the structure names in the regular-expression templates and configuring them as template names;
(4) integrating the configured template names with the data, and then producing structured output;
(5) integrating the template names with the data through intelligent allocation, and then generating a corresponding chart.
Preferably, in step (2), before the regular-expression templates are configured, the one or more network strings retrieved from the structural data are stored as a backup.
Preferably, the template names configured in step (3) are arranged in order according to the rules in the database.
Preferably, in step (3), after the template names have been arranged in order, they are intelligently abbreviated according to the lengths of the names and the information, and step (4) is then carried out.
Preferably, in step (5), after the corresponding chart is generated, the chart format is adjusted according to the lengths of the template names and the information.
Compared with the prior art, the beneficial effects of the present invention are as follows: the method is faster than conventional methods; data processing can be restricted to a specified range; list data is supported; multiple data items can be matched at once; one field supports multiple regular-expression inputs; and structured data is ultimately generated.
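The stated properties (processing restricted to a specified range, several matches collected at once, and one field accepting several regular expressions) can be illustrated with a minimal Python sketch. The patent gives no implementation, so the field names, patterns, and the `extract_fields` helper below are assumptions made purely for illustration.

```python
import re

# Hypothetical field definitions: each field maps to one or more
# alternative regular expressions (not taken from the patent).
FIELD_PATTERNS = {
    "phone": [r"\b\d{3}-\d{4}-\d{4}\b", r"\b\d{11}\b"],
    "email": [r"[\w.+-]+@[\w-]+\.[\w.-]+"],
}

def extract_fields(text, start=0, end=None):
    """Match every field against text[start:end] and collect all hits."""
    scope = text[start:end]  # limit processing to the specified range
    result = {}
    for field, patterns in FIELD_PATTERNS.items():
        hits = []
        for pattern in patterns:  # one field, several regular expressions
            hits.extend(re.findall(pattern, scope))  # every match in the range
        result[field] = hits
    return result

sample = "Call 138-0000-1111 or 13900002222, mail a@example.com; footer 13700003333"
footer_at = sample.index("footer")
# Only the part before the footer is processed.
records = extract_fields(sample, end=footer_at)
```

Here the slice plays the role of the "specified range"; a real system might instead restrict matching to a particular page region or element.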
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The present invention provides a technical solution: an intelligent unstructured data processing method comprising the following steps:
(1) retrieving one or more web page strings from the structural data to be processed;
(2) performing unstructured data processing on the retrieved web page strings, and configuring the web page strings into regular-expression templates through regular-expression-based unstructured data processing;
(3) processing the structure names in the regular-expression templates and configuring them as template names;
(4) integrating the configured template names with the data, and then producing structured output;
(5) integrating the template names with the data through intelligent allocation, and then generating a corresponding chart.
Embodiment 1:
First, one or more web page strings are retrieved from the structural data to be processed. Unstructured data processing is then performed on the retrieved web page strings, which are configured into regular-expression templates through regular-expression-based unstructured data processing. Next, the structure names in the regular-expression templates are processed and configured as template names. The configured template names are then integrated with the data, and structured output is produced. Finally, the template names and the data are integrated through intelligent allocation, and a corresponding chart is generated.
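The five steps of this embodiment can be sketched in Python as follows. This is only one possible reading of the text, not the patented implementation: named groups are assumed to stand in for the "structure names", the "chart" is rendered as a plain-text table, and every function name, the template, and the sample data are hypothetical.

```python
import re

def retrieve_strings(raw_data):
    """Step 1: pick out web page strings (assumed here: items containing a tag)."""
    return [s for s in raw_data if re.search(r"<[^>]+>", s)]

# Step 2: a regular-expression template; its named groups act as structure names.
TEMPLATE = re.compile(r"<h1>(?P<title>.*?)</h1>\s*<span>(?P<price>[\d.]+)</span>")

def template_names(template):
    """Step 3: derive template names from the named groups, in group order."""
    return sorted(template.groupindex, key=template.groupindex.get)

def structured_output(strings, template):
    """Step 4: integrate template names with matched data as a list of records."""
    return [m.groupdict() for s in strings for m in template.finditer(s)]

def render_chart(names, records):
    """Step 5: lay the names and data out as a simple text table."""
    rows = [names] + [[r[n] for n in names] for r in records]
    return "\n".join(" | ".join(row) for row in rows)

raw = ["<h1>Tea</h1> <span>3.50</span>", "plain note", "<h1>Cup</h1><span>12</span>"]
pages = retrieve_strings(raw)
names = template_names(TEMPLATE)
records = structured_output(pages, TEMPLATE)
chart = render_chart(names, records)
print(chart)
```

Using named groups keeps the template and its field names in one place, so the structured records follow directly from each match.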
Embodiment 2:
This embodiment adds the following process to Embodiment 1:
In step (2), before the regular-expression templates are configured, the one or more network strings retrieved from the structural data are stored as a backup.
First, one or more web page strings are retrieved from the structural data to be processed. Before the regular-expression templates are configured, the one or more network strings retrieved from the structural data are stored as a backup. Unstructured data processing is then performed on the retrieved web page strings, which are configured into regular-expression templates through regular-expression-based unstructured data processing. Next, the structure names in the regular-expression templates are processed and configured as template names. The configured template names are then integrated with the data, and structured output is produced. Finally, the template names and the data are integrated through intelligent allocation, and a corresponding chart is generated.
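The backup step added in this embodiment could look like the sketch below. The patent does not say where or in what format the backup is stored, so the JSON file, its name, and both helper functions are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path

def backup_strings(strings, backup_dir):
    """Store the retrieved strings as a JSON file and return its path."""
    path = Path(backup_dir) / "retrieved_strings.json"  # hypothetical file name
    path.write_text(json.dumps(strings, ensure_ascii=False), encoding="utf-8")
    return path

def restore_strings(path):
    """Read the backed-up strings back, e.g. if template configuration fails."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

retrieved = ["<h1>Tea</h1>", "<h1>茶</h1>"]
with tempfile.TemporaryDirectory() as tmp:
    backup_path = backup_strings(retrieved, tmp)  # done before any template work
    restored = restore_strings(backup_path)
```

Backing up before template configuration means the original strings remain available even if later processing alters or discards them.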
Embodiment 3:
This embodiment adds the following process to Embodiment 2:
The template names configured in step (3) are arranged in order according to the rules in the database.
First, one or more web page strings are retrieved from the structural data to be processed. Before the regular-expression templates are configured, the one or more network strings retrieved from the structural data are stored as a backup. Unstructured data processing is then performed on the retrieved web page strings, which are configured into regular-expression templates through regular-expression-based unstructured data processing. Next, the structure names in the regular-expression templates are processed and configured as template names, and the configured template names are arranged in order according to the rules in the database. The configured template names are then integrated with the data, and structured output is produced. Finally, the template names and the data are integrated through intelligent allocation, and a corresponding chart is generated.
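The text does not specify what "the rules in the database" are. As one hedged interpretation, the sketch below assumes a hypothetical table `name_order(name, position)` that assigns each known template name a position; names missing from the table are placed last in alphabetical order. It uses Python's standard `sqlite3` module with an in-memory database.

```python
import sqlite3

def ordered_names(conn, names):
    """Sort template names by the position stored in the (assumed) rule table."""
    rank = dict(conn.execute("SELECT name, position FROM name_order"))
    # Unknown names get a rank after all known ones, then sort alphabetically.
    return sorted(names, key=lambda n: (rank.get(n, len(rank)), n))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_order (name TEXT PRIMARY KEY, position INTEGER)")
conn.executemany(
    "INSERT INTO name_order VALUES (?, ?)",
    [("title", 0), ("price", 1), ("stock", 2)],
)
result = ordered_names(conn, ["stock", "color", "title", "price"])
conn.close()
```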
Embodiment 4:
This embodiment adds the following process to Embodiment 3:
In step (3), after the template names have been arranged in order, they are intelligently abbreviated according to the lengths of the names and the information, and step (4) is then carried out.
First, one or more web page strings are retrieved from the structural data to be processed. Before the regular-expression templates are configured, the one or more network strings retrieved from the structural data are stored as a backup. Unstructured data processing is then performed on the retrieved web page strings, which are configured into regular-expression templates through regular-expression-based unstructured data processing. Next, the structure names in the regular-expression templates are processed and configured as template names, and the configured template names are arranged in order according to the rules in the database. After the template names have been arranged in order, they are intelligently abbreviated according to the lengths of the names and the information. The configured template names are then integrated with the data, and structured output is produced. Finally, the template names and the data are integrated through intelligent allocation, and a corresponding chart is generated.
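The patent does not define how the "intelligent" abbreviation works. The following sketch shows one simple length-based rule, chosen only for illustration: names within a limit are kept, multi-word names collapse to initials, and long single words are truncated. The function name and the limit of 8 characters are assumptions.

```python
def abbreviate(name, max_len=8):
    """Shorten a template name that exceeds max_len characters."""
    if len(name) <= max_len:
        return name  # short enough, keep unchanged
    words = name.replace("_", " ").split()
    if len(words) > 1:
        # Multi-word name: keep the initial of each word.
        return "".join(w[0] for w in words).upper()
    # Single long word: truncate and mark the cut with a period.
    return name[: max_len - 1] + "."

short = [abbreviate(n) for n in ["price", "publication_date", "manufacturer"]]
```

A rule like this can produce collisions (two names with the same initials), so a real implementation would need a uniqueness check before integrating the names with the data.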
Embodiment 5:
This embodiment adds the following process to Embodiment 4:
In step (5), after the corresponding chart is generated, the chart format is adjusted according to the lengths of the template names and the information.
First, one or more web page strings are retrieved from the structural data to be processed. Before the regular-expression templates are configured, the one or more network strings retrieved from the structural data are stored as a backup. Unstructured data processing is then performed on the retrieved web page strings, which are configured into regular-expression templates through regular-expression-based unstructured data processing. Next, the structure names in the regular-expression templates are processed and configured as template names, and the configured template names are arranged in order according to the rules in the database. After the template names have been arranged in order, they are intelligently abbreviated according to the lengths of the names and the information. The configured template names are then integrated with the data, and structured output is produced. Finally, the template names and the data are integrated through intelligent allocation, and a corresponding chart is generated; after the chart is generated, its format is adjusted according to the lengths of the template names and the information.
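Adjusting the chart format to the lengths of the names and the information can be read as sizing each column to its longest entry. The sketch below does this for a plain-text table; treating the "chart" as a text table, and the `format_chart` helper itself, are assumptions rather than details from the patent.

```python
def format_chart(names, records):
    """Render a text table whose column widths fit the longest name or value."""
    widths = [
        max([len(n)] + [len(str(r[n])) for r in records])  # widest cell per column
        for n in names
    ]
    header = " | ".join(n.ljust(w) for n, w in zip(names, widths))
    rule = "-+-".join("-" * w for w in widths)
    body = [
        " | ".join(str(r[n]).ljust(w) for n, w in zip(names, widths))
        for r in records
    ]
    return "\n".join([header, rule] + body)

names = ["title", "price"]
records = [{"title": "Green tea", "price": "3.50"}, {"title": "Cup", "price": "12"}]
chart = format_chart(names, records)
print(chart)
```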
The present invention is faster than conventional methods; data processing can be restricted to a specified range; list data is supported; multiple data items can be matched at once; one field supports multiple regular-expression inputs; and structured data is ultimately generated.
Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.
Claims (5)
1. An intelligent unstructured data processing method, characterized by comprising the following steps:
(1) retrieving one or more web page strings from the structural data to be processed;
(2) performing unstructured data processing on the retrieved web page strings, and configuring the web page strings into regular-expression templates through regular-expression-based unstructured data processing;
(3) processing the structure names in the regular-expression templates and configuring them as template names;
(4) integrating the configured template names with the data, and then producing structured output;
(5) integrating the template names with the data through intelligent allocation, and then generating a corresponding chart.
2. The intelligent unstructured data processing method according to claim 1, characterized in that: in step (2), before the regular-expression templates are configured, the one or more network strings retrieved from the structural data are stored as a backup.
3. The intelligent unstructured data processing method according to claim 1, characterized in that: the template names configured in step (3) are arranged in order according to the rules in the database.
4. The intelligent unstructured data processing method according to claim 1 or 3, characterized in that: in step (3), after the template names have been arranged in order, they are intelligently abbreviated according to the lengths of the names and the information, and step (4) is then carried out.
5. The intelligent unstructured data processing method according to claim 1, characterized in that: in step (5), after the corresponding chart is generated, the chart format is adjusted according to the lengths of the template names and the information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710283018.0A CN107085610A (en) | 2017-04-26 | 2017-04-26 | A kind of intelligent unstructured data processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710283018.0A CN107085610A (en) | 2017-04-26 | 2017-04-26 | A kind of intelligent unstructured data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107085610A true CN107085610A (en) | 2017-08-22 |
Family ID: 59613025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710283018.0A Pending CN107085610A (en) | 2017-04-26 | 2017-04-26 | A kind of intelligent unstructured data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107085610A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111461668A (en) * | 2020-04-08 | 2020-07-28 | 国网天津市电力公司 | Digital auditing system and method based on process automation technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561802A (en) * | 2008-04-18 | 2009-10-21 | 上海复旦光华信息科技股份有限公司 | Web page structural data extraction method and system |
CN104268283A (en) * | 2014-10-21 | 2015-01-07 | 浪潮集团有限公司 | Method for automatically analyzing Internet web page |
CN105138575A (en) * | 2015-07-29 | 2015-12-09 | 百度在线网络技术(北京)有限公司 | Analysis method and device of voice text string |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103390038B (en) | A kind of method of structure based on HBase and retrieval increment index | |
CN104021198B (en) | The relational database information search method and device indexed based on Ontology | |
CN103699525A (en) | Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text | |
CN106446072B (en) | The treating method and apparatus of web page contents | |
CN104102737A (en) | Historical data storage method and system | |
CN109657072B (en) | Intelligent search WEB system and method applied to government aid decision | |
CN103631909A (en) | System and method for combined processing of large-scale structured and unstructured data | |
CN101799827A (en) | Video database management method based on layering structure | |
CN107391502A (en) | The data query method, apparatus and index structuring method of time interval, device | |
CN106528898A (en) | Method and device for converting data of non-relational database into relational database | |
CN106649578A (en) | Public opinion analysis method and system based on social network platform | |
CN106919697A (en) | A kind of method that data are imported multiple Hadoop components simultaneously | |
CN102867049A (en) | Chinese PINYIN quick word segmentation method based on word search tree | |
CN104408128B (en) | A kind of reading optimization method indexed based on B+ trees asynchronous refresh | |
CN109922131A (en) | Date storage method, device, equipment and storage medium based on block chain | |
CN102270238A (en) | Method and device for establishing continuation of Chinese knowledge points | |
CN107085610A (en) | A kind of intelligent unstructured data processing method | |
CN107861965A (en) | Data intelligence recognition methods and system | |
CN106599305B (en) | Crowdsourcing-based heterogeneous media semantic fusion method | |
KR101830504B1 (en) | In-Memory DB Connection Support Type Scheduling Method and System for Real-Time Big Data Analysis in Distributed Computing Environment | |
CN201955781U (en) | Web information issuing and management system | |
CN103279506A (en) | Method for extracting journal paper unstructured data based on electric power technology | |
CN106484660A (en) | Title treating method and apparatus | |
CN113177478B (en) | Short video semantic annotation method based on transfer learning | |
Yang et al. | Mass flow logs analysis system based on Hadoop |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170822 |