CN108399205A - Method and device for high-speed data processing, conversion, and communication - Google Patents
- Publication number
- CN108399205A
- Authority
- CN
- China
- Prior art keywords
- data
- web
- speed processing
- communication means
- screened data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Transfer Between Computers (AREA)
Abstract
An embodiment of the present invention provides a method for high-speed data processing, conversion, and communication. The method comprises the steps of: collecting web data according to a preset data collection rule; filtering and normalizing the collected web data to obtain screened data; classifying the screened data using a preset classification model to obtain K classes of data; and performing high-speed processing on the classified K classes of data using an FPGA chip in a radio-frequency high-speed data processing board. With the embodiment of the present invention, web data can be extracted effectively and duplicate information normalized, making it easier for users to make effective use of web data.
Description
Technical field
The present invention relates to the field of electronic technology, and in particular to a method and device for high-speed data processing, conversion, and communication.
Background
With the spread of computers and the rapid development of the Internet (WWW), a large amount of information now reaches people in the form of electronic documents. To meet the serious challenge posed by this information explosion, automated tools are urgently needed to help people quickly find the information they actually need among massive information sources. Research on information extraction (Information Extraction) arose against this background.
The main function of an information extraction system is to extract specific factual information from text. For example, it may extract the details of a terrorist incident from news reports: time, place, perpetrator, victim, target, weapon used, and so on; the details of a company's new product launch from economic news: company name, product name, release date, product performance, and so on; or symptoms, diagnostic records, test results, prescriptions, and so on from a patient's medical records. The extracted information is usually described in structured form and can be stored directly in a database for users to query and for further analysis and use.
Information extraction is an emerging research field. It generally refers to the process of automatically identifying preset types of information, such as entities, relations, and events, in a given collection of documents, and of storing and managing that information in structured form. Information extraction has important applications in many fields.
A research area closely related to information extraction is information retrieval, but the two differ, mainly in three respects:
1. Function. An information retrieval system finds, in a large document collection, a list of documents relevant to the user's need; an information extraction system aims to obtain the factual information the user is interested in directly from the text.
2. Processing technique. Information retrieval systems usually rely on statistics and keyword matching and treat a text as a set of words (bag of words), without analyzing the text in depth; information extraction usually relies on natural language processing techniques and can complete its task only after analyzing the sentences and discourse of the text.
3. Field of application. Because the techniques differ, information retrieval systems are usually domain-independent, whereas information extraction systems are domain-dependent and can extract only the limited kinds of factual information that the system has been set up for in advance.
On the other hand, information retrieval and information extraction are complementary. To handle massive amounts of text, an information extraction system usually takes the output of an information retrieval system (such as text filtering) as its input, and information extraction techniques can in turn be used to improve the performance of information retrieval systems. Combining the two serves users' information processing needs better.
Although information extraction requires some degree of text understanding, it differs from true text understanding (Text Understanding). In information extraction, the user usually cares only about a limited set of factual information of interest, not about deep-understanding questions such as nuances of meaning or the author's intent [1]. Information extraction can therefore be regarded only as a shallow, or simplified, text understanding technique.
In general, information extraction systems process natural language text, especially unstructured text. In a broad sense, however, besides electronic text they may also process data in other media such as speech, images, and video. Here, only information extraction in the narrow sense is discussed, that is, information extraction from natural language text.
In recent years, as networks have developed, the amount of information on the Internet has kept growing. Almost all network information is presented to users as structured or semi-structured text. Web page information extraction extracts the relevant information contained in web pages and structures it so that it is organized like a table. Its main task is to extract predetermined information points from a wide variety of web pages and integrate them in a unified form so that they are easy to check and compare.
On the Internet, information on the same topic is usually scattered across different websites and presented in different forms, so in the prior art it is difficult to mine the desired web content completely. In addition, information on the Internet is reposted frequently, so how to normalize duplicate information is also a key problem.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a method and device for high-speed data processing, conversion, and communication that can extract web data effectively and normalize duplicate information, making it easier for users to make effective use of web data.
To achieve this purpose, an embodiment of the present invention provides a method for high-speed data processing, conversion, and communication, the method comprising the steps of:
Collecting web data according to a preset data collection rule;
Filtering and normalizing the collected web data to obtain screened data;
Classifying the screened data using a preset classification model to obtain K classes of data;
Performing high-speed processing on the classified K classes of data using an FPGA chip in a radio-frequency high-speed data processing board.
Optionally, collecting web data according to the preset data collection rule includes:
Customizing the web pages for data collection according to the target;
Determining the body data block of a web page according to the page structure, automatically generating a web data extraction template, and extracting the web data.
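The body-block step can be pictured with a simple text-density heuristic. The sketch below is only an illustration under stated assumptions: the text does not say how the body data block is determined or how the extraction template is generated, so the class `BlockCollector`, the function `main_text_block`, and the set of container tags are all hypothetical. It picks the container element holding the most text that is not link text, on the assumption that navigation bars are mostly links.

```python
from html.parser import HTMLParser

# Hypothetical choice of container tags treated as candidate blocks.
BLOCK_TAGS = {"div", "article", "section", "td"}


class BlockCollector(HTMLParser):
    """Collects, for each block element, its text and the amount of link text."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open blocks: [tag, text_parts, link_chars]
        self.blocks = []   # finished blocks: (text, link_chars)
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([tag, [], 0])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1][0] == tag:
            _, parts, link_chars = self.stack.pop()
            self.blocks.append((" ".join(parts), link_chars))

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # Attribute text to the innermost open block only.
            self.stack[-1][1].append(text)
            if self.in_link:
                self.stack[-1][2] += len(text)


def main_text_block(html):
    """Return the text of the block with the most non-link text, or '' if none."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    best = max(parser.blocks, key=lambda b: len(b[0]) - b[1], default=("", 0))
    return best[0]
```

A real extraction template would also need to record where the chosen block sits in the page (for example its tag path) so that it can be reused on other pages of the same site; that part is not sketched here.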
Optionally, after the step of filtering and normalizing the collected web data to obtain screened data, the method further includes:
Encoding each text segment of the screened data, comparing the data segment by segment according to the codes, and determining the degree of data duplication;
Normalizing the duplicate data and screening the data.
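A minimal sketch of segment encoding and segment-by-segment comparison follows. It assumes that a "segment" is a paragraph, that the "code" is a SHA-256 digest of the whitespace-normalized, lower-cased paragraph, and that the degree of duplication is the share of matching segments; none of these choices is stated in the text, and the function names and the 0.8 threshold are hypothetical.

```python
import hashlib
import re


def segment_codes(text):
    """Split text into paragraph segments and encode each one with a digest."""
    segments = [s.strip() for s in re.split(r"\n+", text) if s.strip()]
    codes = []
    for seg in segments:
        normalized = re.sub(r"\s+", " ", seg).lower()
        codes.append(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return codes


def duplication_degree(text_a, text_b):
    """Fraction of segments of the shorter text that also occur in the other."""
    codes_a, codes_b = segment_codes(text_a), segment_codes(text_b)
    if not codes_a or not codes_b:
        return 0.0
    shorter, longer = sorted((codes_a, codes_b), key=len)
    longer_set = set(longer)
    return sum(code in longer_set for code in shorter) / len(shorter)


def normalize_duplicates(documents, threshold=0.8):
    """Keep the first document of every group of near-duplicates."""
    kept = []
    for doc in documents:
        if all(duplication_degree(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept
```

Comparing codes rather than raw text keeps each comparison cheap, and working per segment lets a reposted article be recognized even when a site adds or drops a paragraph.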
Optionally, storing the data in a unified manner and building an index according to the classification and clustering results to form a large database includes:
According to the classification and clustering results, classifying the K classes of data, clustering the data contained in each data class, storing the data in a unified manner, and building an index to form a large database.
Optionally, filtering the collected web data includes:
Filtering the collected web data using a Bloom filter.
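For readers unfamiliar with the structure, a Bloom filter is a bit array plus several hash functions: it answers "possibly seen" or "definitely not seen" in constant space per item. The sketch below is illustrative only; the text does not say what the filter is keyed on, so using page URLs as keys, the class and function names, and the size parameters are assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def filter_seen(urls, seen=None):
    """Return the URLs not already recorded in the filter, recording them as we go."""
    seen = seen if seen is not None else BloomFilter()
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

The trade-off is that a Bloom filter never misses an item it has stored but can occasionally report a new item as already seen, so a small fraction of new pages may be dropped; enlarging `num_bits` lowers that rate.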
Optionally, according to the classification results, the large database is divided into two levels, topic and data class, and two kinds of cluster analysis are carried out on that basis.
Optionally, according to the classification results, the large database is subdivided into four levels, topic, topic cluster, data class, and data class cluster, and four kinds of cluster analysis are carried out on that basis.
A device for high-speed data processing, conversion, and communication, characterized in that it comprises:
A collection module, configured to collect web data according to a preset data collection rule;
An obtaining module, configured to filter and normalize the collected web data to obtain screened data;
A classification module, configured to classify the screened data using a preset classification model to obtain K classes of data;
A processing module, configured to perform high-speed processing on the classified K classes of data using an FPGA chip in a radio-frequency high-speed data processing board.
Advantageous effects:
The method and device for high-speed data processing, conversion, and communication provided by the embodiments of the present invention extract web data efficiently and with good recall, avoiding omission of information; they can effectively eliminate duplicate information, which greatly reduces the space the data occupies, removes redundancy, lightens the load on subsequent processing, and improves data processing efficiency; and, with a prefabricated classification model and clustering algorithm, they classify and cluster the data, store the data in a unified database, and build a database index, making it easier for users to manage, retrieve, and use the extracted data.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a first flow diagram of the method for high-speed data processing, conversion, and communication.
Fig. 2 is a second flow diagram of the data processing method.
Fig. 3 is a structural schematic diagram of the data processing device.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The present invention is described in detail below through specific embodiments.
Referring to Fig. 1, a flow diagram of the method for high-speed data processing, conversion, and communication provided by the present invention, the method includes the following steps:
S101, collecting web data according to a preset data collection rule;
S102, filtering and normalizing the collected web data to obtain screened data;
S103, classifying the screened data using a preset classification model to obtain K classes of data;
S104, performing high-speed processing on the classified K classes of data using an FPGA chip in a radio-frequency high-speed data processing board.
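How the four steps chain together can be sketched in software as follows. This is a schematic only: the function `run_pipeline` and its parameters are hypothetical, each stage is passed in as a plain callable, and the final stage merely stands in for S104, which the text assigns to an FPGA chip on a radio-frequency high-speed data processing board rather than to software.

```python
from collections import defaultdict


def run_pipeline(pages, collect, screen, classify, process):
    """Chain S101 to S104; every stage is supplied by the caller."""
    collected = collect(pages)        # S101: collect web data
    screened = screen(collected)      # S102: filter and normalize
    classes = defaultdict(list)       # S103: group items by class label
    for item in screened:
        classes[classify(item)].append(item)
    # S104: stand-in for the hardware stage; here it is any callable.
    return {label: process(items) for label, items in classes.items()}
```

Passing the stages in as callables keeps the skeleton independent of any particular collection rule, classification model, or processing back end.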
In this embodiment, the web pages for data collection are customized according to the target. Referring to Fig. 2, the web pages to be collected come from two kinds of sources:
S201, prefabricated in-industry web pages serve as data sources;
S202, a network probe with a built-in domain ontology is deployed to automatically discover web pages related to the ontology as collection points.
Prefabricating the data sources focuses collection on the web pages the user expects, so that web data extraction is more targeted, which helps improve data collection efficiency. The collection points supplement the data sources and can improve the recall of data collection. Because data sources and collection points complement each other, data collection efficiency and recall can reach a fairly good balance.
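The interplay of prefabricated data sources and probe-discovered collection points can be sketched as below. The sketch reduces the domain ontology to a flat list of single-word terms and scores a page by the share of terms it mentions; that simplification, the function names, the 0.5 threshold, and the example URLs used in testing are all assumptions, since the text does not describe how the probe decides that a page is related to the ontology.

```python
import re


def ontology_score(text, ontology_terms):
    """Fraction of (single-word) ontology terms that appear in the page text."""
    words = set(re.findall(r"\w+", text.lower()))
    terms = {t.lower() for t in ontology_terms}
    return len(terms & words) / len(terms) if terms else 0.0


def build_collection_list(seed_urls, candidates, ontology_terms, threshold=0.5):
    """Prefabricated sources first, then candidate pages related to the ontology.

    `candidates` maps a URL to the text already fetched from it.
    """
    selected = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    for url, text in candidates.items():
        if url not in selected and ontology_score(text, ontology_terms) >= threshold:
            selected.append(url)
    return selected
```

Raising the threshold favors precision (fewer off-topic collection points), while lowering it favors recall, which mirrors the efficiency and recall balance described above.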
Encoding the text in segments and comparing it segment by segment can effectively reveal the degree of text duplication and avoid omissions.
In this embodiment, storing the data in a unified manner and building an index according to the classification and clustering results to form a large database specifically includes:
Clustering the N data classes;
Clustering the data contained in each data class.
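The two clustering passes can be illustrated with a deliberately simple algorithm. The text does not name a clustering algorithm, so the single-pass greedy method, the Jaccard similarity on word sets, the 0.3 threshold, and all function names below are assumptions made only to show the two levels: first the data classes are clustered against each other, then the documents inside each class.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(token_sets, threshold):
    """Single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; the first index is the seed
    for i, tokens in enumerate(token_sets):
        for cluster in clusters:
            if jaccard(tokens, token_sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def two_level_clusters(classes, threshold=0.3):
    """`classes` maps a class label to a list of documents (strings)."""
    labels = list(classes)
    doc_tokens = {lab: [set(d.lower().split()) for d in classes[lab]] for lab in labels}
    # Level 1: cluster the data classes, each represented by all of its words.
    class_tokens = [set().union(*doc_tokens[lab]) for lab in labels]
    class_clusters = [[labels[i] for i in c]
                      for c in greedy_cluster(class_tokens, threshold)]
    # Level 2: cluster the documents inside each data class.
    within = {lab: greedy_cluster(doc_tokens[lab], threshold) for lab in labels}
    return class_clusters, within
```

The result of such a greedy pass depends on input order, so a production system would more likely use an order-independent method; the point here is only the two-level structure.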
Further, filtering the collected web data includes:
Filtering the collected web data using a Bloom filter.
According to the classification results, the database is divided into two levels, topic and data class, and two kinds of cluster analysis are carried out on that basis. The database can also be subdivided into four levels, topic, topic cluster, data class, and data class cluster, and an indexing mechanism further established, so that users can manage, search, and use the database more conveniently.
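One possible shape for such an indexing mechanism is a prefix index over the four levels. The sketch assumes that the levels nest in the order in which they are listed (topic, then topic cluster, then data class, then data class cluster); the text does not state how the levels relate, and the class `LevelIndex` and its methods are hypothetical.

```python
from collections import defaultdict

# Assumed nesting order of the four levels named in the text.
LEVELS = ("topic", "topic_cluster", "data_class", "data_class_cluster")


class LevelIndex:
    """Maps every prefix of a four-level path to the record ids stored under it."""

    def __init__(self):
        self.records = {}
        self.by_prefix = defaultdict(set)

    def add(self, record_id, path, record):
        if len(path) != len(LEVELS):
            raise ValueError("path must name all four levels")
        self.records[record_id] = record
        for depth in range(1, len(path) + 1):
            self.by_prefix[tuple(path[:depth])].add(record_id)

    def lookup(self, *prefix):
        """Return the sorted ids of all records filed under the given prefix."""
        return sorted(self.by_prefix.get(tuple(prefix), set()))
```

With this layout a user can retrieve everything under a topic, or narrow the query one level at a time, without scanning the stored records.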
In addition, the present invention also provides a device for high-speed data processing, conversion, and communication, comprising:
A collection module 301, configured to collect web data according to a preset data collection rule;
An obtaining module 302, configured to filter and normalize the collected web data to obtain screened data;
A classification module 303, configured to classify the screened data using a preset classification model to obtain K classes of data;
A processing module 304, configured to perform high-speed processing on the classified K classes of data using an FPGA chip in a radio-frequency high-speed data processing board.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.
Claims (8)
1. A method for high-speed data processing, conversion, and communication, characterized in that the method comprises the steps of:
Collecting web data according to a preset data collection rule;
Filtering and normalizing the collected web data to obtain screened data;
Classifying the screened data using a preset classification model to obtain K classes of data;
Performing high-speed processing on the classified K classes of data using an FPGA chip in a radio-frequency high-speed data processing board.
2. The method for high-speed data processing, conversion, and communication according to claim 1, characterized in that collecting web data according to the preset data collection rule includes:
Customizing the web pages for data collection according to the target;
Determining the body data block of a web page according to the page structure, automatically generating a web data extraction template, and extracting the web data.
3. The method for high-speed data processing, conversion, and communication according to claim 1, characterized in that, after the step of filtering and normalizing the collected web data to obtain screened data, the method further includes:
Encoding each text segment of the screened data, comparing the data segment by segment according to the codes, and determining the degree of data duplication;
Normalizing the duplicate data and screening the data.
4. The method for high-speed data processing, conversion, and communication according to claim 1, characterized in that storing the data in a unified manner and building an index according to the classification and clustering results to form a large database includes:
According to the classification and clustering results, classifying the K classes of data, clustering the data contained in each data class, storing the data in a unified manner, and building an index to form a large database.
5. The method for high-speed data processing, conversion, and communication according to claim 1, characterized in that filtering the collected web data includes:
Filtering the collected web data using a Bloom filter.
6. The method for high-speed data processing, conversion, and communication according to claim 4, characterized in that, according to the classification results, the large database is divided into two levels, topic and data class, and two kinds of cluster analysis are carried out on that basis.
7. The method for high-speed data processing, conversion, and communication according to claim 4, characterized in that, according to the classification results, the large database is subdivided into four levels, topic, topic cluster, data class, and data class cluster, and four kinds of cluster analysis are carried out on that basis.
8. A device for high-speed data processing, conversion, and communication, characterized in that it comprises:
A collection module, configured to collect web data according to a preset data collection rule;
An obtaining module, configured to filter and normalize the collected web data to obtain screened data;
A classification module, configured to classify the screened data using a preset classification model to obtain K classes of data;
A processing module, configured to perform high-speed processing on the classified K classes of data using an FPGA chip in a radio-frequency high-speed data processing board.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810096708.XA CN108399205A (en) | 2018-01-31 | 2018-01-31 | Method and device for high-speed data processing, conversion, and communication |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108399205A (en) | 2018-08-14 |
Family
ID=63095837
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810096708.XA Withdrawn CN108399205A (en) | 2018-01-31 | 2018-01-31 | Method and device for high-speed data processing, conversion, and communication |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108399205A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112989791A (en) * | 2021-03-30 | 2021-06-18 | 北京拓普丰联信息工程有限公司 | Duplication eliminating method, system and medium based on text information extraction result |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102999465A (en) * | 2012-10-24 | 2013-03-27 | 绵阳市维博电子有限责任公司 | High-speed digital signal integrated processing device for wireless communication |
CN104182465A (en) * | 2014-07-21 | 2014-12-03 | 安徽华贞信息科技有限公司 | Network-based big data processing method |
CN106776693A (en) * | 2016-11-10 | 2017-05-31 | 福建中金在线信息科技有限公司 | A kind of website data acquisition method and device |
CN107391768A (en) * | 2017-09-12 | 2017-11-24 | 广州酷狗计算机科技有限公司 | Web data processing method, device, equipment and computer-readable recording medium |
CN107577724A (en) * | 2017-08-22 | 2018-01-12 | 佛山市高研信息技术有限公司 | A kind of big data processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20180814 |