CN108399205A - High-speed data processing and conversion communication method and device

Info

Publication number
CN108399205A
Authority
CN
China
Prior art keywords
data, web, high-speed processing, communication method, screened data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810096708.XA
Other languages
Chinese (zh)
Inventor
李永敢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Gaicheng Intellectual Property Service Co Ltd
Original Assignee
Foshan Gaicheng Intellectual Property Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-01-31
Filing date
2018-01-31
Publication date
2018-08-14
Application filed by Foshan Gaicheng Intellectual Property Service Co Ltd
Priority to CN201810096708.XA
Publication of CN108399205A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the present invention provides a high-speed data processing and conversion communication method, comprising the steps of: collecting web page data according to a preset data collection rule; filtering and normalizing the collected web page data to obtain screened data; classifying the screened data using a preset classification model to obtain K classes of data; and performing high-speed processing on the K classes of data using the FPGA chip in a radio-frequency high-speed data processing board. With this embodiment, web page data can be extracted effectively and duplicate information can be normalized, which makes it easier for users to make effective use of web page data.

Description

High-speed data processing and conversion communication method and device
Technical field
The present invention relates to the field of electronic technology, and in particular to a high-speed data processing and conversion communication method and device.
Background art
With the spread of computers and the rapid development of the Internet (WWW), a large amount of information now reaches people in the form of electronic documents. To meet the serious challenge posed by this information explosion, there is an urgent need for automated tools that help people quickly find the information they really need in massive information sources. Research on information extraction (Information Extraction) arose against this background.
The main function of an information extraction system is to extract specific factual information from text. For example, from news reports it can extract the details of a terrorist incident: time, place, perpetrator, victim, target, weapon used, and so on; from business news, the circumstances of a company releasing a new product: company name, product name, release time, product performance, and so on; from a patient's medical records: symptoms, diagnostic records, test results, prescriptions, and so on. The extracted information is usually described in structured form and can be stored directly in a database for users to query and for further analysis and use.
Information extraction is an emerging research field. It generally refers to the process of automatically identifying preset types of information, such as entities, relations, and events, in a given document collection, and of storing and managing that information in structured form. Information extraction has important applications in many fields.
A research area closely related to information extraction is information retrieval, but the two differ, mainly in three respects:
1. Different functions. An information retrieval system mainly finds, in a large document collection, a list of documents relevant to the user's needs; an information extraction system aims to obtain the factual information the user is interested in directly from the text.
2. Different processing techniques. An information retrieval system usually relies on techniques such as statistics and keyword matching, treats a text as a set of words (bag of words), and does not need to analyze or understand the text in depth; information extraction often has to draw on natural language processing techniques and can only be completed after the sentences and discourse of the text have been analyzed.
3. Different areas of application. Because the techniques differ, information retrieval systems are usually domain-independent, whereas information extraction systems are domain-dependent and can only extract the limited kinds of factual information preset in the system.
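The contrast drawn in the three points above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: `retrieve` ranks documents by keyword overlap over a bag of words, while `extract_release` pulls a structured record out of one hypothetical sentence pattern ("<Company> released <Product> on <date>"). Neither function nor the pattern comes from the patent.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word occurrences, ignoring word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents):
    """Information retrieval: return ids of documents that share words with
    the query, best match first. No understanding of the text is needed."""
    query_words = bag_of_words(query)
    scored = []
    for doc_id, text in documents.items():
        bag = bag_of_words(text)
        score = sum(bag[word] for word in query_words)
        if score > 0:
            scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical, domain-specific pattern: it only knows product releases.
RELEASE_PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+) released (?P<product>[A-Z][\w ]*?) "
    r"on (?P<date>\d{4}-\d{2}-\d{2})"
)


def extract_release(text):
    """Information extraction: return a structured record, or None."""
    match = RELEASE_PATTERN.search(text)
    return match.groupdict() if match else None
```

The retrieval function returns whole documents and works for any subject, whereas the extraction function returns database-ready fields but only for the single kind of fact it was written for, which mirrors the domain dependence described in point 3.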
On the other hand, information retrieval and information extraction are complementary. To handle massive amounts of text, an information extraction system usually takes the output of an information retrieval system (such as text filtering) as its input, and information extraction techniques can in turn be used to improve the performance of information retrieval systems. Combining the two serves users' information processing needs better.
Although information extraction requires some degree of understanding of the text, it still differs from true text understanding (Text Understanding). In information extraction, the user generally cares only about a limited set of factual information of interest, not about deep understanding problems such as nuances of meaning or the author's intent [1]. Information extraction can therefore be regarded only as a shallow, or simplified, text understanding technique.
Generally, the objects processed by an information extraction system are natural language texts, especially unstructured texts. Broadly speaking, however, besides electronic text the objects can also be data of other media types such as speech, images, and video. Here we discuss only information extraction in the narrow sense, that is, information extraction from natural language text.
In recent years, as networks have developed, the amount of information on the Internet has kept growing. Almost all network information is presented to users as structured or semi-structured text. Web page information extraction extracts the relevant information contained in web pages and structures it, giving it a table-like organization. Its main task is to extract predetermined information points from various web pages and then integrate them in a unified form that is convenient to inspect and compare.
On the Internet, information on the same subject is usually scattered across different websites and presented in different forms, and in the prior art it is difficult to mine the expected web pages completely. In addition, information on the Internet is reposted frequently, so how to normalize duplicate information is also a key issue.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a high-speed data processing and conversion communication method and device that can extract web page data effectively and normalize duplicate information, making it easier for users to make effective use of web page data.
To achieve the above purpose, an embodiment of the present invention provides a high-speed data processing and conversion communication method, comprising the steps of:
collecting web page data according to a preset data collection rule;
filtering and normalizing the collected web page data to obtain screened data;
classifying the screened data using a preset classification model to obtain K classes of data;
performing high-speed processing on the K classes of data using the FPGA chip in a radio-frequency high-speed data processing board.
Optionally, collecting web page data according to the preset data collection rule includes:
customizing the web pages for data collection according to the target;
determining the web page body data block according to the web page structure, automatically generating a web page data extraction template, and extracting the web page data.
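The patent does not say how the body data block is determined or what an extraction template looks like. As a minimal sketch, assume the body block is the `<div>` that directly holds the most text and that the "template" is simply the `id` of that `<div>`; both are assumptions made for illustration, and the helper names below are hypothetical.

```python
from html.parser import HTMLParser


class BlockTextCollector(HTMLParser):
    """Collect the text found under each <div>, keyed by the div's id."""

    def __init__(self):
        super().__init__()
        self.stack = []  # ids of the <div> elements that are currently open
        self.text = {}   # div id -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Divs without an id get a generated placeholder name.
            div_id = dict(attrs).get("id") or f"anon{len(self.text)}"
            self.stack.append(div_id)
            self.text.setdefault(div_id, [])

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text is credited to the innermost open <div>.
        if self.stack and data.strip():
            self.text[self.stack[-1]].append(data.strip())


def collect_blocks(html):
    """Return a mapping of div id -> joined text for one page."""
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    return {div_id: " ".join(parts) for div_id, parts in parser.text.items()}


def build_template(sample_html):
    """Pick the div with the most text as the body block (assumed heuristic).
    Raises ValueError if the sample page contains no <div> at all."""
    blocks = collect_blocks(sample_html)
    return max(blocks, key=lambda div_id: len(blocks[div_id]))


def extract_with_template(html, template_id):
    """Apply the template to another page that shares the same layout."""
    return collect_blocks(html).get(template_id, "")
```

A template built once from a sample page can then be reused on other pages of the same site, which is the practical benefit of generating templates automatically rather than writing one per page.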
Optionally, after the step of filtering and normalizing the collected web page data to obtain screened data, the method further includes:
encoding each segment of the text of the screened data, comparing the texts segment by segment according to the codes, and determining the degree of data duplication; normalizing the duplicate data and screening the data.
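The segment encoding and comparison step can be sketched as follows. The patent names neither the encoding nor the duplication measure, so this example assumes that a segment is a line of text, that its code is a SHA-256 hash of the whitespace- and case-normalized line, and that the duplication degree is the fraction of one text's segments found in the other. The 0.8 threshold is likewise an illustrative value.

```python
import hashlib


def encode_segments(text):
    """Split the text into lines and encode each non-empty line as a hash
    of its lowercased, whitespace-collapsed form."""
    codes = []
    for segment in text.split("\n"):
        normalized = " ".join(segment.lower().split())
        if normalized:
            codes.append(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return codes


def duplication_degree(text_a, text_b):
    """Fraction of the segments of text_a whose code also occurs in text_b."""
    codes_a = encode_segments(text_a)
    codes_b = set(encode_segments(text_b))
    if not codes_a:
        return 0.0
    return sum(code in codes_b for code in codes_a) / len(codes_a)


def normalize_duplicates(documents, threshold=0.8):
    """Keep the first document of every group of near-duplicates."""
    kept = []
    for doc in documents:
        if all(duplication_degree(doc, other) < threshold for other in kept):
            kept.append(doc)
    return kept
```

Comparing codes instead of raw text means a reposted article is still detected when only a few paragraphs were added or removed, because the unchanged segments keep the same code.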
Optionally, storing the data in a unified manner and building an index according to the classification and clustering results to form a large database includes:
according to the classification and clustering results, clustering the K classes of data, clustering the data contained in each data class, and storing the data in a unified manner and building an index, to form a large database.
Optionally, filtering the collected web page data includes:
filtering the collected web page data using a Bloom filter.
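The patent names the Bloom filter but not what it filters on. A common use in web collection is to skip pages that have already been seen, so the sketch below assumes the filter is keyed by URL. The class, its default sizes, and the `filter_new_pages` helper are illustrative, not taken from the patent.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def filter_new_pages(urls, seen):
    """Return the URLs not yet recorded in the filter, recording them as we go."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

Because a Bloom filter can report a false positive, a small share of genuinely new pages may be skipped; the bit array size and number of hash functions trade memory against that error rate.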
Optionally, according to the classification results, the large database is divided into two levels, topic and data class, and two kinds of cluster analysis are performed on this basis.
Optionally, according to the classification results, the large database is subdivided into four levels, namely topic, topic cluster, data class, and data class cluster, and four kinds of cluster analysis are performed on this basis.
A high-speed data processing and conversion communication device, characterized by comprising:
a collection module, for collecting web page data according to a preset data collection rule;
an obtaining module, for filtering and normalizing the collected web page data to obtain screened data;
a classification module, for classifying the screened data using a preset classification model to obtain K classes of data;
a processing module, for performing high-speed processing on the K classes of data using the FPGA chip in a radio-frequency high-speed data processing board.
Beneficial effects:
The high-speed data processing and conversion communication method and device provided by the embodiments of the present invention extract web page data efficiently and with good recall, avoiding missed information; effectively eliminate duplicate information, which greatly reduces the space the data occupies, removes redundancy, lightens the load on subsequent processing, and improves data processing efficiency; and classify and cluster the data with a preset classification model and clustering algorithm, store the data in a unified manner, build a database, and build a database index, which makes it easier for users to manage, retrieve, and use the extracted data.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a first flow diagram of the high-speed data processing and conversion communication method.
Fig. 2 is a second flow diagram of the data processing method.
Fig. 3 is a structural diagram of the data processing device.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The present invention is described in detail below by way of specific embodiments.
Referring to Fig. 1, a flow diagram of the high-speed data processing and conversion communication method provided by the present invention, the method includes the following steps:
S101, collecting web page data according to a preset data collection rule;
S102, filtering and normalizing the collected web page data to obtain screened data;
S103, classifying the screened data using a preset classification model to obtain K classes of data;
S104, performing high-speed processing on the K classes of data using the FPGA chip in a radio-frequency high-speed data processing board.
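Steps S101 to S104 chain into a simple pipeline, sketched below in plain Python. Every function is a stand-in chosen for illustration: the collection rule is a URL predicate, the "preset classification model" is a set of keywords per class, and `process_high_speed` merely counts items in software where the patent hands the data to an FPGA board, whose interface the patent does not describe.

```python
import re


def collect(pages, rule):
    """S101: keep the pages whose URL satisfies the collection rule."""
    return [page for page in pages if rule(page["url"])]


def filter_and_normalize(pages):
    """S102: drop empty pages and collapse pages with identical text."""
    seen, screened = set(), []
    for page in pages:
        text = " ".join(page["text"].split())
        if text and text not in seen:
            seen.add(text)
            screened.append({"url": page["url"], "text": text})
    return screened


def classify(pages, model):
    """S103: assign each page to the class whose keywords it matches most.
    `model` maps a class label to a set of lowercase keywords (assumption)."""
    classes = {label: [] for label in model}
    for page in pages:
        words = set(re.findall(r"\w+", page["text"].lower()))
        best = max(model, key=lambda label: len(words & model[label]))
        classes[best].append(page)
    return classes


def process_high_speed(classes):
    """S104 placeholder: a software stand-in for the FPGA stage that only
    reports how many items each of the K classes contains."""
    return {label: len(items) for label, items in classes.items()}


def run_pipeline(pages, rule, model):
    """Run S101 to S104 in order and return the per-class result."""
    collected = collect(pages, rule)
    screened = filter_and_normalize(collected)
    return process_high_speed(classify(screened, model))
```

Here K is simply the number of labels in `model`; a page that matches no keywords falls into the first class, which a real classifier would handle with an explicit "other" class.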
In this embodiment, the web pages for data collection are customized according to the target. There are two kinds of sources for the collected web pages; referring to Fig. 2, they are:
S201, presetting industry web pages as data sources;
S202, setting up a network probe with a built-in domain ontology, which automatically discovers ontology-related web pages to serve as collection points.
Presetting the data sources focuses on the web pages the user expects, which makes the direction of web data extraction more targeted and helps improve collection efficiency. The collection points supplement the data sources and can improve the recall of data collection. Because data sources and collection points complement each other, collection efficiency and recall can reach a reasonably good balance.
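How preset data sources and probe-discovered collection points combine can be sketched as below. The patent does not specify how the probe measures relevance to the domain ontology, so this example assumes the ontology is a non-empty set of lowercase single-word terms and that a page qualifies when it mentions at least half of them. The function names and the 0.5 threshold are hypothetical.

```python
import re


def ontology_score(text, ontology_terms):
    """Share of the ontology terms that occur in the page text."""
    words = set(re.findall(r"\w+", text.lower()))
    return len(words & ontology_terms) / len(ontology_terms)


def choose_pages(preset_sources, candidate_pages, ontology_terms, threshold=0.5):
    """Start from the preset data sources (S201), then add every candidate
    page that the probe judges ontology-related (S202)."""
    chosen = list(preset_sources)
    for url, text in candidate_pages.items():
        if url not in chosen and ontology_score(text, ontology_terms) >= threshold:
            chosen.append(url)
    return chosen
```

The preset list keeps collection targeted, while the score-based additions are what raise recall, matching the balance described above.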
The text is encoded segment by segment and compared segment by segment, which effectively reveals the degree of text duplication and avoids omissions.
In this embodiment, storing the data in a unified manner and building an index according to the classification and clustering results to form a large database is specifically divided into:
clustering the N data classes;
clustering the data contained in each data class.
Further, filtering the collected web page data includes:
filtering the collected web page data using a Bloom filter.
According to the classification results, the database is divided into two levels, topic and data class, and two kinds of cluster analysis are performed on this basis. The database can also be subdivided into four levels (topic, topic cluster, data class, and data class cluster) to further establish an indexing mechanism, making management, retrieval, and use of the database more convenient for users.
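A two-level index over topic and data class might look like the following sketch. It assumes that each stored record already carries a `topic` and a `data_class` label produced by the earlier classification and clustering, and that data classes nest inside topics; the patent does not define how the levels relate, so the field names and the nesting are illustrative only.

```python
from collections import defaultdict


def build_index(records):
    """Index record ids by topic, then by data class within each topic."""
    index = defaultdict(lambda: defaultdict(list))
    for record in records:
        index[record["topic"]][record["data_class"]].append(record["id"])
    return index


def lookup(index, topic, data_class=None):
    """Return the ids under a topic, optionally narrowed to one data class."""
    classes = index.get(topic, {})
    if data_class is None:
        return sorted(i for ids in classes.values() for i in ids)
    return list(classes.get(data_class, []))
```

Extending this to the four-level variant would add one more nesting layer on each side (topic cluster above topic, data class cluster below data class) while keeping the same lookup pattern.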
In addition, the present invention also provides a high-speed data processing and conversion communication device, including:
a collection module 301, for collecting web page data according to a preset data collection rule;
an obtaining module 302, for filtering and normalizing the collected web page data to obtain screened data;
a classification module 303, for classifying the screened data using a preset classification model to obtain K classes of data;
a processing module 304, for performing high-speed processing on the K classes of data using the FPGA chip in a radio-frequency high-speed data processing board.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.
The above are only preferred embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims (8)

1. A high-speed data processing and conversion communication method, characterized in that the method comprises the steps of:
collecting web page data according to a preset data collection rule;
filtering and normalizing the collected web page data to obtain screened data;
classifying the screened data using a preset classification model to obtain K classes of data;
performing high-speed processing on the K classes of data using the FPGA chip in a radio-frequency high-speed data processing board.
2. The high-speed data processing and conversion communication method according to claim 1, characterized in that collecting web page data according to the preset data collection rule includes:
customizing the web pages for data collection according to the target;
determining the web page body data block according to the web page structure, automatically generating a web page data extraction template, and extracting the web page data.
3. The high-speed data processing and conversion communication method according to claim 1, characterized in that, after the step of filtering and normalizing the collected web page data to obtain screened data, the method further includes:
encoding each segment of the text of the screened data, comparing the texts segment by segment according to the codes, and determining the degree of data duplication; normalizing the duplicate data and screening the data.
4. The high-speed data processing and conversion communication method according to claim 1, characterized in that storing the data in a unified manner and building an index according to the classification and clustering results to form a large database includes:
according to the classification and clustering results, clustering the K classes of data, clustering the data contained in each data class, and storing the data in a unified manner and building an index, to form a large database.
5. The high-speed data processing and conversion communication method according to claim 1, characterized in that filtering the collected web page data includes:
filtering the collected web page data using a Bloom filter.
6. The high-speed data processing and conversion communication method according to claim 4, characterized in that, according to the classification results, the large database is divided into two levels, topic and data class, and two kinds of cluster analysis are performed on this basis.
7. The high-speed data processing and conversion communication method according to claim 4, characterized in that, according to the classification results, the large database is subdivided into four levels, namely topic, topic cluster, data class, and data class cluster, and four kinds of cluster analysis are performed on this basis.
8. A high-speed data processing and conversion communication device, characterized by comprising:
a collection module, for collecting web page data according to a preset data collection rule;
an obtaining module, for filtering and normalizing the collected web page data to obtain screened data;
a classification module, for classifying the screened data using a preset classification model to obtain K classes of data;
a processing module, for performing high-speed processing on the K classes of data using the FPGA chip in a radio-frequency high-speed data processing board.
CN201810096708.XA 2018-01-31 2018-01-31 High-speed data processing and conversion communication method and device Withdrawn CN108399205A (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
CN201810096708.XA CN108399205A (en) 2018-01-31 2018-01-31 High-speed data processing and conversion communication method and device

Applications Claiming Priority (1)

Application Number Publication Number Priority Date Filing Date Title
CN201810096708.XA CN108399205A (en) 2018-01-31 2018-01-31 High-speed data processing and conversion communication method and device

Publications (1)

Publication Number Publication Date
CN108399205A (en) 2018-08-14

Family

ID=63095837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810096708.XA High-speed data processing and conversion communication method and device (Withdrawn, CN108399205A (en)) 2018-01-31 2018-01-31

Country Status (1)

Country Link
CN (1) CN108399205A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989791A (en) * 2021-03-30 2021-06-18 北京拓普丰联信息工程有限公司 Duplication eliminating method, system and medium based on text information extraction result

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999465A (en) * 2012-10-24 2013-03-27 绵阳市维博电子有限责任公司 High-speed digital signal integrated processing device for wireless communication
CN104182465A (en) * 2014-07-21 2014-12-03 安徽华贞信息科技有限公司 Network-based big data processing method
CN106776693A (en) * 2016-11-10 2017-05-31 福建中金在线信息科技有限公司 A kind of website data acquisition method and device
CN107391768A (en) * 2017-09-12 2017-11-24 广州酷狗计算机科技有限公司 Web data processing method, device, equipment and computer-readable recording medium
CN107577724A (en) * 2017-08-22 2018-01-12 佛山市高研信息技术有限公司 A kind of big data processing method

Similar Documents

Publication Publication Date Title
CN104182465A (en) Network-based big data processing method
CN107577724A (en) A kind of big data processing method
CN103186663A (en) Video-based online public opinion monitoring method and system
CN108304382B (en) Quality analysis method and system based on text data mining in manufacturing process
Madichetty Identification of medical resource tweets using majority voting-based ensemble during disaster
CN104834739B (en) Internet information storage system
Ouyang et al. Sentistory: multi-grained sentiment analysis and event summarization with crowdsourced social media data
CN105512300B (en) information filtering method and system
CN108280213A (en) A kind of analysis system of big data
Subramani et al. Extracting actionable knowledge from domestic violence discourses on social media
KR20130037975A (en) Method and apparatus for providing web trend analysis based on issue template extraction
CN108399205A (en) A kind of data high-speed processing conversion communication means and device
EP3535661A2 (en) A system for managing, analyzing, navigating or searching of data information across one or more sources within a computer or a computer network, without copying, moving or manipulating the source or the data information stored in the source
CN112559480A (en) Distributed data set computing method and system in parallel computing scene
CN117171244A (en) Enterprise data management system based on data middle platform construction and data analysis method thereof
KR102025813B1 (en) Device and method for chronological big data curation system
KR20090033150A (en) Ontology based index method and search engine using the same
CN111026940A (en) Network public opinion and risk information monitoring system and electronic equipment for power grid electromagnetic environment
Kaufhold et al. Big data and multi-platform social media services in disaster management
CN206224473U (en) Information collection system
CN115221319A (en) Self-defined event early warning monitoring method
CN114841155A (en) Intelligent theme content aggregation method and device, electronic equipment and storage medium
Li et al. Keyword analysis and topic extraction of hospital violence news
CN107577690A (en) The recommendation method and recommendation apparatus of magnanimity information data
CN113971213A (en) Smart city management public information sharing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180814