CN106777395A - A topic discovery system based on community text data - Google Patents
A topic discovery system based on community text data
- Publication number
- CN106777395A CN201710115832.1A CN201710115832A
- Authority
- CN
- China
- Prior art keywords
- data
- community
- text
- text data
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a topic discovery system based on community text data, comprising a mobile terminal service system and a server-side system. The mobile terminal service system includes a community text data upload module, which is responsible for collecting, extracting, and uploading community text data: it receives the collected community text data, extracts its type, and uploads the data together with its type to the data preprocessing module of the server-side system. The types of community text data include TXT, HTML, and XML formats. The server-side system includes a data preprocessing module, a vector extraction module, a hot topic extraction module, a data visualization module, and a data storage and management module. The invention lets community residents, community service personnel, and urban management personnel take part in community management, improves working efficiency, and provides efficient service toward smart management of the community.
Description
Technical field
The present invention relates to a topic discovery system and method based on community text data, and belongs to the field of computer and network technology applications.
Background art
With the rapid construction and development of urban informatization, the national economy has made significant progress, but rapid urbanization has also brought problems such as population management, urban traffic, environmental protection, and public security, which have slowed the pace of urban development. Community construction is the foundation of urban construction, and improving the quality of community services bears directly on the happiness index of residents' daily lives. Letting community residents truly enjoy the benefits brought by the smart city is the primary task of community building, so the data produced by communities should be put to full use.
To understand the needs of community services in depth, they can be analyzed at the level of text data. Current text data mining work is generally found in public opinion systems, or in analyses of key figures or netizen sentiment on the currently popular microblogs. The data produced inside communities and by city managers is not used effectively, and no relevant literature has been reported.
Summary of the invention
The technical problem to be solved by the invention: to overcome the deficiencies of the prior art by providing a topic discovery system and method based on community text data that can raise the level of community service, fits the characteristics of community management, is easy to use, lets community residents, community service personnel, and urban management personnel take part in community management, improves working efficiency, and realizes smart management of the community.
First technical solution of the invention: a topic discovery system based on community text data, comprising a mobile terminal service system and a server-side system. The mobile terminal service system includes a community text data upload module, which is responsible for collecting, extracting, and uploading community text data: it receives the collected community text data, extracts its type, and uploads the data and its type to the data preprocessing module of the server-side system. The types of community text data include TXT, HTML, and XML formats. The server-side system includes a data preprocessing module, a vector extraction module, a hot topic extraction module, a data visualization module, and a data storage and management module;
Data preprocessing module: reads the community text data uploaded by the community text data upload module, then cleans it and performs Chinese word segmentation. A different reading strategy is used for each data type: TXT data is read directly as a data stream with Java's BufferedReader, while HTML and XML data is parsed with the DOM model. Cleaning removes repeated reports from the community text data. Chinese word segmentation cuts the cleaned community text data into word feature vectors composed of Chinese words;
Vector extraction module: responsible for representing community text data as vectors. Word vectors are trained on a Chinese corpus for the word feature vectors produced by the data preprocessing module, keyword groups are extracted, and keyword weights are calculated. The text feature vector is then obtained as a weighted average that combines the word vectors with the keyword weights. Word vector training uses the Java version of Word2Vec, and the Chinese corpus is a news corpus for the project location (in this project, Shaanxi Province news and Zhejiang Province news, or the Sogou news corpus). Keyword groups are extracted with TF-IDF feature extraction;
Hot topic extraction module: based on the keyword groups and text feature vectors produced by the vector extraction module, clusters the texts with Single-Pass clustering. Once the clusters are obtained, the keywords in the keyword groups extracted by the vector extraction module are counted and the counts are sorted in descending order, which yields the hot topics;
Data visualization module: the user interaction interface, which handles external application and display tasks. It displays on the page the community text data obtained by the community text data upload module and the hot topic data obtained by the hot topic extraction module. The displayed content includes the overall data overview of the community text data, the data distribution overview, and the hot topics generated by data analysis. Hot topics are shown through reports, bar charts, and line charts, combined with maps, so that the data is displayed intuitively. The overall data overview shows a summary of the data: the total number of data categories, the total data volume, and the totals for each region and topic. The data distribution overview shows how the data is distributed on a map, giving an intuitive view of its geographic location. The data analysis overview presents the hot topics generated by processing to community managers using reports, charts, and maps;
Data storage and management module: stores and manages the community text data uploaded by the community text data upload module and the related data produced by the data preprocessing, vector extraction, and hot topic extraction modules. Community text data uploaded by the upload module is stored in the HDFS file system, word vector results trained in the vector extraction module are cached in the Redis cache database, and data generated by the data preprocessing and hot topic extraction modules is stored in the HBase database. The module manages the HDFS file system, the HBase database, and the Redis cache database, and performs create, delete, update, and query operations on their data. It also supports scheduled tasks that refresh the cached data in Redis, maintains the index tables in HBase to optimize data queries, and merges the storage of file blocks in HDFS as an optimization.
Community text data includes resident condition data and urban management data. Resident condition data includes community environmental sanitation topic data, community safety topic data, community parking and announcement topic data, and community management topic data, where community management topic data refers to day-to-day enforcement by property management and grid workers. Resident condition data is uploaded by community residents and community grid workers. Urban management data is uploaded by urban management personnel and includes city appearance and environmental sanitation topic data, municipal topic data, urban greening topic data, urban planning topic data, and industrial and commercial administration topic data.
Second technical solution of the invention: a topic discovery method based on community text data, comprising a community text data upload step, a data preprocessing step, a vector extraction step, a hot topic extraction step, a data visualization step, and a data storage and management step. The community text data upload step is performed by the mobile terminal service system, which collects community text data, extracts its type, and uploads the data and its type to the server-side system. Community text data includes resident condition data and urban management data. Resident condition data includes community environmental sanitation topic data, community safety topic data, community parking and announcement topic data, and community management topic data, where community management topic data refers to day-to-day enforcement by property management and grid workers. Resident condition data is uploaded by community residents and community grid workers. Urban management data is uploaded by urban management personnel and includes city appearance and environmental sanitation topic data, municipal topic data, urban greening topic data, urban planning topic data, and industrial and commercial administration topic data. The types of community text data include TXT, HTML, and XML formats. The server-side system executes the data preprocessing module, vector extraction module, hot topic extraction module, data visualization module, and data storage and management module;
Data preprocessing step: reads the community text data from the community text data upload step, then cleans it and performs Chinese word segmentation. A different reading strategy is used for each data type: TXT data is read directly as a data stream with Java's BufferedReader (a buffered reader), while HTML and XML data is parsed with the DOM (Document Object Model) model. Cleaning removes repeated reports from the community text data. Chinese word segmentation cuts the cleaned community text data into word feature vectors composed of Chinese words;
Vector extraction step: word vectors are trained on a Chinese corpus for the word feature vectors produced by the data preprocessing step, keyword groups are extracted, and keyword weights are calculated. The text feature vector is obtained as a weighted average that combines the word vectors with the keyword weights. Word vector training uses the Java version of Word2Vec, and the Chinese corpus is a news corpus for the project location. Keyword groups are extracted with TF-IDF (term frequency-inverse document frequency) feature extraction;
Hot topic extraction step: based on the keyword groups and text feature vectors produced by the vector extraction step, the texts are clustered with Single-Pass clustering. Once the clusters are obtained, the keywords in the keyword groups extracted in the vector extraction step are counted and the counts are sorted in descending order, which yields the hot topics;
Data visualization step: the user interaction interface, which handles external application and display tasks. It displays on the page the community text data obtained in the community text data upload step and the hot topic data obtained in the hot topic extraction step. The displayed content includes the overall data overview of the community text data, the data distribution overview, and the hot topics generated by data analysis. Hot topics are shown through reports, bar charts, and line charts, combined with maps, so that the data is displayed intuitively. The overall data overview shows a summary of the data: the total number of data categories, the total data volume, and the totals for each region and topic. The data distribution overview shows how the data is distributed on a map, giving an intuitive view of its geographic location. The data analysis overview presents the hot topics generated by processing to community managers using reports, charts, and maps;
Data storage and management step: stores and manages the community text data uploaded in the community text data upload step and the related data produced in the data preprocessing, vector extraction, and hot topic extraction steps. Community text data uploaded in the upload step is stored in the HDFS (distributed file system) file system, word vector results trained in the vector extraction step are cached in the Redis cache database, and data generated by the data preprocessing and hot topic extraction steps is stored in the HBase database. The HDFS file system, the HBase (a column-oriented distributed database) database, and the Redis (a key-value database) cache database are managed, and create, delete, update, and query operations are performed on their data. Scheduled tasks are also supported to refresh the cached data in Redis, the index tables in HBase are maintained to optimize data queries, and the storage of file blocks in HDFS is merged as an optimization. HDFS and HBase are accessed through their native APIs, and Redis is accessed through the packaged Jedis JAR.
Compared with the prior art, the invention has the following advantages:
(1) The invention aims to discover hot topics in the community. It combines text data uploaded by community residents, reports from urban management personnel, and internet data to extract topics along multiple dimensions. It uses relatively mature keyword extraction technology, introduces word vector technology, and selects and optimizes a clustering algorithm for the characteristics of community text to achieve a better clustering result. The system is designed to be highly available, cross-platform, and extensible, and supports browsing from both PC and mobile phone. By combining resident condition reports and urban management event reports submitted from mobile phones, and by exploiting the data characteristics of community text, it mines more valuable topic information and meets the needs of community residents and managers as far as possible. It shifts from passive information release to active service provision, brings greater value out of community text data, and contributes to building a harmonious community.
(2) The invention uses interaction between mobile terminals and a server and combines community residents, community managers, and urban management personnel. It proposes an efficient, stable hot topic discovery system with multi-terminal display that ultimately serves community residents and managers and greatly improves the quality of community services. The invention takes the characteristics of community text fully into account and builds a data support system that helps community managers assess the state of the community.
(3) The invention proposes the idea of community hot topic discovery. It uses Chinese word segmentation, keyword extraction, and word vector extraction to represent texts as vectors, and then uses the Single-Pass clustering algorithm to extract the keywords that occur most frequently in the community, which form the hot topics. The experimental results of optimizing the Single-Pass clustering algorithm are:
(4) The invention places low demands on system configuration, needs little human input, occupies few resources, is inexpensive, and is easy to deploy widely.
Brief description of the drawings
Fig. 1 is a schematic diagram of the system composition of the invention;
Fig. 2 is a flowchart of the data preprocessing module of the invention;
Fig. 3 is a flowchart of the vector extraction module of the invention;
Fig. 4 is a flowchart of the hot topic extraction module of the invention;
Fig. 5 is a flowchart of the data visualization module of the invention;
Fig. 6 is a structural diagram of the data storage and management module of the invention.
Detailed description
The invention is further described below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the invention comprises a mobile terminal service system and a server-side system. The mobile terminal service system includes a community text data upload module, which is responsible for collecting, extracting, and uploading community text data: it receives the collected community text data, extracts its type, and uploads the data and its type to the data preprocessing module of the server-side system. The types of community text data include TXT, HTML, and XML formats. The server-side system includes a data preprocessing module, a vector extraction module, a hot topic extraction module, a data visualization module, and a data storage and management module.
As shown in Fig. 2, the data preprocessing module works through its steps in order. It first reads the resident condition data and urban management data, covering text data in TXT, HTML, and XML formats. The data cleaning function then removes repeated reports from the resident condition data and urban management data. Finally, Chinese word segmentation is performed.
The data preprocessing module is implemented as follows:
(1) Text data is read first. The readable formats are TXT, HTML, and XML, and the reading method depends on the file format: TXT is read directly as a data stream with Java's BufferedReader, while HTML and XML data is parsed with the DOM model.
(2) The data that has been read is then checked for duplicates: if the text has been read before, it is discarded and the next text is read; if it has not, it proceeds to the word segmentation step.
(3) Finally, the text data is segmented. Chinese word segmentation uses the Java version of ICTCLAS, an open-source Chinese word segmentation tool released by the Institute of Computing Technology, Chinese Academy of Sciences. It cuts the text and ultimately turns the text data into a feature vector composed of Chinese words.
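The three steps above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `PreprocessSketch` and the `Segmenter` interface are hypothetical, the segmenter is only a stand-in for the segmentation tool named in step (3) (whose API is not described in the text), and the JDK DOM parser used here only accepts well-formed XML or XHTML.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class PreprocessSketch {
    /** Hypothetical stand-in for a Chinese word segmenter. */
    interface Segmenter {
        List<String> segment(String text);
    }

    // Texts already read; used to reject repeated reports (step 2).
    private final Set<String> seen = new HashSet<>();

    /** Step 1, TXT: read the report line by line through a BufferedReader. */
    static String readTxt(Reader source) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line.trim()).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString().trim();
    }

    /** Step 1, XML: parse with the JDK DOM parser and return the text content. */
    static String readXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    /** Steps 2 and 3: drop texts seen before, segment the rest into word lists. */
    List<List<String>> process(List<String> texts, Segmenter segmenter) {
        List<List<String>> result = new ArrayList<>();
        for (String text : texts) {
            if (seen.add(text)) { // add() returns false for a repeated report
                result.add(segmenter.segment(text));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy whitespace segmenter, used only so the demo runs without external tools.
        Segmenter whitespace = t -> Arrays.asList(t.split("\\s+"));
        String a = readTxt(new StringReader("street light broken\n"));
        String b = readXml("<report><body>street light broken</body></report>");
        // The XML report repeats the TXT report, so only one word list is kept.
        System.out.println(new PreprocessSketch().process(Arrays.asList(a, b), whitespace));
    }
}
```

Exact string equality is the simplest duplicate test; a production system might instead normalize whitespace or compare hashes, which the text does not specify.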
As shown in Fig. 3, the data flow of the vector extraction module first checks whether the corpus has been updated. This check runs as a scheduled task once a month, because one round of word vector training takes about 4 hours; when an update is needed, it is also handled as an offline task. The core of the flow is the processing of text data: TF-IDF and a dictionary are first used to extract the keywords of the text, the word vectors of those keywords are then looked up in the trained data set, and a weighted average over the weights and feature vectors of three keywords gives the feature vector of the text.
The vector extraction module is implemented as follows:
(1) The first step reads the data produced by the preprocessing module and checks whether the corpus needs to be updated. If it does, the word vectors are retrained with the Java version of Word2Vec; if not, processing moves on to keyword extraction.
(2) The second step extracts keywords using TF-IDF feature extraction and computes their TF-IDF weights at the same time.
(3) The third step reads the word vector set trained in the first step. If a keyword is not in the word vector set, the next keyword is chosen; if it is, its word vector is read out. The three keywords with the highest weights from the second step are selected together with their vectors.
(4) Based on the weights and feature vectors of the three keywords obtained in the third step, a weighted average is computed to obtain the feature vector of the text. The text representation model based on semantic weighting is computed as follows. Suppose the text to be represented is $t$ and its vector representation is $y$. The top three keywords of the text after TF-IDF calculation are $a_{1i}$, $a_{2i}$, and $a_{3i}$, with TF-IDF values $w_{1i}$, $w_{2i}$, and $w_{3i}$ and corresponding vectors $x_1 = \{x_{1,1}; x_{1,2}; \ldots; x_{1,50}\}$, $x_2 = \{x_{2,1}; x_{2,2}; \ldots; x_{2,50}\}$, and $x_3 = \{x_{3,1}; x_{3,2}; \ldots; x_{3,50}\}$. The calculation formula is then as follows:
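The formula itself does not appear in this text. A minimal sketch of a weighted average consistent with the description in step (4) is given below, assuming the sum is normalized by the total weight, that is, y = (w1·x1 + w2·x2 + w3·x3) / (w1 + w2 + w3). The normalization and the names `TextVectorSketch` and `weightedAverage` are assumptions for illustration only.

```java
import java.util.Arrays;

final class TextVectorSketch {
    /**
     * Weighted average of keyword vectors.
     * vectors[i] is the word vector of keyword i, weights[i] its TF-IDF weight.
     */
    static double[] weightedAverage(double[][] vectors, double[] weights) {
        if (vectors.length == 0 || vectors.length != weights.length) {
            throw new IllegalArgumentException("need one weight per vector");
        }
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("weights must not sum to zero");
        }
        double[] y = new double[vectors[0].length];
        for (int i = 0; i < vectors.length; i++) {
            for (int j = 0; j < y.length; j++) {
                y[j] += weights[i] * vectors[i][j];
            }
        }
        for (int j = 0; j < y.length; j++) {
            y[j] /= total; // assumed normalization by the sum of the weights
        }
        return y;
    }

    public static void main(String[] args) {
        // Three toy 2-dimensional keyword vectors (the text uses 50 dimensions).
        double[][] x = {{1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0}};
        double[] w = {2.0, 1.0, 1.0};
        System.out.println(Arrays.toString(weightedAverage(x, w)));
    }
}
```

Normalizing by the total weight keeps the text vector on the same scale as the keyword vectors regardless of how large the TF-IDF values are.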
As shown in Fig. 4, the input to the hot topic extraction module is the feature vector of a text. If the input is the first record, the first topic is created; otherwise the text is clustered with the Single-Pass clustering algorithm. During clustering, when the similarity to a class's center point is greater than $V_c$, the distances to the other points of that class are calculated to obtain the distance to the class, that is, the similarity value. If this value is greater than $V_n$, the text is assigned to the corresponding topic; otherwise a new topic is created. Keyword counts are then tallied, the hot topic keyword group is obtained by sorting on the counts, and a hot topic word cloud is formed.
The hot topic extraction module is implemented as follows:
(1) The first step reads the text feature vector produced by the vector extraction module and checks whether it is the first record. If it is, a new topic is created directly; if not, the text clustering operation in the next step is performed.
(2) The second step clusters with the Single-Pass clustering algorithm. In Single-Pass clustering there are generally three measures of the similarity of a point to a class (the nearer the distance, the greater the similarity): new_distance_max, ave_distance, and inner_distance_max. They are explained as follows. new_distance_max similarity: the maximum of the distances from the newly added point to all points inside the class. ave_distance similarity: the average of the distances between the newly added point and all points inside the existing class. inner_distance_max similarity: the maximum of the distances between the newly added point and all points inside the class. Three thresholds are defined, one for each of these similarities: THRESHOLD_NEW is the threshold for new_distance_max similarity, THRESHOLD_AVE is the threshold for ave_distance similarity, and THRESHOLD_INNER is the threshold for inner_distance_max similarity. Suppose the newly added text is $D_i$ and the existing class is $M_1$. The new_distance_max similarity of $D_i$ to $M_1$ is $\mathrm{sim}_{new}(D_i, M_1)$, the ave_distance similarity is $\mathrm{sim}_{ave}(D_i, M_1)$, and the inner_distance_max similarity is $\mathrm{sim}_{inner}(D_i, M_1)$. The averaged similarity is then calculated as:
$$\mathrm{sim}(D_i, M_1) = \lambda\,\mathrm{sim}_{new}(D_i, M_1) + \mu\,\mathrm{sim}_{ave}(D_i, M_1) + \eta\,\mathrm{sim}_{inner}(D_i, M_1)$$
where the parameters satisfy $\lambda + \mu + \eta = 1$ and $\lambda \ge 0$, $\mu \ge 0$, $\eta \ge 0$. When all three computed similarity values are below their thresholds, the similarity $\mathrm{sim}(D_i, M_1)$ of the newly added text $D_i$ to $M_1$ is 0; when at least one of the three is not below its threshold, the linear combination $\lambda\,\mathrm{sim}_{new} + \mu\,\mathrm{sim}_{ave} + \eta\,\mathrm{sim}_{inner}$ applies. According to training results on the Sogou news corpus, under these constraints the clustering result is best with $\lambda = 0.2$, $\mu = 0.5$, $\eta = 0.3$.
(3) The similarity between this text feature vector and the existing classes must then be computed. If the similarity to a class's center point is greater than $V_c$, the similarity between this point and the other points of that class is computed and the maximum similarity value is kept, giving the maximum similarity between this sample point and each class. If the largest class similarity is greater than $V_n$, the sample point is assigned to that class; if it is less than $V_n$, a new class is created. If the similarity to the center point is less than $V_c$, the text feature vector is compared with the other classes.
(4) If the text was assigned to an existing topic class in the second step, the keyword count map must be updated by counting how many times each keyword has appeared in total.
(5) Finally, it must be determined whether this is the last record. If it is, the keyword counts are sorted in descending order to form the hot topic keyword group; if not, the flow jumps back to the input of the next text feature vector.
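The combined similarity rule from step (2) can be sketched as follows. The three individual similarity values are taken as inputs, since the text does not give enough detail to compute them unambiguously; the class and method names are hypothetical, and the values in `main` other than λ = 0.2, μ = 0.5, η = 0.3 are made up for illustration.

```java
final class CombinedSimilaritySketch {
    /**
     * Returns 0 when all three similarities are below their thresholds,
     * otherwise the linear combination lambda*simNew + mu*simAve + eta*simInner.
     */
    static double combined(double simNew, double simAve, double simInner,
                           double thresholdNew, double thresholdAve, double thresholdInner,
                           double lambda, double mu, double eta) {
        if (lambda < 0 || mu < 0 || eta < 0 || Math.abs(lambda + mu + eta - 1.0) > 1e-9) {
            throw new IllegalArgumentException("weights must be non-negative and sum to 1");
        }
        boolean allBelow = simNew < thresholdNew
                && simAve < thresholdAve
                && simInner < thresholdInner;
        if (allBelow) {
            return 0.0;
        }
        return lambda * simNew + mu * simAve + eta * simInner;
    }

    public static void main(String[] args) {
        // Weights reported in the text; similarity and threshold values are illustrative.
        System.out.println(combined(0.9, 0.8, 0.7, 0.5, 0.5, 0.5, 0.2, 0.5, 0.3));
        System.out.println(combined(0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.2, 0.5, 0.3));
    }
}
```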
As shown in Fig. 5, the data visualization module is mainly responsible for handling requests from the application layer and presentation layer. It displays the overall data overview, the data distribution overview, and the hot topics generated by data analysis for the whole system. Hot topics are presented with reports and charts combined with maps, and pictures can also accompany items about community life and services.
The data visualization module is implemented as follows:
(1) The first step is the system display request: the user enters the URL in a browser to request the data display.
(2) The second step shows the home page, which carries three subheading links: overall data overview, data distribution overview, and data analysis overview.
(3) The third step shows the three data display pages above, which present the data using line charts, bar charts, pie charts, heat maps, and word clouds. The technologies used are HTML5, CSS, JavaScript, and AJAX.
As shown in Figure 6, the data storage and management module stores and manages the data uploaded by the community text data upload module and the data produced by the data preprocessing module, the vector extraction module, and the hot topic extraction module. The data on HDFS comes from the original text data in the community text data upload module, namely the public-conditions data and the urban management data; the data cached in Redis is the word vector result trained in the vector extraction module; and the data in HBase is generated by the data preprocessing module and the hot topic extraction module. The data storage and management module stores the data produced by these modules and manages the HDFS file system, the HBase database, and the Redis cache database, performing insert, delete, update, and query operations. HDFS and HBase are accessed through their native APIs, and Redis is accessed through the packaged Jedis JAR. The data storage and management module is implemented in the following steps:
(1) The original community document data is stored in the HDFS file system, and its storage path is stored in the HBase database.
(2) The trained word vectors are stored in the Redis database as key-value pairs.
(3) The intermediate results obtained from preprocessing and the result data produced after hot topics are formed are stored in the HBase database.
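The three storage steps above amount to a routing rule: raw documents go to the file system, their paths and all derived results go to the table store, and word vectors go to the key-value cache. The Python sketch below mimics that routing with plain in-memory dictionaries; it does not talk to HDFS, HBase, or Redis, and the class name `CommunityStore`, the path layout, the column names, and the `wv:` key prefix are all hypothetical.

```python
class CommunityStore:
    """In-memory stand-in for the HDFS / HBase / Redis split described in the text."""

    def __init__(self):
        self.hdfs = {}   # path -> raw document text (stand-in for HDFS files)
        self.hbase = {}  # row key -> {column: value} (stand-in for an HBase table)
        self.redis = {}  # key -> string value (stand-in for the Redis cache)

    def save_raw_document(self, doc_id, text):
        """Step (1): store the raw document, then record its path in the table store."""
        path = f"/community/raw/{doc_id}.txt"  # assumed path layout
        self.hdfs[path] = text
        self.hbase.setdefault(doc_id, {})["raw_path"] = path
        return path

    def save_word_vector(self, word, vector):
        """Step (2): store a trained word vector as a key-value pair."""
        self.redis[f"wv:{word}"] = ",".join(repr(x) for x in vector)

    def load_word_vector(self, word):
        """Read a word vector back from the cache, or None if it is absent."""
        raw = self.redis.get(f"wv:{word}")
        return None if raw is None else [float(x) for x in raw.split(",")]

    def save_result(self, doc_id, column, value):
        """Step (3): store an intermediate or final result in the table store."""
        self.hbase.setdefault(doc_id, {})[column] = value


if __name__ == "__main__":
    store = CommunityStore()
    store.save_raw_document("d1", "Cars are blocking the community entrance.")
    store.save_word_vector("parking", [0.5, -1.25])
    store.save_result("d1", "tokens", ["cars", "blocking", "entrance"])
    print(store.hbase["d1"], store.load_word_vector("parking"))
```

Storing only the path in the table store keeps rows small while the large raw files stay in the file system, which is the arrangement step (1) describes.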
The implementation of the invention has been described in detail above; the parts not described in detail belong to techniques well known in the art.
Claims (2)
1. A topic discovery system based on community text data, characterized in that it comprises a mobile terminal service system and a server-side system; the mobile terminal service system comprises a community text data upload module, which is responsible for collecting, extracting, and uploading community text data: it receives the collected community text data, extracts the type of the community text data, and uploads the community text data and its type to the data preprocessing module of the server-side system; the community text data comprises public-conditions data and urban management data; the public-conditions data comprises community environmental sanitation topic data, community safety topic data, community parking and announcement topic data, and community management topic data, where community management topic data refers to the day-to-day enforcement activity of property management and community grid workers, and the public-conditions data is uploaded by community residents and community grid workers; the urban management data is uploaded by urban management personnel and comprises city appearance and environmental sanitation topic data, municipal topic data, urban greening topic data, urban planning topic data, and industrial and commercial administration topic data; the types of the community text data include TXT, HTML, and XML formats; the server-side system comprises a data preprocessing module, a vector extraction module, a hot topic extraction module, a data visualization module, and a data storage and management module;
Data preprocessing module: reads the community text data uploaded by the community text data upload module, then cleans it and performs Chinese word segmentation on it; reading uses a different strategy for each data type: community text data in TXT format is read directly as a data stream using BufferedReader in Java, while community text data in HTML and XML formats is parsed using the DOM parsing model; cleaning removes duplicate reports from the community text data; Chinese word segmentation cuts the cleaned community text data into word feature vectors composed of Chinese words;
Vector extraction module: responsible for representing community text data as vectors; trains the word feature vectors obtained from the data preprocessing module on a Chinese corpus, extracts keyword phrases, and calculates keyword weights; computes a weighted average of the word feature vectors and the keyword weights to obtain the text feature vector; word feature vector training uses the Java version of Word2Vec, and the Chinese corpus is the news corpus of the project's locality; keyword phrases are extracted using TF-IDF feature extraction;
Hot topic extraction module: based on the keyword phrases and text feature vectors produced by the vector extraction module, clusters the texts using Single-Pass clustering; after the clusters are obtained, counts the keywords in the keyword phrases extracted by the vector extraction module and sorts them in descending order once counting is complete, thereby generating hot topics;
Data visualization module: the user interaction interface, which handles external applications and display tasks and displays on the page the community text data obtained by the community text data upload module and the hot topic data obtained by the hot topic extraction module; the displayed content includes the total data overview of the community text data, the data distribution overview, and the hot topics generated after data analysis; hot topics are displayed intuitively using several forms, including tables, bar charts, line charts, and maps; the overall data overview shows a summary of the data, the total number of data categories, the total data volume, and the totals for each district and topic; the data distribution overview shows the data distribution on a map, indicating the geographic location of the data intuitively; the data analysis overview presents the hot topics generated after processing to community administrators using reports and charts combined with maps;
Data storage and management module: stores and manages the community text data uploaded by the community text data upload module and the related data produced by the data preprocessing module, the vector extraction module, and the hot topic extraction module; the community text data uploaded by the community text data upload module is stored in the HDFS file system, the word vector results trained in the vector extraction module are cached in the Redis cache database, and the data generated by the data preprocessing module and the hot topic extraction module is stored in the HBase database; the module manages the HDFS file system, the HBase database, and the Redis cache database, performing insert, delete, update, and query operations on their data; it also supports scheduled tasks that update the cached data in the Redis cache database, maintains the index tables in the HBase database to optimize data queries, and merges the file blocks stored in the HDFS file system as an optimization; HDFS and HBase are accessed through their native APIs, and Redis is accessed through the packaged Jedis JAR.
2. A topic discovery method based on community text data, characterized in that it comprises a community text data upload step, a data preprocessing step, a vector extraction step, a hot topic extraction step, a data visualization step, and a data storage and management step, wherein the community text data upload step is performed by a mobile terminal service system: the mobile terminal service system collects community text data, extracts the type of the community text data, and uploads the community text data and its type to a server-side system; the community text data comprises public-conditions data and urban management data; the public-conditions data comprises community environmental sanitation topic data, community safety topic data, community parking and announcement topic data, and community management topic data, where community management topic data refers to the day-to-day enforcement activity of property management and community grid workers, and the public-conditions data is uploaded by community residents and community grid workers; the urban management data is uploaded by urban management personnel and comprises city appearance and environmental sanitation topic data, municipal topic data, urban greening topic data, urban planning topic data, and industrial and commercial administration topic data; the types of the community text data include TXT, HTML, and XML formats; the server-side system executes the data preprocessing module, the vector extraction module, the hot topic extraction module, the data visualization module, and the data storage and management module;
Data preprocessing step: reads the community text data from the community text data upload step, then cleans it and performs Chinese word segmentation on it; reading uses a different strategy for each data type: community text data in TXT format is read directly as a data stream using BufferedReader in Java, while community text data in HTML and XML formats is parsed using the DOM parsing model; cleaning removes duplicate reports from the community text data; Chinese word segmentation cuts the cleaned community text data into word feature vectors composed of Chinese words;
Vector extraction step: trains the word feature vectors obtained from the data preprocessing step on a Chinese corpus, extracts keyword phrases, and calculates keyword weights; computes a weighted average of the word feature vectors and the keyword weights to obtain the text feature vector; word feature vector training uses the Java version of Word2Vec, and the Chinese corpus is the news corpus of the project's locality; keyword phrases are extracted using TF-IDF feature extraction;
Hot topic extraction step: based on the keyword phrases and text feature vectors produced in the vector extraction step, clusters the texts using Single-Pass clustering; after the clusters are obtained, counts the keywords in the keyword phrases extracted in the vector extraction step and sorts them in descending order once counting is complete, thereby generating hot topics;
Data visualization step: the user interaction interface, which handles external applications and display tasks and displays on the page the community text data obtained in the community text data upload step and the hot topic data obtained in the hot topic extraction step; the displayed content includes the total data overview of the community text data, the data distribution overview, and the hot topics generated after data analysis; hot topics are displayed intuitively using several forms, including tables, bar charts, line charts, and maps; the overall data overview shows a summary of the data, the total number of data categories, the total data volume, and the totals for each district and topic; the data distribution overview shows the data distribution on a map, indicating the geographic location of the data intuitively; the data analysis overview presents the hot topics generated after processing to community administrators using reports and charts combined with maps;
Data storage and management step: stores and manages the community text data uploaded in the community text data upload step and the related data produced in the data preprocessing step, the vector extraction step, and the hot topic extraction step; the community text data uploaded in the community text data upload step is stored in the HDFS file system, the word vector results trained in the vector extraction step are cached in the Redis cache database, and the data generated in the data preprocessing step and the hot topic extraction step is stored in the HBase database; the step manages the HDFS file system, the HBase database, and the Redis cache database, performing insert, delete, update, and query operations on their data; it also supports scheduled tasks that update the cached data in the Redis cache database, maintains the index tables in the HBase database to optimize data queries, and merges the file blocks stored in the HDFS file system as an optimization; HDFS and HBase are accessed through their native APIs, and Redis is accessed through the packaged Jedis JAR.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710115832.1A CN106777395A (en) | 2017-03-01 | 2017-03-01 | A kind of topic based on community's text data finds system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710115832.1A CN106777395A (en) | 2017-03-01 | 2017-03-01 | A kind of topic based on community's text data finds system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106777395A true CN106777395A (en) | 2017-05-31 |
Family
ID=58960207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710115832.1A Pending CN106777395A (en) | 2017-03-01 | 2017-03-01 | A kind of topic based on community's text data finds system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106777395A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108376175A (en) * | 2018-03-02 | 2018-08-07 | 成都睿码科技有限责任公司 | Visualization method for displaying news events |
CN109359302A (en) * | 2018-10-26 | 2019-02-19 | 重庆大学 | A kind of optimization method of field term vector and fusion sort method based on it |
CN109525740A (en) * | 2018-10-12 | 2019-03-26 | 成都北科维拓科技有限公司 | A kind of event-handling method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177024A (en) * | 2011-12-23 | 2013-06-26 | 微梦创科网络科技(中国)有限公司 | Method and device of topic information show |
CN105022840A (en) * | 2015-08-18 | 2015-11-04 | 新华网股份有限公司 | News information processing method, news recommendation method and related devices |
CN105138592A (en) * | 2015-07-31 | 2015-12-09 | 武汉虹信技术服务有限责任公司 | Distributed framework-based log data storing and retrieving method |
CN105718590A (en) * | 2016-01-27 | 2016-06-29 | 福州大学 | Multi-tenant oriented SaaS public opinion monitoring system and method |
US20160196174A1 (en) * | 2015-01-02 | 2016-07-07 | Tata Consultancy Services Limited | Real-time categorization of log events |
CN106202566A (en) * | 2016-08-02 | 2016-12-07 | 山东鲁能软件技术有限公司 | A kind of magnanimity electricity consumption data mixing based on big data storage system and method |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108376175A (en) * | 2018-03-02 | 2018-08-07 | 成都睿码科技有限责任公司 | Visualization method for displaying news events |
CN108376175B (en) * | 2018-03-02 | 2022-05-13 | 成都睿码科技有限责任公司 | Visualization method for displaying news events |
CN109525740A (en) * | 2018-10-12 | 2019-03-26 | 成都北科维拓科技有限公司 | A kind of event-handling method and system |
CN109525740B (en) * | 2018-10-12 | 2021-01-26 | 成都北科维拓科技有限公司 | Event processing method and system |
CN109359302A (en) * | 2018-10-26 | 2019-02-19 | 重庆大学 | A kind of optimization method of field term vector and fusion sort method based on it |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11580104B2 (en) | Method, apparatus, device, and storage medium for intention recommendation | |
CN105468605B (en) | Entity information map generation method and device | |
Bozarth et al. | Toward a better performance evaluation framework for fake news classification | |
CN103324665B (en) | Hot spot information extraction method and device based on micro-blog | |
Sankaranarayanan et al. | Twitterstand: news in tweets | |
CN104182517B (en) | The method and device of data processing | |
CN103226578B (en) | Towards the website identification of medical domain and the method for webpage disaggregated classification | |
CN111104511B (en) | Method, device and storage medium for extracting hot topics | |
Zhou et al. | Real-time news cer tification system on sina weibo | |
CN105320719B (en) | A kind of crowd based on item label and graphics relationship raises website item recommended method | |
CN106383887A (en) | Environment-friendly news data acquisition and recommendation display method and system | |
CN104834693A (en) | Depth-search-based visual image searching method and system thereof | |
CN102546771A (en) | Cloud mining network public opinion monitoring system based on characteristic model | |
CN102622443A (en) | Customized screening system and method for microblog | |
CN104182389A (en) | Semantic-based big data analysis business intelligence service system | |
CN102446225A (en) | Real-time search method, device and system | |
CN107291886A (en) | A kind of microblog topic detecting method and system based on incremental clustering algorithm | |
CN105718590A (en) | Multi-tenant oriented SaaS public opinion monitoring system and method | |
CN105378730A (en) | Social media content analysis and output | |
CN104142995A (en) | Social event recognition method based on visual attributes | |
CN108647322A (en) | The method that word-based net identifies a large amount of Web text messages similarities | |
CN103412903B (en) | The Internet of Things real-time searching method and system predicted based on object of interest | |
CN106294473B (en) | Entity word mining method, information recommendation method and device | |
CN103886020A (en) | Quick search method of real estate information | |
CN102855245A (en) | Image similarity determining method and image similarity determining equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170531 |