CN106777395A - A topic discovery system based on community text data - Google Patents

A topic discovery system based on community text data

Info

Publication number
CN106777395A
CN106777395A CN201710115832.1A CN201710115832A CN106777395A CN 106777395 A CN106777395 A CN 106777395A CN 201710115832 A CN201710115832 A CN 201710115832A CN 106777395 A CN106777395 A CN 106777395A
Authority
CN
China
Prior art keywords
data
community
text
text data
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710115832.1A
Other languages
Chinese (zh)
Inventor
熊桂喜
朱宁
何滔
邹哲讷
赵再让
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201710115832.1A
Publication of CN106777395A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic discovery system based on community text data, comprising a mobile terminal service system and a server-side system. The mobile terminal service system includes a community text data upload module, which is responsible for collecting, extracting and uploading community text data: it receives the collected community text data, extracts the type of the data, and uploads the data and its type to the data preprocessing module of the server-side system. The types of community text data are TXT, HTML and XML. The server-side system includes a data preprocessing module, a vector extraction module, a hot topic extraction module, a data visualization module, and a data storage and management module. The invention lets community residents, community service personnel and urban management take part in community management, improves work efficiency, and provides efficient services for the smart management of communities.

Description

A topic discovery system based on community text data
Technical field
The invention relates to a topic discovery system and method based on community text data, and belongs to the field of computer and network technology applications.
Background
With the rapid construction and development of urban informatization, the national economy has advanced considerably, but rapid urbanization has also brought problems such as population management, urban traffic, environmental protection and public security, which hinder urban development. Community building is the foundation of urban construction, and improving the quality of community services directly affects the happiness index of residents' daily lives. Letting community residents truly enjoy the dividends of the smart city is the top priority of community building, so the data that communities produce should be fully used.
To understand community service needs in depth, analysis can start from text data. Current text data mining is mostly applied in public-opinion systems, or to the analysis of key figures or netizens' sentiment on currently popular microblogs. The data produced within communities and by city managers is not used effectively, and no relevant literature has been reported.
Summary of the invention
The technical problem to be solved by the invention: to overcome the deficiencies of the prior art by providing a topic discovery system and method based on community text data that can raise the level of community service, fits the characteristics of community management, and is easy to use, so that community residents, community service personnel and urban management can all take part in community management, work efficiency improves, and smart management of the community is achieved.
The first technical solution of the invention: a topic discovery system based on community text data, comprising a mobile terminal service system and a server-side system. The mobile terminal service system includes a community text data upload module, which is responsible for collecting, extracting and uploading community text data: it receives the collected community text data, extracts the type of the data, and uploads the data and its type to the data preprocessing module of the server-side system. The types of community text data are TXT, HTML and XML. The server-side system includes a data preprocessing module, a vector extraction module, a hot topic extraction module, a data visualization module, and a data storage and management module;
Data preprocessing module: reads the community text data uploaded by the community text data upload module, then cleans the data and performs Chinese word segmentation. Reading uses a different strategy for each data type: community text data in TXT format is read directly as a data stream using BufferedReader in Java, and community text data in HTML and XML formats is read using the DOM parsing model. Cleaning removes reports that are duplicated in the community text data. Chinese word segmentation cuts the cleaned community text data into a word feature vector composed of Chinese words;
Vector extraction module: responsible for representing community text data as vectors. The word feature vectors obtained from the data preprocessing module are trained on a Chinese corpus, keyword phrases are extracted, and keyword weights are calculated; the word feature vectors and keyword weights are then combined by weighted averaging to obtain the text feature vector. Word feature vector training uses the Java version of Word2Vec, and the Chinese corpus is a news corpus for the project's location (in this project, Shaanxi Province news and Zhejiang Province news, or the Sogou news corpus). Keyword phrases are extracted using TF-IDF feature extraction;
Hot topic extraction module: based on the keyword phrases and text feature vectors produced by the vector extraction module, clusters the texts using Single-Pass clustering. Once the class clusters are obtained, the keywords in the keyword phrases extracted by the vector extraction module are counted, and the counts are sorted in descending order to generate hot topics;
Data visualization module: the user interface, which handles external application and presentation tasks. It displays on the page the community text data obtained by the community text data upload module and the hot topic data obtained by the hot topic extraction module. The content displayed includes the overall data overview of the community text data, the data distribution overview, and the hot topics generated after data analysis. Hot topics are displayed in several forms, including reports, bar charts, line charts and maps, so that the data is shown intuitively. The overall data overview shows a summary of the data: the total number of data categories, the total data volume, and the totals for each region and theme. The data distribution overview shows how the data is distributed on a map, so that the geographic location of the data is shown intuitively. The data analysis overview presents the hot topics generated after processing to community managers using reports, charts and maps;
Data storage and management module: stores and manages the community text data uploaded by the community text data upload module and the related data produced by the data preprocessing module, the vector extraction module and the hot topic extraction module. The community text data uploaded by the upload module is stored in the HDFS file system, the word vector results trained in the vector extraction module are cached in the Redis cache database, and the data generated by the data preprocessing module and the hot topic extraction module is stored in the HBase database. The module manages the HDFS file system, the HBase database and the Redis cache database, performing create, delete, update and query operations on the data in them. It also supports scheduled tasks that update the cached data in the Redis cache database, maintains the index tables in the HBase database to optimize data queries, and merges and optimizes the storage of file blocks in the HDFS file system.
The community text data includes public-sentiment data and urban management data. The public-sentiment data includes community environmental sanitation theme data, community safety theme data, community parking and announcement theme data, and community management theme data; community management theme data refers to the daily enforcement activity of property management and grid officers. Public-sentiment data is uploaded by community residents and community grid officers. Urban management data is uploaded by urban management personnel and includes city appearance and environmental sanitation theme data, municipal theme data, urban greening theme data, urban planning theme data, and industrial and commercial administration theme data.
The second technical solution of the invention: a topic discovery method based on community text data, comprising a community text data upload step, a data preprocessing step, a vector extraction step, a hot topic extraction step, a data visualization step, and a data storage and management step. The community text data upload step is performed by the mobile terminal service system, which collects community text data, extracts the type of the data, and uploads the data and its type to the server-side system. The community text data includes public-sentiment data and urban management data. The public-sentiment data includes community environmental sanitation theme data, community safety theme data, community parking and announcement theme data, and community management theme data; community management theme data refers to the daily enforcement activity of property management and grid officers. Public-sentiment data is uploaded by community residents and community grid officers. Urban management data is uploaded by urban management personnel and includes city appearance and environmental sanitation theme data, municipal theme data, urban greening theme data, urban planning theme data, and industrial and commercial administration theme data. The types of community text data are TXT, HTML and XML. The server-side system executes the data preprocessing module, the vector extraction module, the hot topic extraction module, the data visualization module, and the data storage and management module;
Data preprocessing step: reads the community text data from the community text data upload step, then cleans the data and performs Chinese word segmentation. Reading uses a different strategy for each data type: community text data in TXT format is read directly as a data stream using BufferedReader (a buffered reader) in Java, and community text data in HTML and XML formats is read using the DOM (Document Object Model) parsing model. Cleaning removes reports that are duplicated in the community text data. Chinese word segmentation cuts the cleaned community text data into a word feature vector composed of Chinese words;
Vector extraction step: the word feature vectors obtained from the data preprocessing step are trained on a Chinese corpus, keyword phrases are extracted, and keyword weights are calculated; the word feature vectors and keyword weights are then combined by weighted averaging to obtain the text feature vector. Word feature vector training uses the Java version of Word2Vec, and the Chinese corpus is a news corpus for the project's location. Keyword phrases are extracted using TF-IDF (term frequency-inverse document frequency) feature extraction;
Hot topic extraction step: based on the keyword phrases and text feature vectors produced in the vector extraction step, the texts are clustered using Single-Pass clustering. Once the class clusters are obtained, the keywords in the keyword phrases extracted in the vector extraction step are counted, and the counts are sorted in descending order to generate hot topics;
Data visualization step: the user interface, which handles external application and presentation tasks. It displays on the page the community text data obtained in the community text data upload step and the hot topic data obtained in the hot topic extraction step. The content displayed includes the overall data overview of the community text data, the data distribution overview, and the hot topics generated after data analysis. Hot topics are displayed in several forms, including reports, bar charts, line charts and maps, so that the data is shown intuitively. The overall data overview shows a summary of the data: the total number of data categories, the total data volume, and the totals for each region and theme. The data distribution overview shows how the data is distributed on a map, so that the geographic location of the data is shown intuitively. The data analysis overview presents the hot topics generated after processing to community managers using reports, charts and maps;
Data storage and management step: stores and manages the community text data uploaded in the community text data upload step and the related data produced in the data preprocessing step, the vector extraction step and the hot topic extraction step. The community text data uploaded in the upload step is stored in the HDFS (distributed file system) file system, the word vector results trained in the vector extraction step are cached in the Redis cache database, and the data generated by the data preprocessing step and the hot topic extraction step is stored in the HBase database. The HDFS file system, the HBase (column-oriented distributed database) database and the Redis (key-value database) cache database are managed, with create, delete, update and query operations performed on the data in them. Scheduled tasks are also supported to update the cached data in the Redis cache database, the index tables in the HBase database are maintained to optimize data queries, and the storage of file blocks in the HDFS file system is merged and optimized. HDFS and HBase are accessed through their native APIs, and Redis is accessed through the packaged Jedis JAR.
Compared with the prior art, the invention has the following advantages:
(1) The invention aims to discover community hot topics. It combines text data uploaded by community residents and reports from urban management personnel with Internet data to extract topics from multiple dimensions. It uses relatively mature keyword extraction technology, introduces word vector technology, and selects and optimizes a clustering algorithm suited to the characteristics of community text to achieve a better clustering effect. The system is highly available, cross-platform and extensible, and supports browsing from both PCs and mobile phones. By combining public-sentiment reports and urban management event reports submitted from mobile phones, and working from the data characteristics of community text, it mines more valuable topic information and meets the needs of community residents and managers as far as possible. It turns passive information publishing into active service provision, realizes greater value from community text data, and contributes to building harmonious communities.
(2) The invention uses interaction between mobile terminals and a server to combine community residents, community managers and urban management, and proposes an efficient, stable hot topic discovery system with multi-terminal display that ultimately serves community residents and managers and greatly improves the quality of community services. Taking full account of the characteristics of community text, the invention builds a data support system that makes it convenient for community managers to assess the state of the community.
(3) The invention proposes the idea of community hot topic discovery. It uses Chinese word segmentation, keyword extraction and word vector extraction, represents texts as vectors, and finally uses the Single-Pass clustering algorithm to extract the keywords that occur most frequently in the community and form hot topics. The Single-Pass clustering algorithm was also optimized experimentally.
(4) The invention places low demands on system configuration and does not require excessive human input; it occupies few resources, is inexpensive, and is easy to popularize.
Brief description of the drawings
Fig. 1 is a schematic diagram of the system composition of the invention;
Fig. 2 is a flow chart of the data preprocessing module of the invention;
Fig. 3 is a flow chart of the vector extraction module of the invention;
Fig. 4 is a flow chart of the hot topic extraction module of the invention;
Fig. 5 is a flow chart of the data visualization module of the invention;
Fig. 6 is a structure diagram of the data storage and management module of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the invention comprises a mobile terminal service system and a server-side system. The mobile terminal service system includes a community text data upload module, which is responsible for collecting, extracting and uploading community text data: it receives the collected community text data, extracts the type of the data, and uploads the data and its type to the data preprocessing module of the server-side system. The types of community text data are TXT, HTML and XML. The server-side system includes a data preprocessing module, a vector extraction module, a hot topic extraction module, a data visualization module, and a data storage and management module.
As shown in Fig. 2, the data preprocessing module works in the following order. It first reads the public-sentiment data and urban management data, covering text data in TXT, HTML and XML formats. The data cleaning function then removes reports that are duplicated in the public-sentiment data and urban management data. Chinese word segmentation is performed last.
The data preprocessing module is implemented as follows:
(1) Text data is read first. The readable formats are TXT, HTML and XML, and the reading method depends on the file format: TXT data is read directly as a data stream using BufferedReader in Java, while HTML and XML data is read using the DOM parsing model.
(2) The data that has been read is then checked for duplicates: if the text data has been read before, it is discarded and the next item is read; if it has not, it proceeds to the word segmentation step.
(3) Finally, the text data is segmented. Chinese word segmentation uses the Java version of ICTLAS, an open-source Chinese word segmentation tool released by the Institute of Computing Technology, Chinese Academy of Sciences, to cut the text, so that the text data ends up as a feature vector composed of Chinese words.
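The reading and duplicate-removal steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name `ReportReader` and its methods are hypothetical, duplicates are assumed to be exact matches after trimming, and the standard JAXP DOM parser used here only accepts well-formed markup, so real-world HTML would likely need a more tolerant parser. Word segmentation is left out because it depends on an external tool.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: reads reports by format and rejects exact duplicates. */
final class ReportReader {
    private final Set<String> seen = new HashSet<>();

    /** TXT: read the character stream line by line through a BufferedReader. */
    static String readTxt(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** XML (or well-formed XHTML): build a DOM tree and take its text content. */
    static String readXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    /** Returns true if the text is new, false if the same text was seen before. */
    boolean acceptIfNew(String text) {
        return seen.add(text.trim());
    }

    public static void main(String[] args) {
        ReportReader reader = new ReportReader();
        String fromTxt = readTxt(new StringReader("Parking spaces are blocked"));
        String fromXml = readXml("<report><body>Parking spaces are blocked</body></report>");
        System.out.println(reader.acceptIfNew(fromTxt)); // first copy is kept
        System.out.println(reader.acceptIfNew(fromXml)); // repeated report is rejected
    }
}
```

Keeping the set of seen texts in memory is only practical for small batches; a hash of each report stored alongside the data would serve the same purpose at larger scale.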
As shown in Fig. 3, the data processing flow of the vector extraction module first checks whether the corpus has been updated. This check runs as a scheduled task once a month, because one round of word vector training takes about 4 hours; when an update is needed, it is also handled as an offline task. The core of the flow is the processing of text data: first, the keywords of the text are extracted using TF-IDF and a dictionary; then the vectors are extracted, and a weighted average is computed from the weights of three keywords and the feature vectors of those three keywords in the trained data set to obtain the feature vector of the text.
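As an illustration of the keyword extraction mentioned above, the following sketch computes TF-IDF weights for one segmented text and picks the highest-weighted keywords. The patent does not state which TF-IDF variant it uses, so the smoothed inverse document frequency here is an assumption, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical TF-IDF helper; the smoothing formula is an assumption. */
final class TfIdf {

    /** TF-IDF weight of every distinct word in doc; corpus should include doc. */
    static Map<String, Double> weights(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : doc) {
            counts.merge(word, 1, Integer::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            int docFreq = 0;
            for (List<String> other : corpus) {
                if (other.contains(entry.getKey())) {
                    docFreq++;
                }
            }
            double tf = entry.getValue() / (double) doc.size();
            // Smoothed IDF so that a word found in every document keeps weight tf.
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + docFreq)) + 1.0;
            result.put(entry.getKey(), tf * idf);
        }
        return result;
    }

    /** The k words with the largest weights; ties are broken alphabetically. */
    static List<String> topKeywords(Map<String, Double> weights, int k) {
        List<String> words = new ArrayList<>(weights.keySet());
        words.sort((a, b) -> {
            int byWeight = Double.compare(weights.get(b), weights.get(a));
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(k, words.size())));
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("parking", "parking", "noise", "the");
        List<List<String>> corpus = Arrays.asList(
            doc, Arrays.asList("the", "garbage"), Arrays.asList("the", "lighting"));
        System.out.println(topKeywords(weights(doc, corpus), 3));
    }
}
```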
The vector extraction module is implemented as follows:
(1) The first step reads the data produced by the preprocessing module and checks whether the corpus needs to be updated. If it does, the word vectors are retrained using the Java version of Word2Vec; if not, the flow moves on to keyword extraction.
(2) The second step extracts keywords using TF-IDF feature extraction and calculates their TF-IDF weights at the same time.
(3) The third step reads the word vector set trained in the first step. If a keyword is not in the word vector set, the next keyword is chosen; if it is, its word vector is read. The three keywords with the highest weights from the second step are selected together with their vectors.
(4) Based on the weights and the feature vectors of the three keywords obtained in the third step, a weighted average is computed to obtain the feature vector of the text. The text representation model based on semantic weighting is computed as follows. Suppose the text to be represented is t and its vector representation is y. The top three keywords of the text after the TF-IDF calculation are a1i, a2i and a3i, with TF-IDF values w1i, w2i and w3i and 50-dimensional word vectors x1 = {x1,1; x1,2; …; x1,50}, x2 = {x2,1; x2,2; …; x2,50} and x3 = {x3,1; x3,2; …; x3,50}. The vector y is then computed as the weighted average of x1, x2 and x3, using w1i, w2i and w3i as the weights.
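The weighted average in step (4) can be sketched as below. The exact formula is not reproduced in this text, so the sketch assumes the usual normalized form, y = (w1·x1 + w2·x2 + w3·x3) / (w1 + w2 + w3); the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: text vector as a weight-normalized average of keyword vectors. */
final class TextVector {

    /** Assumed form: y = (sum of w_k * x_k) / (sum of w_k). */
    static double[] weightedAverage(double[] weights, double[][] vectors) {
        if (weights.length == 0 || weights.length != vectors.length) {
            throw new IllegalArgumentException("need one weight per vector");
        }
        int dim = vectors[0].length;
        double total = 0.0;
        double[] y = new double[dim];
        for (int k = 0; k < vectors.length; k++) {
            if (vectors[k].length != dim) {
                throw new IllegalArgumentException("vectors must share one dimension");
            }
            total += weights[k];
            for (int j = 0; j < dim; j++) {
                y[j] += weights[k] * vectors[k][j];
            }
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("weights must not sum to zero");
        }
        for (int j = 0; j < dim; j++) {
            y[j] /= total;
        }
        return y;
    }

    public static void main(String[] args) {
        double[] tfidf = {1.0, 2.0, 1.0};                  // weights of three keywords
        double[][] wordVectors = {{1, 0}, {0, 1}, {1, 1}}; // toy 2-dimensional vectors
        System.out.println(Arrays.toString(weightedAverage(tfidf, wordVectors)));
    }
}
```

Dividing by the sum of the weights keeps the text vector on the same scale as the word vectors regardless of how large the TF-IDF values are.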
As shown in Fig. 4, the input to the hot topic extraction module is the feature vector of a text. If the input is the first data item, the first topic is created; otherwise the text is clustered with the Single-Pass clustering algorithm. During clustering, when the similarity to a class's center point is greater than Vc, the distances to the other points of that class are computed to obtain the similarity to that class; if this value is greater than Vn, the text is assigned to the corresponding topic, otherwise a new topic is created. Keyword counts are then tallied, hot topic phrases are obtained by sorting on the counts, and a hot topic word cloud is formed.
The hot topic extraction module is implemented as follows:
(1) The first step reads the text feature vector produced by the vector extraction module and checks whether it is the first data item. If it is, it is set up directly as a new topic; if not, the flow moves on to text clustering.
(2) The second step clusters using the Single-Pass clustering algorithm. In Single-Pass clustering, the similarity of a point to a class (the closer the distance, the greater the similarity) is generally measured in three ways: new_distance_max, ave_distance and inner_distance_max. The new_distance_max similarity is the maximum of the distances from the newly added point to all points inside the class. The ave_distance similarity is the average of the distances from the newly added point to all points inside the existing class. The inner_distance_max similarity is the maximum of the distances between the newly added point and all points inside the class. Three thresholds are defined, one for each similarity: THRESHOLD_NEW for new_distance_max, THRESHOLD_AVE for ave_distance, and THRESHOLD_INNER for inner_distance_max. Suppose the newly added text is Di and the existing class is M1. The new_distance_max similarity of Di to M1 is sim_new(Di, M1), the ave_distance similarity is sim_ave(Di, M1), and the inner_distance_max similarity is sim_inner(Di, M1). The combined similarity is then:
sim(Di, M1) = λ·sim_new(Di, M1) + μ·sim_ave(Di, M1) + η·sim_inner(Di, M1)
where the parameters satisfy λ+μ+η=1 and λ≥0, μ≥0, η≥0. When all three computed similarity values are below their thresholds, the similarity sim(Di, M1) of the newly added text Di to M1 is 0; when at least one of the three is not below its threshold, the linear combination λ·sim_new + μ·sim_ave + η·sim_inner is used. According to training results on the Sogou news corpus, under the constraints λ+μ+η=1 and λ≥0, μ≥0, η≥0, the clustering effect is best with λ=0.2, μ=0.5, η=0.3.
(3) The similarity between this text feature vector and the existing classes is computed. If the similarity to a class's center point is greater than Vc, the similarity values between this vector and the other points of that class are computed and the maximum is saved, giving the maximum similarity between this sample point and each class. The sample point is assigned to the class with which its similarity is largest, provided that value is greater than Vn; if the value is less than Vn, a new class is created. If the similarity to the center point is less than Vc, the text feature vector is compared with the other classes.
(4) If the text was assigned to an existing topic class in the second step, the keyword count map is updated, tallying how many times each keyword has appeared in total.
(5) After the fourth step, the flow checks whether this is the last data item. If it is, the keywords are sorted in descending order of their counts to form the hot topic phrases; if it is not, the flow returns to read the next text feature vector.
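The clustering and counting steps above can be sketched in simplified form. This is not the patented procedure: it swaps in cosine similarity to a running centroid with a single threshold, in place of the Vc and Vn thresholds and the three-measure combination described above. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified Single-Pass topic clustering (illustrative names throughout). */
final class SinglePassTopics {

    /** A text reduced to its feature vector and its extracted keywords. */
    static final class Doc {
        final double[] vector;
        final List<String> keywords;

        Doc(double[] vector, List<String> keywords) {
            this.vector = vector;
            this.keywords = keywords;
        }
    }

    /** One topic: a running centroid plus keyword occurrence counts. */
    static final class Topic {
        final double[] centroid;
        final Map<String, Integer> keywordCounts = new HashMap<>();
        int size = 0;

        Topic(Doc first) {
            centroid = new double[first.vector.length];
            add(first);
        }

        void add(Doc doc) {
            size++;
            for (int j = 0; j < centroid.length; j++) {
                centroid[j] += (doc.vector[j] - centroid[j]) / size; // incremental mean
            }
            for (String keyword : doc.keywords) {
                keywordCounts.merge(keyword, 1, Integer::sum);
            }
        }

        /** Keywords by descending count; ties are broken alphabetically. */
        List<String> hotKeywords() {
            List<String> words = new ArrayList<>(keywordCounts.keySet());
            words.sort((a, b) -> {
                int byCount = Integer.compare(keywordCounts.get(b), keywordCounts.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            });
            return words;
        }
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int j = 0; j < a.length; j++) {
            dot += a[j] * b[j];
            normA += a[j] * a[j];
            normB += b[j] * b[j];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** One pass over the texts: join the most similar topic or open a new one. */
    static List<Topic> cluster(List<Doc> docs, double threshold) {
        List<Topic> topics = new ArrayList<>();
        for (Doc doc : docs) {
            Topic best = null;
            double bestSim = Double.NEGATIVE_INFINITY;
            for (Topic topic : topics) {
                double sim = cosine(doc.vector, topic.centroid);
                if (sim > bestSim) {
                    bestSim = sim;
                    best = topic;
                }
            }
            if (best != null && bestSim > threshold) {
                best.add(doc);
            } else {
                topics.add(new Topic(doc)); // the first text always lands here
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc(new double[] {1.0, 0.0}, Arrays.asList("parking", "fee")),
            new Doc(new double[] {0.9, 0.1}, Arrays.asList("parking", "gate")),
            new Doc(new double[] {0.0, 1.0}, Arrays.asList("noise")));
        for (Topic topic : cluster(docs, 0.8)) {
            System.out.println(topic.size + " text(s): " + topic.hotKeywords());
        }
    }
}
```

Because each text is visited once and never reassigned, the result depends on input order, which is the usual trade-off of Single-Pass clustering for streaming data.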
As shown in Fig. 5, the data visualization module mainly handles requests from the application layer and the presentation layer. It displays the overall data overview for the whole system, the data distribution overview, and the hot topics generated after data analysis. Hot topics are presented using reports, charts and maps, and pictures can also accompany content on community life and services.
The data visualization module is implemented as follows:
(1) The first step is the display request: the user enters the URL in a browser to request the data display.
(2) The second step displays the home page content, which contains three subtitle links: overall data overview, data distribution overview, and data analysis overview.
(3) The third step is the three data display pages above, which present the data using line charts, bar charts, pie charts, heat maps and word clouds. The technologies used are HTML5, CSS, JavaScript and AJAX.
As shown in Figure 6, the data storage and management module stores and manages the data uploaded by the community text data upload module and the data produced by the data preprocessing module, the vector extraction module, and the hot topic extraction module. The original text data on HDFS comes from the resident-affairs data and municipal-administration data of the community text data upload module; the data cached in Redis is the word vector result trained in the vector extraction module; and the data in HBase is generated by the data preprocessing module and the hot topic extraction module. The data storage and management module stores the data produced by these modules and manages the HDFS file system, the HBase database, and the Redis cache database, performing create, delete, update, and query operations. HDFS and HBase are accessed through their native APIs, and Redis is accessed through the packaged Jedis JAR. The data storage and management module is implemented as follows:
(1) The original community document data is stored in the HDFS file system, and its storage path is stored in the HBase database.
(2) The trained word vectors are stored in the Redis database as key-value pairs.
(3) The intermediate result data obtained from preprocessing and the result data obtained after hot topics are formed are stored in the HBase database.
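The three storage steps above amount to a routing rule from data kind to storage backend. The following is a minimal sketch of that rule only; the class, enum, and method names are hypothetical, and no real HDFS, HBase, or Jedis call is made.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative routing table; all names here are hypothetical, not from the patent. */
final class StorageRouting {
    enum Store { HDFS, HBASE, REDIS }

    enum DataKind {
        RAW_DOCUMENT,          // original community document
        RAW_DOCUMENT_PATH,     // HDFS path of the original document
        WORD_VECTOR,           // trained word vector (key-value)
        PREPROCESSED_RESULT,   // intermediate result of preprocessing
        HOT_TOPIC_RESULT       // result after hot topics are formed
    }

    private static final Map<DataKind, Store> ROUTES = new EnumMap<>(DataKind.class);
    static {
        ROUTES.put(DataKind.RAW_DOCUMENT, Store.HDFS);         // step (1)
        ROUTES.put(DataKind.RAW_DOCUMENT_PATH, Store.HBASE);   // step (1)
        ROUTES.put(DataKind.WORD_VECTOR, Store.REDIS);         // step (2)
        ROUTES.put(DataKind.PREPROCESSED_RESULT, Store.HBASE); // step (3)
        ROUTES.put(DataKind.HOT_TOPIC_RESULT, Store.HBASE);    // step (3)
    }

    /** Returns the backend that holds the given kind of data. */
    static Store storeFor(DataKind kind) {
        return ROUTES.get(kind);
    }

    public static void main(String[] args) {
        for (DataKind kind : DataKind.values()) {
            System.out.println(kind + " -> " + storeFor(kind));
        }
    }
}
```

In a real deployment each branch would delegate to the corresponding client library; the table form simply keeps the placement decisions in one spot.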
The implementation of the invention has been described in detail above; the parts not described in detail belong to techniques well known in the art.

Claims (2)

1. A topic discovery system based on community text data, characterized in that: it comprises a mobile terminal service system and a server-side system; the mobile terminal service system includes a community text data upload module, which is responsible for collecting, extracting, and uploading community text data: it receives the collected community text data, extracts the type of the community text data, and uploads the community text data and its type to the data preprocessing module of the server-side system; the community text data includes resident-affairs data and municipal-administration data; the resident-affairs data includes community environmental sanitation topic data, community safety topic data, community parking and bulletin topic data, and community management topic data, where community management topic data refers to the daily enforcement activity of property management staff and grid workers; the resident-affairs data is uploaded by community residents and community grid workers; the municipal-administration data is uploaded by municipal-administration personnel and includes city appearance and environmental sanitation topic data, municipal topic data, urban greening topic data, urban planning topic data, and industrial and commercial administration topic data; the types of the community text data include TXT format, HTML format, and XML format; the server-side system includes a data preprocessing module, a vector extraction module, a hot topic extraction module, a data visualization module, and a data storage and management module;
Data preprocessing module: reads the community text data uploaded by the community text data upload module, and performs cleaning and Chinese word segmentation on the community text data; reading of the community text data uses a different reading strategy for each data type: community text data in TXT format is read directly as a data stream using BufferedReader in Java, and community text data in HTML and XML format is read using DOM parsing; the cleaning of the community text data removes repeated reports from the community text data; the Chinese word segmentation splits the cleaned community text data into word feature vectors composed of Chinese words;
Vector extraction module: is responsible for representing the community text data as vectors; the word feature vectors obtained from the data preprocessing module are trained on a Chinese corpus, keyword phrases are extracted, and keyword weights are calculated; the text feature vector is obtained as a weighted average that combines the word feature vectors with the keyword weights; the word feature vector training uses the Java version of Word2Vec, and the Chinese corpus is a news corpus from the locality of the project; keyword phrases are extracted using TF-IDF feature extraction;
Hot topic extraction module: based on the keyword phrases and the text feature vectors obtained in the vector extraction module, clusters the texts using Single-Pass clustering; after the clusters are obtained, the keywords in the keyword phrases extracted in the vector extraction module are counted for each cluster and, once counting is complete, arranged in descending order, thereby generating the hot topics;
Data visualization module: serves as the user interaction interface and completes external application and display tasks; the community text data obtained in the community text data upload module and the hot topic data obtained in the hot topic extraction module are displayed on the page; the displayed content includes the overall data volume of the community text data, the data distribution, and the hot topics generated by data analysis; hot topics are displayed intuitively in several forms, including tables, bar charts, and line charts, combined with maps; the overall data overview displays a summary of the data: the total number of data categories, the total data volume, and the totals for each district and each topic; the data distribution overview displays the data distribution on a map, intuitively showing the geographic location of the data; the data analysis overview presents the hot topics generated after processing to community administrators using tables, charts, and maps;
Data storage and management module: stores and manages the community text data uploaded by the community text data upload module and the related data produced by the data preprocessing module, the vector extraction module, and the hot topic extraction module; the community text data uploaded by the community text data upload module is stored in the HDFS file system, the word vector results trained in the vector extraction module are cached in the Redis cache database, and the data generated by the data preprocessing module and the hot topic extraction module is stored in the HBase database; the module manages the HDFS file system, the HBase database, and the Redis cache database, and performs create, delete, update, and query operations on the data in them; it also supports scheduled tasks that update the cached data in the Redis cache database, maintains the index tables in the HBase database to optimize data queries, and merges file blocks in the HDFS file system to optimize storage; HDFS and HBase are accessed through their native APIs, and Redis is accessed through the packaged Jedis JAR.
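The weighted-average step of the vector extraction module can be illustrated with a small sketch. It assumes word vectors and keyword weights are already available as in-memory maps; the class and method names are hypothetical, the TF-IDF formula shown (term frequency times the natural log of total documents over documents containing the term) is one common variant rather than the one the patent necessarily uses, and words lacking a vector or a weight are simply skipped.

```java
import java.util.List;
import java.util.Map;

/** Sketch of a TF-IDF weighted average of word vectors; names are hypothetical. */
final class TextVectorSketch {

    /** One common TF-IDF variant (an assumption): (count / docLength) * ln(totalDocs / docsWithTerm). */
    static double tfIdf(int countInDoc, int docLength, int totalDocs, int docsWithTerm) {
        double tf = (double) countInDoc / docLength;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    /** Weighted average of the vectors of the given words; returns a zero vector if nothing matches. */
    static double[] textVector(List<String> words,
                               Map<String, double[]> wordVectors,
                               Map<String, Double> weights,
                               int dim) {
        double[] sum = new double[dim];
        double totalWeight = 0.0;
        for (String word : words) {
            double[] vector = wordVectors.get(word);
            Double weight = weights.get(word);
            if (vector == null || weight == null) {
                continue; // no vector or no weight for this word
            }
            for (int i = 0; i < dim; i++) {
                sum[i] += weight * vector[i];
            }
            totalWeight += weight;
        }
        if (totalWeight > 0.0) {
            for (int i = 0; i < dim; i++) {
                sum[i] /= totalWeight;
            }
        }
        return sum;
    }
}
```

Dividing by the total weight keeps the text vector on the same scale as the word vectors, so texts of different lengths remain comparable in the later clustering stage.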
2. A topic discovery method based on community text data, characterized in that: it comprises a community text data upload step, a data preprocessing step, a vector extraction step, a hot topic extraction step, a data visualization step, and a data storage and management step, wherein the community text data upload step is performed by a mobile terminal service system, which collects the community text data, extracts the type of the community text data, and uploads the community text data and its type to a server-side system; the community text data includes resident-affairs data and municipal-administration data; the resident-affairs data includes community environmental sanitation topic data, community safety topic data, community parking and bulletin topic data, and community management topic data, where community management topic data refers to the daily enforcement activity of property management staff and grid workers; the resident-affairs data is uploaded by community residents and community grid workers; the municipal-administration data is uploaded by municipal-administration personnel and includes city appearance and environmental sanitation topic data, municipal topic data, urban greening topic data, urban planning topic data, and industrial and commercial administration topic data; the types of the community text data include TXT format, HTML format, and XML format; the server-side system executes the data preprocessing, vector extraction, hot topic extraction, data visualization, and data storage and management steps;
Data preprocessing step: reads the community text data from the community text data upload step, and performs cleaning and Chinese word segmentation on the community text data; reading of the community text data uses a different reading strategy for each data type: community text data in TXT format is read directly as a data stream using BufferedReader in Java, and community text data in HTML and XML format is read using DOM parsing; the cleaning of the community text data removes repeated reports from the community text data; the Chinese word segmentation splits the cleaned community text data into word feature vectors composed of Chinese words;
Vector extraction step: the word feature vectors obtained from the data preprocessing step are trained on a Chinese corpus, keyword phrases are extracted, and keyword weights are calculated; the text feature vector is obtained as a weighted average that combines the word feature vectors with the keyword weights; the word feature vector training uses the Java version of Word2Vec, and the Chinese corpus is a news corpus from the locality of the project; keyword phrases are extracted using TF-IDF feature extraction;
Hot topic extraction step: based on the keyword phrases and the text feature vectors obtained in the vector extraction step, the texts are clustered using Single-Pass clustering; after the clusters are obtained, the keywords in the keyword phrases extracted in the vector extraction step are counted for each cluster and, once counting is complete, arranged in descending order, thereby generating the hot topics;
Data visualization step: serves as the user interaction interface and completes external application and display tasks; the community text data obtained in the community text data upload step and the hot topic data obtained in the hot topic extraction step are displayed on the page; the displayed content includes the overall data volume of the community text data, the data distribution, and the hot topics generated by data analysis; hot topics are displayed intuitively in several forms, including tables, bar charts, and line charts, combined with maps; the overall data overview displays a summary of the data: the total number of data categories, the total data volume, and the totals for each district and each topic; the data distribution overview displays the data distribution on a map, intuitively showing the geographic location of the data; the data analysis overview presents the hot topics generated after processing to community administrators using tables, charts, and maps;
Data storage and management step: stores and manages the community text data uploaded in the community text data upload step and the related data produced in the data preprocessing step, the vector extraction step, and the hot topic extraction step; the community text data uploaded in the community text data upload step is stored in the HDFS file system, the word vector results trained in the vector extraction step are cached in the Redis cache database, and the data generated by the data preprocessing step and the hot topic extraction step is stored in the HBase database; the HDFS file system, the HBase database, and the Redis cache database are managed, and create, delete, update, and query operations are performed on the data in them; scheduled tasks are also supported to update the cached data in the Redis cache database, the index tables in the HBase database are maintained to optimize data queries, and file blocks in the HDFS file system are merged to optimize storage; HDFS and HBase are accessed through their native APIs, and Redis is accessed through the packaged Jedis JAR.
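As an illustration of the Single-Pass clustering and keyword ranking named in the topic extraction step, the sketch below assigns each text vector, in arrival order, to the most similar existing cluster or opens a new one. The claims do not specify the similarity measure, the threshold, or the centroid update, so cosine similarity, a caller-supplied threshold, and a running-mean centroid are assumptions; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of Single-Pass clustering plus keyword ranking; names and parameters are hypothetical. */
final class SinglePassSketch {

    /** Cosine similarity; returns 0 when either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns one cluster label per input vector, in input order. */
    static int[] cluster(List<double[]> vectors, double threshold) {
        List<double[]> centroids = new ArrayList<>();
        List<Integer> sizes = new ArrayList<>();
        int[] labels = new int[vectors.size()];
        for (int n = 0; n < vectors.size(); n++) {
            double[] v = vectors.get(n);
            int best = -1;
            double bestSim = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < centroids.size(); c++) {
                double sim = cosine(v, centroids.get(c));
                if (sim > bestSim) {
                    bestSim = sim;
                    best = c;
                }
            }
            if (best >= 0 && bestSim >= threshold) {
                // Join the closest cluster and update its centroid as a running mean.
                double[] centroid = centroids.get(best);
                int size = sizes.get(best);
                for (int i = 0; i < centroid.length; i++) {
                    centroid[i] = (centroid[i] * size + v[i]) / (size + 1);
                }
                sizes.set(best, size + 1);
                labels[n] = best;
            } else {
                // No cluster is similar enough: start a new one seeded with this vector.
                centroids.add(v.clone());
                sizes.add(1);
                labels[n] = centroids.size() - 1;
            }
        }
        return labels;
    }

    /** Counts the keywords of one cluster and returns them in descending order of count. */
    static List<Map.Entry<String, Integer>> rankKeywords(List<String> keywords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String keyword : keywords) {
            counts.merge(keyword, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((x, y) -> y.getValue() - x.getValue());
        return ranked;
    }
}
```

Because each text is examined once, the result depends on input order and on the threshold; both would need tuning on real community reports.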
CN201710115832.1A 2017-03-01 2017-03-01 A kind of topic based on community's text data finds system Pending CN106777395A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710115832.1A CN106777395A (en) 2017-03-01 2017-03-01 A kind of topic based on community's text data finds system

Publications (1)

Publication Number Publication Date
CN106777395A true CN106777395A (en) 2017-05-31

Family

ID=58960207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710115832.1A Pending CN106777395A (en) 2017-03-01 2017-03-01 A kind of topic based on community's text data finds system

Country Status (1)

Country Link
CN (1) CN106777395A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170531