A big data platform for tax public opinion analysis and full-text search
Technical field
The present invention relates to the field of big data technology, and in particular to a practical big data platform for tax public opinion analysis and full-text search.
Background art
To make full use of tax-related public opinion resources on the Internet and to build an "Internet plus tax public opinion" big data platform, users need a platform framework that covers the complete chain of public opinion big data collection, processing, analysis, storage, indexing, and search. On this basis, a big data platform for tax public opinion analysis and full-text search is provided.
Summary of the invention
The technical task of the present invention is to address the above deficiency by providing a practical big data platform for tax public opinion analysis and full-text search.
A big data platform for tax public opinion analysis and full-text search comprises, from bottom to top, a basic data layer, a big data collection layer, a big data storage layer, a big data analysis layer, a big data index layer, a big data search layer, and a big data application layer. The basic data layer holds the basic data sources inside the tax system that are relevant to public opinion. The big data collection layer provides users with public opinion collection modes and completes information collection for the basic data layer. The big data analysis layer processes, analyzes, and mines the data gathered by the collection layer and then sends the results to the big data storage layer. The big data storage layer stores the public opinion web page data and the basic data. The big data index layer builds inverted indexes over the data held in the storage layer, giving the search layer a data source that can be searched quickly. The big data search layer provides the full-text search engine for the big data. The big data application layer puts the searched tax public opinion data to effective use.
The basic data sources built into the basic data layer comprise a taxpayer registration library, a tax authority unit library, a policies and regulations library, and a tax category library. These basic data sources give the whole platform the basis for targeted collection of tax public opinion information from the network.
The big data collection layer provides collection modes comprising targeted collection, a web crawler, and a collection rule configuration module. Targeted collection gathers information specifically for the basic data. The web crawler captures information from public opinion websites; it is a deep web crawler that crawls from seed websites. The collection rule configuration module provides basic configuration of web page collection rules.
The big data analysis layer comprises three parts: preprocessing, word segmentation, and sentiment analysis. Preprocessing removes tags from the collected web pages, extracts the text, and removes noise. Word segmentation segments the text into words, removes stop words, recognizes entities, and constructs word vectors. Sentiment analysis analyzes the text through sentiment information extraction and sentiment information classification and determines its sentiment orientation: positive, negative, or neutral.
The big data storage layer comprises a public opinion web page library and a basic resource library. The public opinion web page library stores unstructured public opinion information, including the collected original web pages, the processed web pages, pictures, videos, formatted files, and the extracted web page content. The basic resource library supplements the information in the four libraries of the basic data layer; the supplementary information includes taxpayer official websites and news, tax authority unit official websites and news, web pages and documents carrying the original text of policies and regulations, and Internet resources such as encyclopedia entries on tax categories.
The big data index layer comprises incremental indexing, full indexing, index addition and deletion, and index update. Incremental indexing builds index entries incrementally for the data source. Full indexing rebuilds the entire index for the data source. Index addition and deletion adds or removes index entries. Index update modifies entries that have already been indexed.
The big data search layer comprises several modules: full-text search, synonym search, classified statistics, and fuzzy matching. Full-text search matches each character and each word of the query against the indexed content and returns the query results. Synonym search queries the synonyms of a word at the same time as the word itself. Classified statistics counts the search results by type or by another specified classification. Fuzzy matching segments the search phrase into words before querying.
The big data application layer comprises classified public opinion, negative public opinion, public opinion search, and public opinion reports. Classified public opinion is the subdivided combination of the four basic types of public opinion information (taxpayer, tax authority unit, policies and regulations, and tax category) with the positive, negative, and neutral groups. Negative public opinion provides real-time monitoring and tracking of negative public opinion. Public opinion search provides multi-condition search modes, including single-condition search, combined search, and advanced search. Public opinion reports provide statistical analysis reports by day, by week, and by month.
The big data platform for tax public opinion analysis and full-text search of the present invention has the following advantages:
The platform fully integrates internal tax data with Internet tax public opinion big data, so that tax departments can grasp, at an early stage, public opinion on the Internet about taxpayers, tax administration units, policies and regulations, and the various tax categories. This improves the ability of tax authorities to monitor and respond to public opinion. The platform is practical, widely applicable, and easy to promote.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall structure of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawing and specific embodiments.
The present invention provides a big data platform for tax public opinion analysis and full-text search. As shown in Fig. 1, the platform comprises, from bottom to top, a basic data layer, a big data collection layer, a big data storage layer, a big data analysis layer, a big data index layer, a big data search layer, and a big data application layer. The basic data layer holds the basic data sources inside the tax system that are relevant to public opinion. The big data collection layer provides users with public opinion collection modes and completes information collection for the basic data layer. The big data analysis layer processes, analyzes, and mines the data gathered by the collection layer and then sends the results to the big data storage layer. The big data storage layer stores the public opinion web page data and the basic data. The big data index layer builds inverted indexes over the data held in the storage layer, giving the search layer a data source that can be searched quickly. The big data search layer provides the full-text search engine for the big data. The big data application layer puts the searched tax public opinion data to effective use.
The basic data sources built into the basic data layer comprise a taxpayer registration library, a tax authority unit library, a policies and regulations library, and a tax category library. These basic data sources give the whole platform the basis for targeted collection of tax public opinion information from the network.
The big data collection layer provides collection modes comprising targeted collection, a web crawler, and a collection rule configuration module. Targeted collection gathers information specifically for the basic data. The web crawler captures information from public opinion websites; it is a deep web crawler that crawls from seed websites. The collection rule configuration module provides basic configuration of web page collection rules.
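As an illustration only, the crawl from seed websites under a configurable collection rule might be sketched as follows. The `CollectionRule` fields, the `crawl` function, and the example URLs are hypothetical names introduced here, and reading the deep crawler as a depth-bounded breadth-first crawl is an assumption. An in-memory link table stands in for real page fetching.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CollectionRule:
    """Hypothetical collection rule: seed sites, crawl depth, URL filter."""
    seeds: list
    max_depth: int = 2
    allowed_prefixes: tuple = ()


def crawl(rule, fetch_links):
    """Breadth-first crawl from the seeds, stopping at rule.max_depth.

    fetch_links(url) returns the outgoing links of a page; it stands in
    for a real HTTP fetch followed by link extraction.
    """
    visited = []
    seen = set(rule.seeds)
    queue = deque((url, 0) for url in rule.seeds)
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == rule.max_depth:
            continue  # do not follow links beyond the configured depth
        for link in fetch_links(url):
            allowed = (not rule.allowed_prefixes
                       or link.startswith(rule.allowed_prefixes))
            if allowed and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for public opinion websites (assumed data).
LINKS = {
    "http://news.example/": ["http://news.example/tax", "http://ads.example/x"],
    "http://news.example/tax": ["http://news.example/tax/vat"],
    "http://news.example/tax/vat": ["http://news.example/tax/vat/detail"],
}

rule = CollectionRule(seeds=["http://news.example/"], max_depth=2,
                      allowed_prefixes=("http://news.example/",))
pages = crawl(rule, lambda url: LINKS.get(url, []))
print(pages)
```

In this sketch the rule object plays the part of the collection rule configuration module: changing the seeds, depth, or prefix filter changes what is gathered without touching the crawler itself.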
The big data analysis layer comprises three parts: preprocessing, word segmentation, and sentiment analysis. Preprocessing removes tags from the collected web pages, extracts the text, and removes noise. Word segmentation segments the text into words, removes stop words, recognizes entities, and constructs word vectors. Sentiment analysis analyzes the text through sentiment information extraction and sentiment information classification and determines its sentiment orientation: positive, negative, or neutral.
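The three analysis steps can be illustrated with a minimal sketch. It is not the method of the invention: a small hand-written sentiment lexicon is swapped in for whatever extraction and classification technique is actually used, whitespace-style English tokenization replaces real word segmentation, and entity recognition and word vectors are left out. The word lists and function names are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Assumed toy vocabularies; a real deployment would use proper resources.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "was"}
POSITIVE = {"praised", "efficient", "welcome"}
NEGATIVE = {"evasion", "complaint", "fraud"}


class TextExtractor(HTMLParser):
    """Collects visible text and skips script/style content (noise)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def preprocess(html):
    """Remove tags and noise, then normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def tokenize(text):
    """Split into lowercase words and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def sentiment(tokens):
    """Label tokens as positive, negative, or neutral by lexicon score."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


page = ("<html><style>p{}</style>"
        "<p>Complaint filed: tax <b>evasion</b> suspected.</p></html>")
text = preprocess(page)
tokens = tokenize(text)
print(text, tokens, sentiment(tokens))
```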
The big data storage layer comprises a public opinion web page library and a basic resource library. The public opinion web page library stores unstructured public opinion information, including the collected original web pages, the processed web pages, pictures, videos, formatted files, and the extracted web page content. The basic resource library supplements the information in the four libraries of the basic data layer; the supplementary information includes taxpayer official websites and news, tax authority unit official websites and news, web pages and documents carrying the original text of policies and regulations, and Internet resources such as encyclopedia entries on tax categories.
The big data index layer comprises incremental indexing, full indexing, index addition and deletion, and index update. Incremental indexing builds index entries incrementally for the data source. Full indexing rebuilds the entire index for the data source. Index addition and deletion adds or removes index entries. Index update modifies entries that have already been indexed.
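The four index operations can be pictured on a toy in-memory inverted index. This is a sketch under stated assumptions, not the platform's index implementation: the `InvertedIndex` class, its method names, and the sample documents are all hypothetical.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: each term maps to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> token list

    def add(self, doc_id, tokens):
        """Index addition; an existing entry is replaced."""
        if doc_id in self.docs:
            self.delete(doc_id)
        self.docs[doc_id] = list(tokens)
        for term in tokens:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Index deletion: remove the document from every posting list."""
        for term in set(self.docs.pop(doc_id, [])):
            ids = self.postings[term]
            ids.discard(doc_id)
            if not ids:
                del self.postings[term]

    def update(self, doc_id, tokens):
        """Index update: re-index a document that is already indexed."""
        self.add(doc_id, tokens)

    def incremental(self, new_docs):
        """Incremental indexing: index only the newly arrived documents."""
        for doc_id, tokens in new_docs.items():
            self.add(doc_id, tokens)

    def full_rebuild(self, all_docs):
        """Full indexing: discard everything and rebuild from the data source."""
        self.postings.clear()
        self.docs.clear()
        self.incremental(all_docs)

    def search(self, term):
        return sorted(self.postings.get(term, ()))


index = InvertedIndex()
index.full_rebuild({1: ["vat", "refund"], 2: ["vat", "audit"]})
index.incremental({3: ["audit"]})
index.update(2, ["refund"])
print(index.search("vat"), index.search("audit"), index.search("refund"))
```

Incremental indexing touches only the new documents, whereas a full rebuild pays the cost of re-reading the whole data source; that trade-off is the usual reason for offering both operations.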
The big data search layer comprises several modules: full-text search, synonym search, classified statistics, and fuzzy matching. Full-text search matches each character and each word of the query against the indexed content and returns the query results. Synonym search queries the synonyms of a word at the same time as the word itself. Classified statistics counts the search results by type or by another specified classification. Fuzzy matching segments the search phrase into words before querying.
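A minimal sketch of the four search modules over a handful of toy documents is given below. The document set, the synonym table, and the function names are assumptions made for illustration, and ranking fuzzy matches by the number of matched words is one possible choice rather than a behaviour stated in this description.

```python
from collections import Counter

# Assumed toy corpus: each document has a type and an already cleaned text.
DOCS = {
    1: {"type": "taxpayer", "text": "company fined for tax evasion"},
    2: {"type": "policy", "text": "new vat rebate policy announced"},
    3: {"type": "policy", "text": "tax dodging rules tightened"},
}
SYNONYMS = {"evasion": {"dodging"}, "dodging": {"evasion"}}


def full_text(term):
    """Full-text search: documents whose text contains the term."""
    return sorted(i for i, d in DOCS.items() if term in d["text"].split())


def synonym_search(term):
    """Synonym search: query the term and its synonyms together."""
    terms = {term} | SYNONYMS.get(term, set())
    return sorted(i for i, d in DOCS.items() if terms & set(d["text"].split()))


def fuzzy_match(phrase):
    """Fuzzy matching: segment the phrase, then match any of its words.

    Results are ordered by number of matched words, then by document id.
    """
    words = set(phrase.lower().split())
    hits = {}
    for doc_id, doc in DOCS.items():
        matched = len(words & set(doc["text"].split()))
        if matched:
            hits[doc_id] = matched
    return sorted(hits, key=lambda i: (-hits[i], i))


def classified_statistics(doc_ids):
    """Classified statistics: count search results by document type."""
    return Counter(DOCS[i]["type"] for i in doc_ids)


results = synonym_search("evasion")
print(full_text("evasion"), results, fuzzy_match("vat rebate tax"),
      classified_statistics(results))
```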
The big data application layer comprises classified public opinion, negative public opinion, public opinion search, and public opinion reports. Classified public opinion is the subdivided combination of the four basic types of public opinion information (taxpayer, tax authority unit, policies and regulations, and tax category) with the positive, negative, and neutral groups. Negative public opinion provides real-time monitoring and tracking of negative public opinion. Public opinion search provides multi-condition search modes, including single-condition search, combined search, and advanced search. Public opinion reports provide statistical analysis reports by day, by week, and by month.
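The combination of four basic types with three sentiment groups yields twelve categories, and the reports amount to counting items per category and period. The sketch below illustrates this under assumptions: the item tuples, the period keys (ISO weeks for the weekly report), and the function names are hypothetical and not taken from this description.

```python
from collections import Counter
from datetime import date
from itertools import product

TYPES = ("taxpayer", "tax authority unit", "policies and regulations", "tax category")
SENTIMENTS = ("positive", "negative", "neutral")

# Classified public opinion: every type paired with every sentiment group.
CATEGORIES = list(product(TYPES, SENTIMENTS))


def period_key(day, period):
    """Map a date to its report bucket for the chosen period."""
    if period == "day":
        return day.isoformat()
    if period == "week":
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if period == "month":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown period: {period}")


def report(items, period):
    """Count items per (period bucket, type, sentiment)."""
    return Counter((period_key(day, period), kind, mood)
                   for day, kind, mood in items)


def negative_items(items):
    """Negative public opinion: the subset to monitor and track."""
    return [item for item in items if item[2] == "negative"]


# Assumed sample items: (publication date, basic type, sentiment label).
items = [
    (date(2024, 1, 1), "taxpayer", "negative"),
    (date(2024, 1, 3), "taxpayer", "negative"),
    (date(2024, 1, 8), "tax category", "positive"),
]
weekly = report(items, "week")
monthly = report(items, "month")
print(len(CATEGORIES), weekly, monthly, negative_items(items))
```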
In an actual design, the present invention performs secondary development and integration on Nutch, a distributed Internet data collection platform. Two collection modes are combined: targeted collection of public opinion web pages according to the tax authority's taxpayer registration library, policies and regulations library, and tax category library, and a free-roaming crawl that searches the whole network for matching tax public opinion web pages. The comprehensive tax public opinion big data collected in this way is stored in the HBase distributed storage system. The Hadoop platform then runs batch text preprocessing over the web document data set. Next, open-source Java natural language processing tools such as OpenNLP, FudanNLP, LingPipe, IKAnalyzer, and word2vec are integrated in Java, together with the Mahout algorithm library, to perform text word segmentation and sentiment analysis, so that every public opinion web page is labeled as positive, neutral, or negative. Finally, the SolrCloud distributed full-text search engine is used to create the indexes and to provide the full-text search engine with which users run full-text searches.
The above embodiment is only a specific case of the present invention. The scope of patent protection of the present invention includes but is not limited to the above embodiment; any suitable change or replacement that conforms to the claims of the big data platform for tax public opinion analysis and full-text search of the present invention and is made by a person of ordinary skill in the art shall fall within the scope of patent protection of the present invention.