CN109948161A - Data processing method and device for Chinese public sentiment

Info

Publication number: CN109948161A
Application number: CN201910213894.5A
Authority: CN
Inventors: 迟耀明
Original assignee: Beijing Deep-Sea Giant Whale Mdt Infotech Ltd
Current assignee: BEIJING YULESHIJIE EDUCATION TECHNOLOGY Co.,Ltd.
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2019-06-28

Abstract

This application discloses a data processing method and device for Chinese public sentiment. The method includes: determining term vectors in a preset domain; screening keywords from a preset dictionary according to the term vectors in the preset domain and generating keyword scores; obtaining text data source weights; and extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores. The application addresses the technical problem that Chinese public sentiment is processed poorly. It not only enables accurate extraction of Chinese public sentiment hotspots, but also allows hotspots to be tracked based on the extraction results, so that the tracking results are clearer and more accurate. In addition, the application can be deployed on either the server side or the client side, chosen according to actual business requirements.

Description

Data processing method and device for Chinese public sentiment

Technical field

This application relates to the field of public sentiment processing, and in particular to a data processing method and device for Chinese public sentiment.

Background

Currently, most media companies need to obtain public sentiment hotspots in real time so that they can set different brand services and media strategies according to those hotspots.

The inventors found that hotspot extraction for Chinese public sentiment is not accurate enough, and that the way public sentiment develops over time is also not presented clearly.

No effective solution has yet been proposed for the problem in the related art that Chinese public sentiment is processed poorly.

Summary of the invention

The main purpose of this application is to provide a data processing method and device for Chinese public sentiment, so as to solve the problem that Chinese public sentiment is processed poorly.

To achieve the above purpose, according to one aspect of this application, a data processing method for Chinese public sentiment is provided.

The data processing method for Chinese public sentiment according to this application includes: determining term vectors in a preset domain; screening keywords from a preset dictionary according to the term vectors in the preset domain and generating keyword scores; obtaining text data source weights; and extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores.

Further, screening keywords from the preset dictionary according to the term vectors in the preset domain and generating keyword scores includes: extracting keywords according to the word frequency in the text and the text length, and removing non-keywords by means of a preset non-keyword dictionary; merging, according to the term vectors, the keywords that occur in text data across the target network-wide scope to obtain homogeneous keywords; calculating the score of each homogeneous keyword according to the frequency and position at which it occurs in the text and the length of the text; and sorting by the scores of the homogeneous keywords to obtain, for the text, a list of homogeneous keywords with corresponding scores.

Further, obtaining the text data source weight includes: scoring the text data source according to prior knowledge and the average read count of the text data from that source, to obtain an initial text data source weight; and obtaining the number of hotspot occurrences in the text data source, so that the initial text data source weight is dynamically adjusted by that number.

Further, extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores includes: determining the text data source weight from the result of dynamically adjusting the initial text data source weight; determining the scores of keywords that occur in the text data with different wording but the same meaning; and obtaining, from the text data source weight and the keyword scores, the topic to which a text belongs and the heat value corresponding to that topic.

Further, after extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores, the method also includes a step of tracking the Chinese public sentiment hotspots, which includes: in the public sentiment database, adding up the heat values of keywords with the same meaning that occur in different text data to obtain a total score; determining, according to the total score, the Chinese public sentiment hotspot to be tracked; and extracting, from the text data across the target network-wide scope, the text data that contains a keyword with the same meaning as the tracked keyword and in which the heat value of that keyword is greater than a preset threshold.

To achieve the above purpose, according to another aspect of this application, a data processing device for Chinese public sentiment is provided.

The data processing device for Chinese public sentiment according to this application includes: a determining module, configured to determine term vectors in a preset domain; a screening module, configured to screen keywords from a preset dictionary according to the term vectors in the preset domain and generate keyword scores; an acquisition module, configured to obtain text data source weights; and an extraction module, configured to extract Chinese public sentiment hotspots from the text data source weights and the keyword scores.

Further, the screening module includes: an extraction unit, configured to extract keywords according to the word frequency in the text and the text length, and to remove non-keywords by means of a preset non-keyword dictionary; a merging unit, configured to merge, according to the term vectors, the keywords that occur in text data across the target network-wide scope to obtain homogeneous keywords; a score calculation unit, configured to calculate the score of each homogeneous keyword according to the frequency and position at which it occurs in the text and the length of the text; and a list unit, configured to sort by the scores of the homogeneous keywords to obtain, for the text, a list of homogeneous keywords with corresponding scores.

Further, the acquisition module includes: an initialization unit, configured to score the text data source according to prior knowledge and the average read count of the text data from that source, to obtain an initial text data source weight; and a dynamic adjustment unit, configured to obtain the number of hotspot occurrences in the text data source, so that the initial text data source weight is dynamically adjusted by that number.

Further, the extraction module includes: a weight unit, configured to determine the text data source weight from the result of dynamically adjusting the initial text data source weight; a homogeneous keyword unit, configured to determine the scores of keywords that occur in the text data with different wording but the same meaning; and a heat value calculation unit, configured to obtain, from the text data source weight and the keyword scores, the topic to which a text belongs and the heat value corresponding to that topic.

Further, the device also includes a tracking module, which includes: a score unit, configured to add up, in the public sentiment database, the heat values of keywords with the same meaning that occur in different text data to obtain a total score; a determination unit, configured to determine, according to the total score, the Chinese public sentiment hotspot to be tracked; and an extraction unit, configured to extract, from the text data across the target network-wide scope, the text data that contains a keyword with the same meaning as the tracked keyword and in which the heat value of that keyword is greater than a preset threshold.

In the embodiments of this application, term vectors in a preset domain are determined, keywords are screened from a preset dictionary according to those term vectors, and keyword scores are generated; by also obtaining text data source weights, Chinese public sentiment hotspots are extracted from the text data source weights and the keyword scores. This achieves the technical effect of precise extraction of public sentiment hotspots and thereby solves the technical problem that Chinese public sentiment is processed poorly.

Brief description of the drawings

The drawings, which form a part of this application, are provided for a further understanding of the application, so that its other features, objects, and advantages become more apparent. The drawings of the illustrative embodiments and their descriptions are used to explain the application and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flow diagram of the data processing method for Chinese public sentiment according to the first embodiment of this application;

Fig. 2 is a flow diagram of the data processing method for Chinese public sentiment according to the second embodiment of this application;

Fig. 3 is a flow diagram of the data processing method for Chinese public sentiment according to the third embodiment of this application;

Fig. 4 is a flow diagram of the data processing method for Chinese public sentiment according to the fourth embodiment of this application;

Fig. 5 is a flow diagram of the data processing method for Chinese public sentiment according to the fifth embodiment of this application;

Fig. 6 is a structural diagram of the data processing device for Chinese public sentiment according to the first embodiment of this application;

Fig. 7 is a structural diagram of the data processing device for Chinese public sentiment according to the second embodiment of this application;

Fig. 8 is a structural diagram of the data processing device for Chinese public sentiment according to the third embodiment of this application;

Fig. 9 is a structural diagram of the data processing device for Chinese public sentiment according to the fourth embodiment of this application;

Fig. 10 is a structural diagram of the data processing device for Chinese public sentiment according to the fifth embodiment of this application.

Detailed description of the embodiments

To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort shall fall within the scope of protection of this application.

It should be noted that the terms "include" and "have" and any variations of them in the description, the claims, and the above drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.

The data processing method for Chinese public sentiment in this application trains the term vectors of each specific domain separately on a training corpus from that domain; with term vectors trained specifically in this way, public sentiment hotspots in the domain can be extracted more accurately. In addition, a highly valuable non-keyword dictionary has been built up through long-term accumulation, which helps to solve the problem of redundant or interfering information appearing during hotspot extraction. Furthermore, a method for calculating information source weights, together with the corresponding weights, is established for whole-network information, so that hotspot extraction and tracking results are clearer and more accurate.

It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments can be combined with each other. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.

As shown in Fig. 1, the method includes the following steps S102 to S108:

Step S102: determine the term vectors in the preset domain;

The preset domain can be a designated domain or a specific domain. A designated domain can be determined according to different user requirements, and a specific domain can be determined according to the characteristics of different industries. The embodiments of this application do not limit this, as long as the requirement of determining the domain is met.

A term vector is the vector representation of a word in natural language processing.

Specifically, when determining the term vectors in a specific domain, a corpus from that domain is used as training data to obtain a term vector model. While the term vector model is being tested, experts or long-term practitioners in the specific domain are invited to compare the training results under different parameters and then determine the two key training parameters, the term vector dimension and the window size (alternatively, the results of machine training are used), which finally yields the term vectors of the specific domain.

Training the term vectors of each domain separately on a training corpus from that specific domain allows the public sentiment hotspots of the domain to be extracted more accurately.
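The application does not name a training algorithm for the term vectors. As a stand-in only, the following minimal sketch builds count-based co-occurrence vectors from a tokenized domain corpus, to show where the two parameters mentioned above (dimension and window size) enter. The function name `train_cooccurrence_vectors` and the parameters `window` and `dim` are assumptions made for illustration; a real system would more likely use a trained embedding model.

```python
from collections import Counter


def train_cooccurrence_vectors(sentences, window=2, dim=4):
    """Build count-based term vectors from tokenized domain sentences.

    window: number of neighbours on each side counted as context.
    dim: number of most frequent corpus terms kept as vector axes.
    Both names are illustrative assumptions, not taken from the source.
    """
    freq = Counter(tok for sent in sentences for tok in sent)
    axes = [term for term, _ in freq.most_common(dim)]
    axis_index = {term: i for i, term in enumerate(axes)}
    # Every term in the corpus gets a vector, even if it stays all zeros.
    vectors = {term: [0.0] * len(axes) for term in freq}
    for sent in sentences:
        for i, tok in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in axis_index:
                    vectors[tok][axis_index[sent[j]]] += 1.0
    return vectors


corpus = [["stock", "rises", "gain"], ["stock", "falls", "loss"]]
vectors = train_cooccurrence_vectors(corpus, window=1, dim=3)
```

Comparing the vectors produced under several `window` and `dim` settings is one way the parameter comparison described above could be organized.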

Step S104: screen keywords from the preset dictionary according to the term vectors in the preset domain and generate keyword scores;

According to the term vectors in the preset domain obtained in the above step, the keywords can be screened from the preset dictionary, and the keyword scores are generated after the obtained keywords are scored.

The preset dictionary can be selected according to the specific domain. Specifically, when the preset dictionary is configured as a non-keyword dictionary, it can be used to solve the problem of redundant or interfering information appearing while Chinese public sentiment hotspots are extracted.

Step S106: obtain the text data source weight;

The text data source weight serves as a parameter for calculating the heat value and needs to be obtained in advance. It should be noted that the embodiments of this application do not specifically limit the text data source weight, as long as the condition for obtaining the text data source weight is met.

Step S108: extract Chinese public sentiment hotspots from the text data source weights and the keyword scores.

From the text data source weights and the corresponding keyword scores obtained in the above steps, Chinese public sentiment hotspots can be extracted accurately. The Chinese public sentiment hotspots may include the hot topics to which multiple text data belong.

As the above description shows, this application achieves the technical effect of precise extraction of public sentiment hotspots.

According to an embodiment of this application, preferably, as shown in Fig. 2, screening keywords from the preset dictionary according to the term vectors in the preset domain and generating keyword scores includes:

Step S202: extract keywords according to the word frequency in the text and the text length, and remove non-keywords by means of a preset non-keyword dictionary;

The word frequency in the text refers to how often a keyword occurs in the text. Specifically, the text can be an article, and keywords are extracted according to how often a term occurs in the article and the length of the article.

Non-keywords can be removed from the extracted keywords by means of the preset non-keyword dictionary. Specifically, the extracted stop words are rejected by checking them against the non-keyword dictionary.

It should also be noted that, to keep the rejection effective, the non-keyword dictionary needs to be updated dynamically according to the operating results. This application does not specifically limit how it is updated, as long as the update conditions of the non-keyword dictionary are met.
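Step S202 can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented into tokens and that "word frequency and text length" is combined as a simple relative frequency; the function name `extract_keywords` and the cutoff `min_rate` are hypothetical and do not come from the application.

```python
from collections import Counter


def extract_keywords(tokens, non_keywords, min_rate=0.05):
    """Return {term: relative frequency} for candidate keywords.

    A term is kept when its count divided by the text length reaches
    min_rate (an assumed cutoff) and it is absent from the
    non-keyword dictionary.
    """
    length = len(tokens)
    if length == 0:
        return {}
    counts = Counter(tokens)
    return {
        term: count / length
        for term, count in counts.items()
        if term not in non_keywords and count / length >= min_rate
    }


tokens = ["的", "股市", "上涨", "的", "股市", "了"]
non_keyword_dictionary = {"的", "了"}
keywords = extract_keywords(tokens, non_keyword_dictionary, min_rate=0.2)
```

Because the non-keyword dictionary is passed in as a plain set, it can be replaced or extended between runs, which matches the dynamic updating described above.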

Step S204: merge, according to the term vectors, the keywords that occur in text data across the target network-wide scope to obtain homogeneous keywords;

Homogeneous keywords are obtained by merging, according to the term vectors, the keywords of the text data across the target network-wide scope that were extracted in the previous step.

Specifically, homogeneous keywords are words whose wording differs but whose meanings are similar. Term vectors make it possible to merge homogeneous keywords. For example, the keywords "working" and "work" are homogeneous keywords and need to be merged.

It should be noted that the embodiments of this application do not limit the concrete way in which term vectors are used to merge homogeneous keywords, as long as the merging condition is met.
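Since the concrete merging method is left open, one possible approach is sketched here: keywords whose term vectors have a cosine similarity above a threshold are mapped to a shared representative. The greedy grouping, the `threshold` value, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_homogeneous(keywords, vectors, threshold=0.8):
    """Map each keyword to the first-seen representative of its group."""
    representatives = []
    mapping = {}
    for keyword in keywords:
        for rep in representatives:
            if cosine(vectors[keyword], vectors[rep]) >= threshold:
                mapping[keyword] = rep
                break
        else:
            # No existing group is close enough, so start a new one.
            representatives.append(keyword)
            mapping[keyword] = keyword
    return mapping


toy_vectors = {"working": [1.0, 0.1], "work": [0.9, 0.2], "market": [0.0, 1.0]}
mapping = merge_homogeneous(["working", "work", "market"], toy_vectors)
```

With these toy vectors, "work" is folded into the group represented by "working", while "market" stays in its own group.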

Step S206: calculate the score of each homogeneous keyword according to the frequency and position at which it occurs in the text and the length of the text;

The score of each homogeneous keyword can be calculated from the frequency with which it occurs in the text and the positions at which it occurs, combined with the length of the text.

Specifically, the text can be an article. The homogeneous keyword score is calculated from the frequency and positions at which the homogeneous keyword occurs in the article and from the length of the article. For example, the position in the article can be at the beginning, in the middle, or at the end.

Step S208: sort by the scores of the homogeneous keywords to obtain, for the text, a list of homogeneous keywords with corresponding scores.

The sorted list of homogeneous keywords with corresponding scores for the text can be visualized.

Specifically, the text can be an article. The above calculation yields, for each article, a list of homogeneous keywords and their corresponding scores.
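The application does not give a scoring formula for steps S206 and S208. The sketch below shows one way frequency, position, and text length could be combined: each occurrence contributes a position-dependent weight, and the sum is divided by the text length. The weights for the beginning, middle, and end of the article and the 20% boundaries are assumed values, and `mapping` is a hypothetical keyword-to-representative table such as the one produced by a merging step.

```python
def position_weight(index, length, head=1.5, body=1.0, tail=1.2):
    """Assumed weights: first 20% of the text, last 20%, and the rest."""
    ratio = index / length
    if ratio < 0.2:
        return head
    if ratio >= 0.8:
        return tail
    return body


def score_keywords(tokens, mapping):
    """Return [(homogeneous keyword, score)] sorted by descending score."""
    length = len(tokens)
    scores = {}
    for index, token in enumerate(tokens):
        representative = mapping.get(token)
        if representative is None:
            continue  # not a keyword
        weight = position_weight(index, length)
        scores[representative] = scores.get(representative, 0.0) + weight / length
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


article = ["work", "a", "b", "c", "d", "market", "e", "f", "g", "working"]
mapping = {"work": "work", "working": "work", "market": "market"}
ranked = score_keywords(article, mapping)
```

Here both "work" and "working" count towards the same homogeneous keyword, so it ranks above "market", which occurs once in the middle of the article.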

In the above steps, homogeneous keywords are extracted by combining conventional machine learning, the non-keyword dictionary, and the term vectors of the specific domain, so that keywords and their scores can be obtained accurately across multiple text data.

According to an embodiment of this application, preferably, as shown in Fig. 3, obtaining the text data source weight includes:

Step S302: score the text data source according to prior knowledge and the average read count of the text data from that source, to obtain an initial text data source weight;

The prior knowledge can be knowledge known to experts or technical personnel in the specific domain.

Scoring the text data source according to prior knowledge requires collecting the scoring results of experts or technical personnel in the specific domain over a relatively large data set.

The average read count of the text data is only one preferred implementation in this embodiment; the embodiments of this application do not specifically limit it, as long as the scoring requirement is met.

Specifically, when determining the weight of the text data source, the weight is cold-started (the weight data is built from scratch) from the scores given by domain experts and the average read count of the text data source.

Step S304: obtain the number of hotspot occurrences in the text data source, so that the initial text data source weight is dynamically adjusted by that number.

The more hotspot occurrences there are in the text data source, the larger the weight of that text data source.
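The two-stage weighting in steps S302 and S304 can be illustrated with a minimal sketch. The application only states which inputs are used and that more hotspot occurrences mean a larger weight; the logarithmic damping of read counts, the linear adjustment, and the `rate` parameter below are assumptions.

```python
import math


def initial_source_weight(expert_score, average_reads):
    """Cold-start weight from an expert score and the average read count.

    The log10 damping is an assumed choice so that very large read
    counts do not dominate the expert score.
    """
    return expert_score * math.log10(1 + average_reads)


def adjust_source_weight(initial_weight, hotspot_count, rate=0.1):
    """Raise the weight as the source produces more hotspots (assumed linear)."""
    return initial_weight * (1 + rate * hotspot_count)


base = initial_source_weight(expert_score=0.8, average_reads=999)
adjusted = adjust_source_weight(base, hotspot_count=5)
```

Any monotonically increasing adjustment would satisfy the stated rule; the linear form is used here only because it is the simplest.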

According to an embodiment of this application, preferably, as shown in Fig. 4, extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores includes:

Step S402: determine the text data source weight from the result of dynamically adjusting the initial text data source weight;

The weight of the text data source can be determined from the result of dynamically adjusting the initial text data source weight.

Step S404: determine the scores of keywords that occur in the text data with different wording but the same meaning;

A score for the keyword can be obtained from the keywords that occur in the text data with different wording but the same meaning. Preferably, such keywords are homogeneous keywords that have already been screened against the non-keyword dictionary.

Step S406: obtain, from the text data source weight and the keyword scores, the topic to which a text belongs and the heat value corresponding to that topic.

From the text data source weight and the keyword scores, the topic to which the text belongs and the heat value corresponding to that topic can be obtained. Specifically, for example, the list of homogeneous keywords can serve as topic labels, so that the topic to which the text belongs and the corresponding heat value can be displayed.
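How the source weight and keyword scores combine into a heat value is not spelled out in the application. A minimal sketch, assuming the heat of a topic is the sum over texts of source weight multiplied by keyword score, with the homogeneous keyword used as the topic label, could look like this. The record layout (`source`, `keyword_scores`) and the default weight of 1.0 for unknown sources are hypothetical.

```python
from collections import defaultdict


def extract_hotspots(articles, source_weights, top_n=3):
    """Return the top_n (topic, heat value) pairs, highest heat first."""
    heat = defaultdict(float)
    for article in articles:
        # Unknown sources fall back to a neutral weight (an assumption).
        weight = source_weights.get(article["source"], 1.0)
        for keyword, score in article["keyword_scores"].items():
            heat[keyword] += weight * score
    ranked = sorted(heat.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


articles = [
    {"source": "A", "keyword_scores": {"work": 0.2, "market": 0.1}},
    {"source": "B", "keyword_scores": {"market": 0.3}},
]
hotspots = extract_hotspots(articles, {"A": 2.0, "B": 1.0})
```

In this example "market" ends up hotter than "work" because it appears in texts from both sources, even though source A carries the larger weight.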

According to an embodiment of this application, preferably, as shown in Fig. 5, after Chinese public sentiment hotspots are extracted from the text data source weights and the keyword scores, the method also includes a step of tracking the Chinese public sentiment hotspots, which includes:

Step S502: in the public sentiment database, add up the heat values of keywords with the same meaning that occur in different text data to obtain a total score;

The public sentiment database can be a Chinese whole-network database built up over a long period, and it supports real-time cleaning and dynamic updating. The content stored in the public sentiment database includes the data source, time, text content, domain, and so on.

It should be noted that the embodiments of this application do not specifically limit the public sentiment database, as long as the requirements for building and maintaining it are met.

In the public sentiment database, the heat values of keywords with the same meaning that occur in different text data are added up to obtain the total score.

Step S504: determine, according to the total score, the Chinese public sentiment hotspot to be tracked;

The Chinese public sentiment hotspot to be tracked (according to customer demand or industry demand) is determined from the total score obtained in the previous step.

Step S506: extract, from the text data across the target network-wide scope, the text data that contains a keyword with the same meaning as the tracked keyword and in which the heat value of that keyword is greater than a preset threshold.

The preset threshold can be selected according to the specific domain and is not specifically limited in the embodiments of this application.

Specifically, all data in the designated domain and within the specified time range is read from the public sentiment database, the heat analysis algorithm is used to obtain the homogeneous keyword list and scores of each article, the heat values of identical homogeneous keywords in different articles are added up, and the hotspots that need to be tracked within the time range are determined according to the resulting scores.

The designated domain and the specified time range can be determined as required, for example the results in a specific domain going back one hour from the current time.

Articles across the whole network that share the same homogeneous keyword and whose article heat value is greater than the specified threshold are extracted and arranged by time tag to obtain the public sentiment tracking result, which is then visualized.

When the heat value of an article is obtained, the heat value produced by the particular heat calculation method used also needs to be decayed according to the length of time from the current time.
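The tracking step can be sketched as a filter followed by a sort on the time tag. The application says heat values decay with distance from the current time but does not give the decay law; the exponential half-life used below, the record fields (`title`, `keywords`, `heat`, `time_hours`), and all parameter values are assumptions made for illustration.

```python
def decayed_heat(heat, age_hours, half_life_hours=24.0):
    """Halve the heat value every half_life_hours (assumed decay law)."""
    return heat * 0.5 ** (age_hours / half_life_hours)


def track_hotspot(articles, keyword, threshold, now_hours):
    """Return (time, title, decayed heat) of matching articles, oldest first."""
    selected = []
    for article in articles:
        if keyword not in article["keywords"]:
            continue
        age = now_hours - article["time_hours"]
        heat = decayed_heat(article["heat"], age)
        if heat > threshold:
            selected.append((article["time_hours"], article["title"], heat))
    return sorted(selected)  # tuples sort by time tag first


articles = [
    {"title": "t2", "keywords": {"work"}, "heat": 8.0, "time_hours": 24.0},
    {"title": "t1", "keywords": {"work"}, "heat": 8.0, "time_hours": 0.0},
    {"title": "t3", "keywords": {"market"}, "heat": 9.0, "time_hours": 47.0},
]
timeline = track_hotspot(articles, "work", threshold=1.0, now_hours=48.0)
```

Raising `threshold` to 3.0 in this example would drop the older article, whose heat has decayed to 2.0, and keep only the more recent one.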

It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one given here.

According to an embodiment of this application, a data processing device for Chinese public sentiment for implementing the above method is also provided. As shown in Fig. 6, the device includes: a determining module 10, configured to determine term vectors in a preset domain; a screening module 20, configured to screen keywords from a preset dictionary according to the term vectors in the preset domain and generate keyword scores; an acquisition module 30, configured to obtain text data source weights; and an extraction module 40, configured to extract Chinese public sentiment hotspots from the text data source weights and the keyword scores.

The preset domain in the determining module 10 of this embodiment can be a designated domain or a specific domain. A designated domain can be determined according to different user requirements, and a specific domain can be determined according to the characteristics of different industries. The embodiments of this application do not limit this, as long as the requirement of determining the domain is met.

In the screening module 20 of this embodiment, according to the term vectors in the preset domain obtained in the above step, the keywords can be screened from the preset dictionary, and the keyword scores are generated after the obtained keywords are scored.

In the acquisition module 30 of this embodiment, the text data source weight serves as a parameter for calculating the heat value and needs to be obtained in advance.

It should be noted that the embodiments of this application do not specifically limit the text data source weight, as long as the condition for obtaining the text data source weight is met.

In the extraction module 40 of this embodiment, Chinese public sentiment hotspots can be extracted accurately from the text data source weights and the corresponding keyword scores obtained in the above steps. The Chinese public sentiment hotspots may include the hot topics to which multiple text data belong.

According to an embodiment of this application, preferably, as shown in Fig. 7, the screening module 20 includes: an extraction unit 201, configured to extract keywords according to the word frequency in the text and the text length, and to remove non-keywords by means of a preset non-keyword dictionary; a merging unit 202, configured to merge, according to the term vectors, the keywords that occur in text data across the target network-wide scope to obtain homogeneous keywords; a score calculation unit 203, configured to calculate the score of each homogeneous keyword according to the frequency and position at which it occurs in the text and the length of the text; and a list unit 204, configured to sort by the scores of the homogeneous keywords to obtain, for the text, a list of homogeneous keywords with corresponding scores.

In the extraction unit 201 of this embodiment, the word frequency in the text refers to how often a keyword occurs in the text. Specifically, the text can be an article, and keywords are extracted according to how often a term occurs in the article and the length of the article.

In the merging unit 202 of this embodiment, homogeneous keywords are obtained by merging, according to the term vectors, the keywords of the text data across the target network-wide scope that were extracted in the previous step.

In the score calculation unit 203 of this embodiment, the score of each homogeneous keyword can be calculated from the frequency with which it occurs in the text and the positions at which it occurs, combined with the length of the text.

In the list unit 204 of this embodiment, the sorted list of homogeneous keywords with corresponding scores for the text can be visualized.

According to an embodiment of this application, preferably, as shown in Fig. 8, the acquisition module 30 includes: an initialization unit 301, configured to score the text data source according to prior knowledge and the average read count of the text data from that source, to obtain an initial text data source weight; and a dynamic adjustment unit 302, configured to obtain the number of hotspot occurrences in the text data source, so that the initial text data source weight is dynamically adjusted by that number.

In the initialization unit 301 of this embodiment, the prior knowledge can be knowledge known to experts or technical personnel in the specific domain.

In the dynamic adjustment unit 302 of this embodiment, the more hotspot occurrences there are in the text data source, the larger the weight of that text data source.

According to an embodiment of this application, preferably, as shown in Fig. 9, the extraction module 40 includes: a weight unit 401, configured to determine the text data source weight from the result of dynamically adjusting the initial text data source weight; a homogeneous keyword unit 402, configured to determine the scores of keywords that occur in the text data with different wording but the same meaning; and a heat value calculation unit 403, configured to obtain, from the text data source weight and the keyword scores, the topic to which a text belongs and the heat value corresponding to that topic.

In the weight unit 401 of this embodiment, the weight of the text data source can be determined from the result of dynamically adjusting the initial text data source weight.

In the homogeneous keyword unit 402 of this embodiment, a score for the keyword can be obtained from the keywords that occur in the text data with different wording but the same meaning. Preferably, such keywords are homogeneous keywords that have already been screened against the non-keyword dictionary.

In the heat value calculation unit 403 of this embodiment, the topic to which the text belongs and the heat value corresponding to that topic can be obtained from the text data source weight and the keyword scores. Specifically, for example, the list of homogeneous keywords can serve as topic labels, so that the topic to which the text belongs and the corresponding heat value can be displayed.

According to an embodiment of this application, preferably, as shown in Fig. 10, the device also includes a tracking module 50, which includes: a score unit 501, configured to add up, in the public sentiment database, the heat values of keywords with the same meaning that occur in different text data to obtain a total score; a determination unit 502, configured to determine, according to the total score, the Chinese public sentiment hotspot to be tracked; and an extraction unit 503, configured to extract, from the text data across the target network-wide scope, the text data that contains a keyword with the same meaning as the tracked keyword and in which the heat value of that keyword is greater than a preset threshold.

The public sentiment database used by the score unit 501 of the embodiment of the present application may be a Chinese whole-network database built up over a long period, which can be cleaned in real time and updated dynamically. The content stored in the public sentiment database includes the data source, time, text content, field, and so on.

In the determination unit 502 of the embodiment of the present application, the Chinese public sentiment hot spot to be tracked (according to client demand or industry demand) is determined from the total score obtained in the previous step.

The extraction unit 503 of the embodiment of the present application extracts, from the text data across the target network-wide scope, the text data that contains a keyword with the same meaning as the keyword and in which the hot value of that keyword is greater than a preset threshold.

Articles in the whole-network data that share the same homogeneity keyword, and in which the hot value of that homogeneity keyword is greater than the specified threshold, are extracted and arranged by time tag to obtain the public sentiment tracking result, which is then visualized.
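The tracking steps can be sketched as follows. The `Article` record, its field names, and the use of ISO 8601 date strings as time tags are assumptions made for illustration; the application specifies only that hot values are summed per keyword, compared against a threshold, and that the selected articles are ordered by time tag.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    timestamp: str                 # ISO 8601 string, so string order equals time order
    hot_values: dict[str, float]   # homogeneity keyword -> hot value in this article


def total_scores(articles: list[Article]) -> dict[str, float]:
    """Add up the hot values of each homogeneity keyword across all articles."""
    totals: dict[str, float] = defaultdict(float)
    for article in articles:
        for keyword, hot_value in article.hot_values.items():
            totals[keyword] += hot_value
    return dict(totals)


def track(articles: list[Article], keyword: str, threshold: float) -> list[Article]:
    """Keep articles whose hot value for `keyword` exceeds the threshold,
    arranged by time tag (oldest first)."""
    hits = [a for a in articles if a.hot_values.get(keyword, 0.0) > threshold]
    return sorted(hits, key=lambda a: a.timestamp)
```

In this sketch, the keyword with the largest total score would typically be chosen as the hot spot to track, and the list returned by `track` is the time-ordered material for visualization.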

Obviously, those skilled in the art should understand that each module or step of the present application described above may be implemented by a general-purpose computing device. The modules or steps may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

The foregoing is merely the preferred embodiments of the present application and is not intended to limit the application. For those skilled in the art, various modifications and changes may be made to the application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims

1. A data processing method for Chinese public sentiment, characterized by comprising:

determining the term vectors in a default field;

screening out keywords from a default dictionary according to the term vectors in the default field and generating a keyword score;

obtaining a text data source weight; and

extracting a Chinese public sentiment hot spot by means of the text data source weight and the keyword score.

2. The data processing method for Chinese public sentiment according to claim 1, characterized in that screening out keywords from the default dictionary according to the term vectors in the default field and generating the keyword score comprises:

extracting keywords according to the word frequency occurring in a text and the text length, and removing non-keywords by means of a default non-keyword dictionary;

merging, according to the term vectors, the keywords occurring in the text data across the target network-wide scope to obtain homogeneity keywords;

calculating the score of each homogeneity keyword according to the frequency and position at which the homogeneity keyword occurs in the text and the length of the text; and

arranging, according to the scores of the homogeneity keywords, a list of the homogeneity keywords in the text with their corresponding scores.

3. The data processing method for Chinese public sentiment according to claim 1, characterized in that obtaining the text data source weight comprises:

scoring the text data source according to prior knowledge and the average reading volume of the text data in the text data source, to obtain an initialized text data source weight; and

obtaining the number of hot spot occurrences in the text data source, so as to dynamically adjust the initialized text data source weight by the number of hot spot occurrences in the text data source.

4. The data processing method for Chinese public sentiment according to claim 1, characterized in that extracting the Chinese public sentiment hot spot by means of the text data source weight and the keyword score comprises:

determining the text data source weight from the result of dynamically adjusting the initialized text data source weight;

determining the score of keywords that occur in the text data with different wording but the same meaning; and

obtaining, according to the text data source weight and the keyword score, the topic to which a text belongs and the hot value corresponding to that topic.

5. The data processing method for Chinese public sentiment according to claim 1, characterized in that, after extracting the Chinese public sentiment hot spot by means of the text data source weight and the keyword score, the method further comprises a step of tracking the Chinese public sentiment hot spot, the step of tracking the Chinese public sentiment hot spot comprising:

adding up, in a public sentiment database, the hot values of keywords with the same meaning that occur in different text data, to obtain a total score;

determining, according to the total score, the Chinese public sentiment hot spot to be tracked; and

extracting, from the text data across the target network-wide scope, the text data that contains a keyword with the same meaning as the keyword and in which the hot value of that keyword is greater than a preset threshold.

6. A data processing device for Chinese public sentiment, characterized by comprising:

a determining module, configured to determine the term vectors in a default field;

a screening module, configured to screen out keywords from a default dictionary according to the term vectors in the default field and generate a keyword score;

an obtaining module, configured to obtain a text data source weight; and

an extraction module, configured to extract a Chinese public sentiment hot spot by means of the text data source weight and the keyword score.

7. The data processing device for Chinese public sentiment according to claim 6, characterized in that the screening module comprises:

an extraction unit, configured to extract keywords according to the word frequency occurring in a text and the text length, and to remove non-keywords by means of a default non-keyword dictionary;

a merging unit, configured to merge, according to the term vectors, the keywords occurring in the text data across the target network-wide scope to obtain homogeneity keywords;

a score calculation unit, configured to calculate the score of each homogeneity keyword according to the frequency and position at which the homogeneity keyword occurs in the text and the length of the text; and

a list unit, configured to arrange, according to the scores of the homogeneity keywords, a list of the homogeneity keywords in the text with their corresponding scores.

8. The data processing device for Chinese public sentiment according to claim 6, characterized in that the obtaining module comprises:

an initialization unit, configured to score the text data source according to prior knowledge and the average reading volume of the text data in the text data source, to obtain an initialized text data source weight; and

a dynamic adjustment unit, configured to obtain the number of hot spot occurrences in the text data source, so as to dynamically adjust the initialized text data source weight by the number of hot spot occurrences in the text data source.

9. The data processing device for Chinese public sentiment according to claim 6, characterized in that the extraction module comprises:

a weight unit, configured to determine the text data source weight from the result of dynamically adjusting the initialized text data source weight;

a homogeneity keyword unit, configured to determine the score of keywords that occur in the text data with different wording but the same meaning; and

a hot value calculation unit, configured to obtain, according to the text data source weight and the keyword score, the topic to which a text belongs and the hot value corresponding to that topic.

10. The data processing device for Chinese public sentiment according to claim 6, characterized by further comprising a tracking module, the tracking module comprising:

a score unit, configured to add up, in a public sentiment database, the hot values of keywords with the same meaning that occur in different text data, to obtain a total score;

a determination unit, configured to determine, according to the total score, the Chinese public sentiment hot spot to be tracked; and

an extraction unit, configured to extract, from the text data across the target network-wide scope, the text data that contains a keyword with the same meaning as the keyword and in which the hot value of that keyword is greater than a preset threshold.