CN109948161A - Data processing method and device for Chinese public sentiment - Google Patents
- Publication number
- CN109948161A (CN201910213894.5A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- text data
- public sentiment
- data source
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a data processing method and device for Chinese public sentiment. The method includes: determining word vectors in a preset domain; screening keywords out of a preset lexicon according to the word vectors in the preset domain and generating keyword scores; obtaining text data source weights; and extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores. The application addresses the technical problem that Chinese public sentiment is processed poorly. It not only extracts Chinese public sentiment hotspots accurately, but also tracks hotspots based on the extraction results, so that the tracking results are clearer and more accurate. In addition, the application can be deployed on either the server side or the client side, depending on actual business requirements.
Description
Technical field
The application relates to the field of public sentiment processing, and in particular to a data processing method and device for Chinese public sentiment.
Background
At present, most media companies need to obtain public sentiment hotspots in real time so that they can set their brand services and media strategies according to those hotspots.
The inventors found that hotspot extraction for Chinese public sentiment has poor accuracy, and that the way a public sentiment event develops is not presented clearly either.
No effective solution has yet been proposed for the poor processing of Chinese public sentiment in the related art.
Summary of the invention
The main purpose of the application is to provide a data processing method and device for Chinese public sentiment, so as to solve the problem that Chinese public sentiment is processed poorly.
To achieve this purpose, according to one aspect of the application, a data processing method for Chinese public sentiment is provided.
The data processing method for Chinese public sentiment according to the application includes: determining word vectors in a preset domain; screening keywords out of a preset lexicon according to the word vectors in the preset domain and generating keyword scores; obtaining text data source weights; and extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores.
Further, screening keywords out of the preset lexicon according to the word vectors in the preset domain and generating keyword scores includes: extracting keywords according to the word frequency in a text and the length of the text, and removing non-keywords by means of a preset non-keyword lexicon; merging, according to the word vectors, the keywords that occur in the text data within the target whole-network scope to obtain homogeneous keywords; calculating the score of each homogeneous keyword according to the frequency and position with which it occurs in the text and the length of the text; and arranging, according to the scores of the homogeneous keywords, a list of the homogeneous keywords in the text together with their corresponding scores.
Further, obtaining the text data source weights includes: scoring a text data source according to prior knowledge and the average read count of the text data from that source to obtain an initial text data source weight; and obtaining the number of times hotspots appear in the text data source, so that the initial text data source weight is dynamically adjusted according to that number.
Further, extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores includes: determining the text data source weight from the result of dynamically adjusting the initial text data source weight; determining the scores of keywords in the text data that differ in wording but share the same meaning; and obtaining, according to the text data source weight and the keyword scores, the topic to which a text belongs and the heat value of that topic.
Further, after Chinese public sentiment hotspots are extracted from the text data source weights and the keyword scores, the method also includes a step of tracking the Chinese public sentiment hotspots, which includes: in a public sentiment database, adding up the heat values of keywords with the same meaning that occur in different text data to obtain a total score; determining, according to the total score, the Chinese public sentiment hotspot to be tracked; and extracting, from the text data within the target whole-network scope, the text data that contains a keyword with the same meaning as that keyword and in which the heat value of the keyword is greater than a preset threshold.
To achieve the above purpose, according to another aspect of the application, a data processing device for Chinese public sentiment is provided.
The data processing device for Chinese public sentiment according to the application includes: a determining module, configured to determine word vectors in a preset domain; a screening module, configured to screen keywords out of a preset lexicon according to the word vectors in the preset domain and generate keyword scores; an obtaining module, configured to obtain text data source weights; and an extraction module, configured to extract Chinese public sentiment hotspots from the text data source weights and the keyword scores.
Further, the screening module includes: an extraction unit, configured to extract keywords according to the word frequency in a text and the length of the text, and to remove non-keywords by means of a preset non-keyword lexicon; a merging unit, configured to merge, according to the word vectors, the keywords that occur in the text data within the target whole-network scope to obtain homogeneous keywords; a score calculation unit, configured to calculate the score of each homogeneous keyword according to the frequency and position with which it occurs in the text and the length of the text; and a list unit, configured to arrange, according to the scores of the homogeneous keywords, a list of the homogeneous keywords in the text together with their corresponding scores.
Further, the obtaining module includes: an initialization unit, configured to score a text data source according to prior knowledge and the average read count of the text data from that source to obtain an initial text data source weight; and a dynamic adjustment unit, configured to obtain the number of times hotspots appear in the text data source, so that the initial text data source weight is dynamically adjusted according to that number.
Further, the extraction module includes: a weight unit, configured to determine the text data source weight from the result of dynamically adjusting the initial text data source weight; a homogeneous keyword unit, configured to determine the scores of keywords in the text data that differ in wording but share the same meaning; and a heat value calculation unit, configured to obtain, according to the text data source weight and the keyword scores, the topic to which a text belongs and the heat value of that topic.
Further, the device also includes a tracking module, which includes: a score unit, configured to add up, in a public sentiment database, the heat values of keywords with the same meaning that occur in different text data to obtain a total score; a determination unit, configured to determine, according to the total score, the Chinese public sentiment hotspot to be tracked; and an extraction unit, configured to extract, from the text data within the target whole-network scope, the text data that contains a keyword with the same meaning as that keyword and in which the heat value of the keyword is greater than a preset threshold.
In the embodiments of the application, word vectors in a preset domain are determined, and keywords are screened out of a preset lexicon according to those word vectors and given keyword scores. By obtaining text data source weights, Chinese public sentiment hotspots are extracted from the text data source weights and the keyword scores. This achieves the technical effect of extracting public sentiment hotspots precisely, and thereby solves the technical problem that Chinese public sentiment is processed poorly.
Brief description of the drawings
The accompanying drawings, which form a part of the application, are provided for further understanding of the application, so that its other features, objects, and advantages become more apparent. The drawings of the illustrative embodiments and their descriptions are used to explain the application and do not constitute an improper limitation on it. In the drawings:
Fig. 1 is a flowchart of the data processing method for Chinese public sentiment according to the first embodiment of the application;
Fig. 2 is a flowchart of the data processing method for Chinese public sentiment according to the second embodiment of the application;
Fig. 3 is a flowchart of the data processing method for Chinese public sentiment according to the third embodiment of the application;
Fig. 4 is a flowchart of the data processing method for Chinese public sentiment according to the fourth embodiment of the application;
Fig. 5 is a flowchart of the data processing method for Chinese public sentiment according to the fifth embodiment of the application;
Fig. 6 is a structural diagram of the data processing device for Chinese public sentiment according to the first embodiment of the application;
Fig. 7 is a structural diagram of the data processing device for Chinese public sentiment according to the second embodiment of the application;
Fig. 8 is a structural diagram of the data processing device for Chinese public sentiment according to the third embodiment of the application;
Fig. 9 is a structural diagram of the data processing device for Chinese public sentiment according to the fourth embodiment of the application;
Fig. 10 is a structural diagram of the data processing device for Chinese public sentiment according to the fifth embodiment of the application.
Detailed description
To help those skilled in the art better understand the solution of the application, the technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some of the embodiments of the application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the application, without creative effort, fall within the scope of protection of the application.
It should be noted that the terms "include" and "have" in the description, the claims, and the above drawings of the application, and any variants of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units that are explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
In the data processing method for Chinese public sentiment of the application, word vectors are trained separately for each specific domain on that domain's own training corpus, and these specially trained word vectors allow the public sentiment hotspots of the domain to be extracted more accurately. In addition, a valuable non-keyword lexicon has been built up through long-term accumulation, which helps to solve the problem of redundant or interfering information appearing during hotspot extraction. Finally, a method for calculating information source weights, together with the corresponding weights, is established for whole-network information, so that hotspot extraction and tracking results are clearer and more accurate.
It should be noted that, where no conflict arises, the embodiments of the application and the features in those embodiments can be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As shown in Fig. 1, the method includes the following steps S102 to S108:
Step S102: determine word vectors in a preset domain.
The preset domain may be a designated domain or a specific domain. A designated domain may be determined by the needs of different users, and a specific domain by the characteristics of different industries. The embodiments of the application place no limitation on this, as long as the domain can be determined.
A word vector is the vector representation of a word in natural language processing.
Specifically, to determine the word vectors in a specific domain, a corpus from that domain is used as training data to obtain a word vector model. While the model is being tested, experts or long-term practitioners in the domain are invited to compare the training results obtained with different parameters and then decide the two key training parameters, word vector dimension and window size (alternatively, the results of machine training may be used). The word vectors of the specific domain are obtained in this way.
Training word vectors separately for each domain on that domain's own corpus allows the public sentiment hotspots of the domain to be extracted more accurately.
Step S104: screen keywords out of a preset lexicon according to the word vectors in the preset domain and generate keyword scores.
According to the word vectors in the preset domain obtained in the step above, the keywords can be screened out of the preset lexicon, and the keyword scores are then generated by scoring the keywords obtained.
The preset lexicon may be selected according to the specific domain. Specifically, when the preset lexicon is configured as a non-keyword lexicon, it can be used to solve the problem of redundant or interfering information appearing while Chinese public sentiment hotspots are extracted.
Step S106: obtain text data source weights.
The text data source weight is a parameter used in calculating heat values and needs to be obtained in advance. It should be noted that the embodiments of the application place no specific limitation on the text data source weight, as long as it can be obtained.
Step S108: extract Chinese public sentiment hotspots from the text data source weights and the keyword scores.
With the text data source weights and the corresponding keyword scores obtained in the steps above, Chinese public sentiment hotspots can be extracted accurately. The Chinese public sentiment hotspots may include the hot topics to which multiple pieces of text data belong.
As can be seen from the above description, the application achieves the following technical effects:
In the embodiments of the application, word vectors in a preset domain are determined, and keywords are screened out of a preset lexicon according to those word vectors and given keyword scores. By obtaining text data source weights, Chinese public sentiment hotspots are extracted from the text data source weights and the keyword scores. This achieves the technical effect of extracting public sentiment hotspots precisely, and thereby solves the technical problem that Chinese public sentiment is processed poorly.
According to an embodiment of the application, as a preferred option in this embodiment and as shown in Fig. 2, screening keywords out of the preset lexicon according to the word vectors in the preset domain and generating keyword scores includes:
Step S202: extract keywords according to the word frequency in a text and the length of the text, and remove non-keywords by means of a preset non-keyword lexicon.
The word frequency in a text is the frequency with which a keyword occurs in that text. Specifically, the text may be an article, and keywords are extracted according to how often a word occurs in the article and how long the article is.
Non-keywords can be removed from the extracted keywords by means of the preset non-keyword lexicon. Specifically, the extracted words are checked against the non-keyword lexicon and stop words are rejected.
It should also be noted that, to keep the rejection effective, the non-keyword lexicon needs to be updated dynamically according to the results of each run. The application places no specific limitation on this, as long as the lexicon can be updated.
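Step S202 can be illustrated with a minimal sketch. The application does not give a formula, so the length-normalized frequency, the `min_rate` cutoff, and the function name `extract_keywords` are assumptions made for illustration only. The input is assumed to be already segmented into tokens, and English placeholder tokens stand in for Chinese words.

```python
from collections import Counter


def extract_keywords(tokens, non_keywords, min_rate=0.05):
    """Return {word: frequency / text length} for candidate keywords.

    Hypothetical scoring: a word is kept when its length-normalized
    frequency reaches min_rate and it is not in the non-keyword lexicon.
    """
    if not tokens:
        return {}
    length = len(tokens)
    counts = Counter(tokens)
    return {
        word: count / length
        for word, count in counts.items()
        if word not in non_keywords and count / length >= min_rate
    }


# Placeholder article of 8 tokens; "of" and "the" play the role of stop words.
tokens = ["market", "of", "rise", "market", "the", "invest", "market", "of"]
non_keywords = {"of", "the"}
keywords = extract_keywords(tokens, non_keywords, min_rate=0.2)
```

Normalizing by text length keeps a long article from dominating a short one purely because it repeats words more often. The dynamic update of the non-keyword lexicon mentioned above is outside this sketch.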
Step S204: merge, according to the word vectors, the keywords that occur in the text data within the target whole-network scope to obtain homogeneous keywords.
The keywords of the text data within the target whole-network scope that were extracted in the previous step are merged according to the word vectors, which yields the homogeneous keywords.
Specifically, homogeneous keywords are words that differ in wording but express similar meanings. Word vectors make it possible to merge them. For example, the keywords "working" and "work" are homogeneous keywords and need to be merged.
It should be noted that the embodiments of the application do not limit the specific way in which word vectors are used to merge homogeneous keywords, as long as the merging can be carried out.
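Since the application leaves the merging method open, the following is only one possible sketch: keywords are grouped greedily when the cosine similarity of their word vectors reaches a threshold. The toy three-dimensional vectors, the threshold of 0.8, and the function names are assumptions, not values from the application.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_homogeneous(keywords, vectors, threshold=0.8):
    """Greedily group keywords whose vectors are close to a group's first member."""
    groups = []
    for word in keywords:
        for group in groups:
            if cosine(vectors[word], vectors[group[0]]) >= threshold:
                group.append(word)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([word])
    return groups


# Toy vectors; real ones would come from the domain word vector model.
vectors = {
    "working": [0.9, 0.1, 0.0],
    "work": [0.8, 0.2, 0.1],
    "market": [0.0, 0.1, 0.9],
}
groups = merge_homogeneous(["working", "work", "market"], vectors)
```

Comparing each word only with the first member of a group keeps the sketch short. A production system would more likely use a proper clustering step.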
Step S206: calculate the score of each homogeneous keyword according to the frequency and position with which it occurs in the text and the length of the text.
The score of a homogeneous keyword can be calculated from how often it occurs in the text and where it occurs, combined with the length of the text.
Specifically, the text may be an article. The homogeneous keyword score is calculated from the frequency with which the homogeneous keyword occurs in the article, the positions at which it occurs, and the length of the article. For example, the position may be the beginning, the middle, or the end of the article.
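The scoring in step S206 can be sketched as follows. The application names the three inputs (frequency, position, length) but not how they are combined, so the position weights, the 20% and 80% boundaries, and the normalization by article length are all assumed values chosen for illustration.

```python
# Assumed weights: occurrences at the beginning or end count more than the middle.
POSITION_WEIGHTS = {"beginning": 1.5, "middle": 1.0, "end": 1.2}


def position_label(index, length):
    """Classify a token offset as the beginning, middle, or end of the article."""
    ratio = index / length
    if ratio < 0.2:
        return "beginning"
    if ratio >= 0.8:
        return "end"
    return "middle"


def keyword_score(tokens, group):
    """Sum the position weight of every occurrence, divided by article length."""
    length = len(tokens)
    if length == 0:
        return 0.0
    members = set(group)
    total = sum(
        POSITION_WEIGHTS[position_label(i, length)]
        for i, token in enumerate(tokens)
        if token in members
    )
    return total / length


# A 10-token article in which the homogeneous group occurs three times.
tokens = ["work", "a", "b", "c", "working", "d", "e", "f", "g", "work"]
score = keyword_score(tokens, ["work", "working"])
```

All members of a homogeneous group are counted together, so "work" and "working" contribute to a single score.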
Step S208: arrange, according to the scores of the homogeneous keywords, a list of the homogeneous keywords in the text together with their corresponding scores.
The resulting list of homogeneous keywords and their corresponding scores can be visualized.
Specifically, the text may be an article. The calculation above yields, for every article, a list of homogeneous keywords and their corresponding scores.
In the steps above, homogeneous keywords are extracted by combining conventional machine learning, the non-keyword lexicon, and domain-specific word vectors, so that the keywords in multiple pieces of text data and their scores can be obtained accurately.
According to an embodiment of the application, as a preferred option in this embodiment and as shown in Fig. 3, obtaining the text data source weights includes:
Step S302: score a text data source according to prior knowledge and the average read count of the text data from that source to obtain an initial text data source weight.
The prior knowledge may be knowledge held by experts or technicians in the specific domain.
Scoring text data sources according to prior knowledge requires collecting the scoring results of experts or technicians in the specific domain over a fairly large data set.
The average read count of the text data is only one preferred option in this embodiment. The embodiments of the application place no specific limitation on it, as long as the scoring requirement can be met.
Specifically, when the weight of a text data source is determined, the weight is cold-started (built from scratch) from the scores given by domain experts and the average read count of the text data source.
Step S304: obtain the number of times hotspots appear in the text data source, so that the initial text data source weight is dynamically adjusted according to that number.
The more often hotspots appear in a text data source, the greater the weight of that source.
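Steps S302 and S304 can be sketched as a cold start followed by a monotonic adjustment. The application only states that the weight grows with the number of hotspots, so the linear blend, the `alpha`, `step`, and `cap` parameters, and both function names are hypothetical.

```python
def initial_weight(expert_score, avg_reads, max_avg_reads, alpha=0.5):
    """Cold-start weight: blend an expert score in [0, 1] with normalized reads.

    alpha is an assumed mixing factor, not a value from the application.
    """
    read_score = avg_reads / max_avg_reads if max_avg_reads else 0.0
    return alpha * expert_score + (1 - alpha) * read_score


def adjust_weight(weight, hotspot_count, step=0.01, cap=2.0):
    """Raise the weight with the number of hotspots seen in the source, up to a cap."""
    return min(weight * (1 + step * hotspot_count), cap)


# A source scored 0.8 by experts whose articles average half the maximum reads.
w0 = initial_weight(expert_score=0.8, avg_reads=5000, max_avg_reads=10000)
# The same source after 10 hotspots have appeared in it.
w1 = adjust_weight(w0, hotspot_count=10)
```

The cap is a design choice of this sketch: it prevents a single prolific source from overwhelming the keyword scores in the later heat calculation.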
According to an embodiment of the application, as a preferred option in this embodiment and as shown in Fig. 4, extracting Chinese public sentiment hotspots from the text data source weights and the keyword scores includes:
Step S402: determine the text data source weight from the result of dynamically adjusting the initial text data source weight.
The weight of the text data source can be determined from the result of dynamically adjusting the initial text data source weight.
Step S404: determine the scores of keywords in the text data that differ in wording but share the same meaning.
Scoring results can be obtained for the keywords in the text data that differ in wording but share the same meaning. Preferably, these keywords are the homogeneous keywords and have been screened against the non-keyword lexicon.
Step S406: obtain, according to the text data source weight and the keyword scores, the topic to which a text belongs and the heat value of that topic.
From the text data source weight and the keyword scores, the topic to which the text belongs and the heat value of that topic can be obtained. Specifically, for example, the list of homogeneous keywords can serve as the topic label, so that the topic to which the text belongs and its heat value can be displayed.
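A minimal sketch of step S406 follows, assuming that the heat value is the source weight multiplied by the sum of the homogeneous keyword scores. The application does not state the combining formula, so this product and the function name `topic_heat` are assumptions. The topic label is the keyword list itself, as suggested above.

```python
def topic_heat(source_weight, keyword_scores):
    """Return (topic label, heat value) for one text.

    The label is the homogeneous keywords ordered by descending score.
    The heat formula (weight times summed scores) is an assumption.
    """
    ranked = sorted(keyword_scores.items(), key=lambda item: item[1], reverse=True)
    label = tuple(word for word, _ in ranked)
    heat = source_weight * sum(keyword_scores.values())
    return label, heat


label, heat = topic_heat(1.5, {"interest rate": 0.2, "stock market": 0.4})
```

Under this assumption, the same article published by a source with a higher weight receives a proportionally higher heat value.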
According to an embodiment of the application, as a preferred option in this embodiment and as shown in Fig. 5, after Chinese public sentiment hotspots are extracted from the text data source weights and the keyword scores, the method also includes a step of tracking the Chinese public sentiment hotspots, which includes:
Step S502: in a public sentiment database, add up the heat values of keywords with the same meaning that occur in different text data to obtain a total score.
The public sentiment database may be a Chinese whole-network database built up over a long period, which supports real-time cleaning and dynamic updating. The content stored in the public sentiment database includes the data source, time, text content, domain, and so on.
It should be noted that the embodiments of the application place no specific limitation on the public sentiment database, as long as it can be built and maintained as required.
In the public sentiment database, the heat values of keywords with the same meaning that occur in different text data are added up to obtain the total score.
Step S504: determine, according to the total score, the Chinese public sentiment hotspot to be tracked.
The Chinese public sentiment hotspot to be tracked (according to customer or industry requirements) is determined from the total score obtained in the previous step.
Step S506: extract, from the text data within the target whole-network scope, the text data that contains a keyword with the same meaning as that keyword and in which the heat value of the keyword is greater than a preset threshold.
The preset threshold may be selected according to the specific domain, and the embodiments of the application place no specific limitation on it.
Specifically, all data in the designated domain and within the specified time range are read from the public sentiment database, the heat analysis algorithm is used to obtain the homogeneous keyword list and scores for each article, the heat values of identical homogeneous keywords in different articles are added up, and the hotspots that need to be tracked in that time range are determined from the totals.
The designated domain and the specified time range can be set as required, for example the results for the hour preceding the current time in a specific domain.
Articles in the whole-network data that share the same homogeneous keyword and whose article heat value is greater than the specified threshold are extracted and ordered by their time labels to obtain the public sentiment tracking result, which is then visualized.
When the heat value of an article is obtained, the heat value produced by the specific heat calculation method used also needs to be decayed according to how far the article's time is from the current time.
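The tracking steps S502 to S506, together with the time decay, can be sketched as follows. The exponential half-life decay, the record layout (`time`, `age_hours`, `keyword_heat`), and the function names are assumptions. The application only says that heat values decay with distance from the current time, and it does not specify the decay function.

```python
import math
from collections import defaultdict


def decayed_heat(heat, age_hours, half_life_hours=6.0):
    """Assumed exponential decay: the heat halves every half_life_hours."""
    return heat * math.pow(0.5, age_hours / half_life_hours)


def track_hotspot(articles, threshold):
    """Pick the keyword with the highest total decayed heat and build its timeline."""
    totals = defaultdict(float)
    for article in articles:
        for keyword, heat in article["keyword_heat"].items():
            totals[keyword] += decayed_heat(heat, article["age_hours"])
    if not totals:
        return None, []
    hotspot = max(totals, key=totals.get)
    # Keep the articles whose decayed heat for the hotspot exceeds the threshold,
    # ordered by their time labels.
    timeline = sorted(
        (
            a for a in articles
            if decayed_heat(a["keyword_heat"].get(hotspot, 0.0), a["age_hours"]) > threshold
        ),
        key=lambda a: a["time"],
    )
    return hotspot, timeline


# Hypothetical records; "time" is a zero-padded label so that string order is time order.
articles = [
    {"time": "15:00", "age_hours": 0.0, "keyword_heat": {"stock market": 1.5}},
    {"time": "09:00", "age_hours": 6.0, "keyword_heat": {"stock market": 2.0, "weather": 0.5}},
    {"time": "12:00", "age_hours": 3.0, "keyword_heat": {"weather": 0.4}},
]
hotspot, timeline = track_hotspot(articles, threshold=0.8)
```

Ordering the surviving articles by time gives the development timeline of the hotspot, which is the tracking result that would then be visualized.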
It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
According to an embodiment of the application, a data processing device for Chinese public sentiment is also provided for implementing the above method. As shown in Fig. 6, the device includes: a determining module 10, configured to determine word vectors in a preset domain; a screening module 20, configured to screen keywords out of a preset lexicon according to the word vectors in the preset domain and generate keyword scores; an obtaining module 30, configured to obtain text data source weights; and an extraction module 40, configured to extract Chinese public sentiment hotspots from the text data source weights and the keyword scores.
In the determining module 10 of the embodiment of the application, the preset domain may be a designated domain or a specific domain. A designated domain may be determined by the needs of different users, and a specific domain by the characteristics of different industries. The embodiments of the application place no limitation on this, as long as the domain can be determined.
A word vector is the vector representation of a word in natural language processing.
Specifically, to determine the word vectors in a specific domain, a corpus from that domain is used as training data to obtain a word vector model. While the model is being tested, experts or long-term practitioners in the domain are invited to compare the training results obtained with different parameters and then decide the two key training parameters, word vector dimension and window size (alternatively, the results of machine training may be used). The word vectors of the specific domain are obtained in this way.
Training word vectors separately for each domain on that domain's own corpus allows the public sentiment hotspots of the domain to be extracted more accurately.
In the screening module 20 of the embodiment of the application, the keywords can be screened out of the preset lexicon according to the word vectors in the preset domain obtained above, and the keyword scores are then generated by scoring the keywords obtained.
The preset lexicon may be selected according to the specific domain. Specifically, when the preset lexicon is configured as a non-keyword lexicon, it can be used to solve the problem of redundant or interfering information appearing while Chinese public sentiment hotspots are extracted.
In the obtaining module 30 of the embodiment of the application, the text data source weight is a parameter used in calculating heat values and needs to be obtained in advance.
It should be noted that the embodiments of the application place no specific limitation on the text data source weight, as long as it can be obtained.
In the extraction module 40 of the embodiment of the application, Chinese public sentiment hotspots can be extracted accurately from the text data source weights and the corresponding keyword scores obtained above. The Chinese public sentiment hotspots may include the hot topics to which multiple pieces of text data belong.
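The relationship between modules 10 to 40 can be pictured as four pluggable components called in sequence. The class below is only a structural sketch: the class name, the attribute names, and the callable signatures are assumptions, and the stub functions in the usage example stand in for real implementations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SentimentDevice:
    """Hypothetical wiring of the four modules of Fig. 6."""

    determine: Callable  # module 10: domain corpus -> word vectors
    screen: Callable     # module 20: (texts, word vectors) -> keyword scores
    obtain: Callable     # module 30: sources -> text data source weights
    extract: Callable    # module 40: (weights, scores) -> hotspots

    def run(self, corpus, texts, sources):
        vectors = self.determine(corpus)
        scores = self.screen(texts, vectors)
        weights = self.obtain(sources)
        return self.extract(weights, scores)


# Stub modules, used only to show the data flow between the four stages.
device = SentimentDevice(
    determine=lambda corpus: {},
    screen=lambda texts, vectors: {"t1": 0.5},
    obtain=lambda sources: {"t1": 2.0},
    extract=lambda weights, scores: {k: weights[k] * scores[k] for k in scores},
)
result = device.run(corpus=[], texts=[], sources=[])
```

Keeping each module behind a plain callable mirrors the statement above that the application places few limits on how each stage is implemented.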
According to an embodiment of the application, as a preferred option in this embodiment and as shown in Fig. 7, the screening module 20 includes: an extraction unit 201, configured to extract keywords according to the word frequency in a text and the length of the text, and to remove non-keywords by means of a preset non-keyword lexicon; a merging unit 202, configured to merge, according to the word vectors, the keywords that occur in the text data within the target whole-network scope to obtain homogeneous keywords; a score calculation unit 203, configured to calculate the score of each homogeneous keyword according to the frequency and position with which it occurs in the text and the length of the text; and a list unit 204, configured to arrange, according to the scores of the homogeneous keywords, a list of the homogeneous keywords in the text together with their corresponding scores.
In the extraction unit 201 of the embodiment of the application, the word frequency in a text is the frequency with which a keyword occurs in that text. Specifically, the text may be an article, and keywords are extracted according to how often a word occurs in the article and how long the article is.
Non-keywords can be removed from the extracted keywords by means of the preset non-keyword lexicon. Specifically, the extracted words are checked against the non-keyword lexicon and stop words are rejected.
It should also be noted that, to keep the rejection effective, the non-keyword lexicon needs to be updated dynamically according to the results of each run. The application places no specific limitation on this, as long as the lexicon can be updated.
In the merging unit 202 of the embodiment of the application, the keywords of the text data within the target whole-network scope that were extracted above are merged according to the word vectors, which yields the homogeneous keywords.
Specifically, homogeneous keywords are words that differ in wording but express similar meanings. Word vectors make it possible to merge them. For example, the keywords "working" and "work" are homogeneous keywords and need to be merged.
It should be noted that the embodiments of the application do not limit the specific way in which word vectors are used to merge homogeneous keywords, as long as the merging can be carried out.
In the score calculation unit 203 of the embodiment of the application, the score of a homogeneous keyword can be calculated from how often it occurs in the text and where it occurs, combined with the length of the text.
Specifically, the text may be an article. The homogeneous keyword score is calculated from the frequency with which the homogeneous keyword occurs in the article, the positions at which it occurs, and the length of the article. For example, the position may be the beginning, the middle, or the end of the article.
In the list unit 204 of this embodiment, the homogeneity keywords in the text are arranged, together with their corresponding scores, into a homogeneity keyword list, which can be visualized.
Specifically, the text may be an article. The calculation above yields, for every article, a homogeneity keyword list and the corresponding scores.
In the above steps, homogeneity keywords are extracted by combining conventional machine learning, the non-keyword dictionary, and term vectors for the specific field, so that keywords and their scores can be obtained accurately across multiple text data.
According to an embodiment of this application, preferably, as shown in Figure 8, the acquisition module 30 includes: an initialization unit 301, configured to score the text data source according to prior knowledge and the average amount of reading of the text data in the text data source, to obtain an initialization text data source weight; and a dynamic adjustment unit 302, configured to obtain the number of hot spot occurrences in the text data source, so that the initialization text data source weight is dynamically adjusted according to that number.
The prior knowledge used in the initialization unit 301 of this embodiment may be knowledge known to experts or technical personnel in the specific field.
Scoring the text data source according to prior knowledge requires collecting, over a larger data set, the scoring results of experts or technical personnel in the specific field.
The average amount of reading of the text data is only one preferred implementation in this embodiment. The embodiments of this application do not specifically limit it, as long as the scoring requirement is satisfied.
Specifically, when the weight of the text data source is determined, the text data source weight is cold-started (the data weight is built from scratch) from the domain experts' scores and the average amount of reading of the text data source.
In the dynamic adjustment unit 302 of this embodiment, the more often hot spots occur in the text data source, the greater the weight of that text data source.
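The cold start and the dynamic adjustment described above can be sketched as follows. The application does not specify how the expert score and the average amount of reading are combined, nor how the weight grows with the number of hot spot occurrences, so the linear blend, the logarithmic read scaling, the multiplicative update, and all parameter values here are assumptions.

```python
import math


def initial_weight(expert_score, avg_reads, alpha=0.7):
    """Cold-start weight from an expert score in [0, 1] and the average read count."""
    # Log-scale the read count; one million reads or more saturate at 1.0.
    read_score = min(math.log10(1 + avg_reads) / 6, 1.0)
    return alpha * expert_score + (1 - alpha) * read_score


def adjust_weight(weight, hotspot_count, step=0.05, cap=2.0):
    """Raise the weight as more hot spots are observed in the source, up to a cap."""
    return min(weight * (1 + step * hotspot_count), cap)


w0 = initial_weight(expert_score=0.8, avg_reads=0)
w1 = adjust_weight(w0, hotspot_count=4)
print(round(w0, 2), round(w1, 3))
```

The cap keeps a single prolific source from dominating the later hot value calculation; whether such a cap is desirable depends on the deployment.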
According to an embodiment of this application, preferably, as shown in Figure 9, the extraction module 40 includes: a weight unit 401, configured to determine the text data source weight from the result of dynamically adjusting the initialization text data source weight; a homogeneity keyword unit 402, configured to determine the score of keywords that occur in the text data with different wording but the same meaning; and a hot value calculation unit 403, configured to obtain the topic to which the text belongs and the hot value corresponding to the topic according to the text data source weight and the keyword score.
In the weight unit 401 of this embodiment, the weight of the text data source can be determined from the result of dynamically adjusting the initialization text data source weight.
In the homogeneity keyword unit 402 of this embodiment, a scoring result can be obtained for keywords that occur in the text data with different wording but the same meaning. Preferably, such keywords are homogeneity keywords that have already been screened against the non-keyword dictionary.
In the hot value calculation unit 403 of this embodiment, the topic to which the text belongs and the hot value corresponding to the topic can be obtained according to the text data source weight and the keyword score. For example, the homogeneity keyword list can serve as the topic label, so that the topic of the text and its corresponding hot value can be displayed.
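One simple way to combine the two inputs, shown here purely as an assumption, is to multiply each homogeneity keyword score by the weight of the source that published the text and to use the top-scoring keywords as the topic label. The function name `topic_heat` and the `label_size` parameter are hypothetical.

```python
def topic_heat(source_weight, keyword_scores, label_size=3):
    """Return a topic label and per-keyword hot values for one text.

    `keyword_scores` maps each homogeneity keyword to its score in the text.
    """
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(keyword_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    topic_label = [word for word, _ in ranked[:label_size]]
    # Assumed combination rule: hot value = source weight * keyword score.
    heat = {word: source_weight * score for word, score in ranked}
    return topic_label, heat


label, heat = topic_heat(2.0, {"merger": 0.25, "earnings": 0.5})
print(label, heat)
```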
According to an embodiment of this application, preferably, as shown in Figure 10, the device further includes a tracking module 50, which includes: a score unit 501, configured to add up, in the public sentiment database, the hot values of keywords with the same meaning that occur in different text data, to obtain a total score value; a determination unit 502, configured to determine the Chinese public sentiment hot spot to be tracked according to the total score value; and an extraction unit 503, configured to extract, from the text data in the target network-wide range, the text data that contains a keyword with the same meaning as the keyword and in which the hot value of that keyword is greater than a preset threshold.
The public sentiment database used in the score unit 501 of this embodiment may be a Chinese whole-network database built up over a long period, which supports real-time cleaning and dynamic updating. The content stored in the public sentiment database includes the data source, time, text content, field, and so on.
It should be noted that the embodiments of this application do not specifically limit the public sentiment database, as long as the requirements for establishing and maintaining it are satisfied.
In the public sentiment database, the hot values of keywords with the same meaning that occur in different text data are added together to obtain the total score value.
In the determination unit 502 of this embodiment, the Chinese public sentiment hot spot to be tracked (according to customer demand or industry demand) is determined from the total score value obtained in the previous step.
In the extraction unit 503 of this embodiment, the text data that contains a keyword with the same meaning as the keyword, and in which the hot value of that keyword is greater than the preset threshold, is extracted from the text data in the target network-wide range. The preset threshold can be chosen according to the specific field and is not specifically limited in the embodiments of this application.
Specifically, all data in a designated field and within a specified time range is read from the public sentiment database, and the hot value analysis algorithm is used to obtain the homogeneity keyword list and scores of every article. The hot values of identical homogeneity keywords in different articles are added together, and the hot spots that need to be tracked in that time range are determined according to the resulting score values.
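The aggregation step can be sketched as a sum of per-article hot values keyed by homogeneity keyword, followed by a ranking. The record layout (a `heat` dictionary per article), the `top_n` cut-off, and the function names are assumptions; the application only states that the hot values are added and that the hot spots are determined from the totals.

```python
from collections import defaultdict


def total_heat(articles):
    """Add up the hot values of identical homogeneity keywords across articles."""
    totals = defaultdict(float)
    for article in articles:
        for keyword, heat in article["heat"].items():
            totals[keyword] += heat
    return dict(totals)


def hotspots_to_track(totals, top_n=1):
    """Pick the keywords with the highest total score as hot spots to track."""
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [keyword for keyword, _ in ranked[:top_n]]


# Hypothetical articles already restricted to one field and one time range.
articles = [
    {"title": "A", "heat": {"earnings": 1.0, "merger": 0.5}},
    {"title": "B", "heat": {"earnings": 0.5}},
    {"title": "C", "heat": {"weather": 0.25}},
]
totals = total_heat(articles)
tracked = hotspots_to_track(totals, top_n=1)
print(totals, tracked)
```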
The designated field and the specified time range can be determined according to demand, for example the results in a specific field for the hour preceding the current time.
Articles in the whole-network data that contain the same homogeneity keyword, and in which the hot value of that keyword is greater than a specified threshold, are extracted and arranged by time tag to obtain the public sentiment tracking result, which is then visualized.
When the hot value of an article is obtained, the hot value produced by the particular hot value calculation method in use must also be decayed according to the length of time elapsed up to the current time.
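The tracking and decay steps can be illustrated together. The application does not name a decay function, so the exponential half-life decay below is an assumption, as are the 24-hour half-life, the record fields `title`, `published`, and `heat`, and the function names.

```python
from datetime import datetime, timedelta


def decayed_heat(heat, published, now, half_life_hours=24.0):
    """Halve the hot value for every `half_life_hours` elapsed since publication."""
    age_hours = max((now - published).total_seconds() / 3600, 0.0)
    return heat * 0.5 ** (age_hours / half_life_hours)


def track(articles, keyword, threshold, now):
    """Return (time, title, hot value) for articles above the threshold, oldest first."""
    hits = []
    for article in articles:
        heat = article["heat"].get(keyword)
        if heat is None:
            continue  # The article does not contain the tracked keyword.
        current = decayed_heat(heat, article["published"], now)
        if current > threshold:
            hits.append((article["published"], article["title"], current))
    # Arrange the matching articles by their time tag.
    return sorted(hits)


now = datetime(2019, 3, 20, 12, 0)
articles = [
    {"title": "B", "published": now, "heat": {"earnings": 0.8}},
    {"title": "A", "published": now - timedelta(hours=24), "heat": {"earnings": 2.0}},
    {"title": "C", "published": now - timedelta(hours=48), "heat": {"earnings": 1.0}},
    {"title": "D", "published": now, "heat": {"weather": 3.0}},
]
timeline = track(articles, "earnings", threshold=0.5, now=now)
print([title for _, title, _ in timeline])
```

Article C is dropped because two half-lives reduce its hot value to 0.25, which is below the threshold, and article D is dropped because it does not contain the tracked keyword.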
Obviously, those skilled in the art should understand that each module or step of this application described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device. Alternatively, they can each be fabricated as an individual integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. In this way, this application is not limited to any specific combination of hardware and software.
The foregoing are merely preferred embodiments of this application and are not intended to limit this application. For those skilled in the art, this application may have various modifications and changes. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this application shall fall within the scope of protection of this application.
Claims (10)
1. A data processing method for Chinese public sentiment, characterized by comprising:
determining term vectors in a preset field;
screening keywords out of a preset dictionary according to the term vectors in the preset field and generating keyword scores;
obtaining a text data source weight; and
extracting Chinese public sentiment hot spots by means of the text data source weight and the keyword scores.
2. The data processing method for Chinese public sentiment according to claim 1, characterized in that screening keywords out of the preset dictionary according to the term vectors in the preset field and generating keyword scores comprises:
extracting keywords according to the word frequency occurring in the text and the text length, and removing non-keywords by means of a preset non-keyword dictionary;
merging the keywords occurring in the text data in the target network-wide range according to the term vectors to obtain homogeneity keywords;
calculating the score of each homogeneity keyword according to the frequency and position with which the homogeneity keyword occurs in the text and the length of the text; and
arranging, according to the scores of the homogeneity keywords, a homogeneity keyword list with corresponding scores for the text.
3. The data processing method for Chinese public sentiment according to claim 1, characterized in that obtaining the text data source weight comprises:
scoring the text data source according to prior knowledge and the average amount of reading of the text data in the text data source, to obtain an initialization text data source weight; and
obtaining the number of hot spot occurrences in the text data source, so that the initialization text data source weight is dynamically adjusted according to the number of hot spot occurrences in the text data source.
4. The data processing method for Chinese public sentiment according to claim 1, characterized in that extracting Chinese public sentiment hot spots by means of the text data source weight and the keyword scores comprises:
determining the text data source weight from the result of dynamically adjusting the initialization text data source weight;
determining the score of keywords that occur in the text data with different wording but the same meaning; and
obtaining, according to the text data source weight and the keyword score, the topic to which the text belongs and the hot value corresponding to the topic.
5. The data processing method for Chinese public sentiment according to claim 1, characterized in that, after extracting Chinese public sentiment hot spots by means of the text data source weight and the keyword scores, the method further comprises a step of tracking the Chinese public sentiment hot spot, the step of tracking the Chinese public sentiment hot spot comprising:
adding up, in the public sentiment database, the hot values of keywords with the same meaning that occur in different text data, to obtain a total score value;
determining, according to the total score value, the Chinese public sentiment hot spot to be tracked; and
extracting, from the text data in the target network-wide range, the text data that contains a keyword with the same meaning as the keyword and in which the hot value of that keyword is greater than a preset threshold.
6. A data processing device for Chinese public sentiment, characterized by comprising:
a determining module, configured to determine term vectors in a preset field;
a screening module, configured to screen keywords out of a preset dictionary according to the term vectors in the preset field and generate keyword scores;
an acquisition module, configured to obtain a text data source weight; and
an extraction module, configured to extract Chinese public sentiment hot spots by means of the text data source weight and the keyword scores.
7. The data processing device for Chinese public sentiment according to claim 6, characterized in that the screening module comprises:
an extraction unit, configured to extract keywords according to the word frequency occurring in the text and the text length, and to remove non-keywords by means of a preset non-keyword dictionary;
a combining unit, configured to merge the keywords occurring in the text data in the target network-wide range according to the term vectors to obtain homogeneity keywords;
a score calculation unit, configured to calculate the score of each homogeneity keyword according to the frequency and position with which the homogeneity keyword occurs in the text and the length of the text; and
a list unit, configured to arrange, according to the scores of the homogeneity keywords, a homogeneity keyword list with corresponding scores for the text.
8. The data processing device for Chinese public sentiment according to claim 6, characterized in that the acquisition module comprises:
an initialization unit, configured to score the text data source according to prior knowledge and the average amount of reading of the text data in the text data source, to obtain an initialization text data source weight; and
a dynamic adjustment unit, configured to obtain the number of hot spot occurrences in the text data source, so that the initialization text data source weight is dynamically adjusted according to the number of hot spot occurrences in the text data source.
9. The data processing device for Chinese public sentiment according to claim 6, characterized in that the extraction module comprises:
a weight unit, configured to determine the text data source weight from the result of dynamically adjusting the initialization text data source weight;
a homogeneity keyword unit, configured to determine the score of keywords that occur in the text data with different wording but the same meaning; and
a hot value calculation unit, configured to obtain, according to the text data source weight and the keyword score, the topic to which the text belongs and the hot value corresponding to the topic.
10. The data processing device for Chinese public sentiment according to claim 6, characterized by further comprising a tracking module, the tracking module comprising:
a score unit, configured to add up, in the public sentiment database, the hot values of keywords with the same meaning that occur in different text data, to obtain a total score value;
a determination unit, configured to determine, according to the total score value, the Chinese public sentiment hot spot to be tracked; and
an extraction unit, configured to extract, from the text data in the target network-wide range, the text data that contains a keyword with the same meaning as the keyword and in which the hot value of that keyword is greater than a preset threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910213894.5A CN109948161A (en) | 2019-03-20 | 2019-03-20 | Data processing method and device for Chinese public sentiment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910213894.5A CN109948161A (en) | 2019-03-20 | 2019-03-20 | Data processing method and device for Chinese public sentiment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109948161A (en) | 2019-06-28 |
Family
ID=67011133
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910213894.5A Pending CN109948161A (en) | 2019-03-20 | 2019-03-20 | Data processing method and device for Chinese public sentiment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948161A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046264A (en) * | 2019-11-29 | 2020-04-21 | 江西省天轴通讯有限公司 | Public opinion cue processing method, system, readable storage medium and computer equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160180245A1 (en) * | 2014-12-19 | 2016-06-23 | Medidata Solutions, Inc. | Method and system for linking heterogeneous data sources |
CN107665222A (en) * | 2016-07-29 | 2018-02-06 | 北京国双科技有限公司 | The expanding method and device of keyword |
CN108959383A (en) * | 2018-05-31 | 2018-12-07 | 平安科技(深圳)有限公司 | Analysis method, device and the computer readable storage medium of network public-opinion |
CN109145216A (en) * | 2018-08-29 | 2019-01-04 | 中国平安保险(集团)股份有限公司 | Network public-opinion monitoring method, device and storage medium |
CN109241429A (en) * | 2018-09-05 | 2019-01-18 | 食品安全与营养(贵州)信息科技有限公司 | A kind of food safety public sentiment monitoring method and system |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046264A (en) * | 2019-11-29 | 2020-04-21 | 江西省天轴通讯有限公司 | Public opinion cue processing method, system, readable storage medium and computer equipment |
CN111046264B (en) * | 2019-11-29 | 2023-07-21 | 江西省天轴通讯有限公司 | Public opinion cue processing method, system, readable storage medium and computer device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20200407. Address after: 100041 403, floor 4, building 10, yard 30, Shixing street, Shijingshan District, Beijing. Applicant after: BEIJING YULESHIJIE EDUCATION TECHNOLOGY Co.,Ltd. Address before: 100095 Beijing Haidian District Baijiatuan Shangpin Garden 2 Floor 205. Applicant before: BEIJING SHENHAI JUJING INFORMATION TECHNOLOGY Co.,Ltd. |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190628 |