CN111931032A - Public opinion event discovery method and device and computing equipment - Google Patents


Info

Publication number
CN111931032A
Authority
CN
China
Prior art keywords
event
document
period
public sentiment
browsing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010658727.4A
Other languages
Chinese (zh)
Inventor
李鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHEZHI HULIAN (BEIJING) SCIENCE & TECHNOLOGY CO LTD
Original Assignee
CHEZHI HULIAN (BEIJING) SCIENCE & TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHEZHI HULIAN (BEIJING) SCIENCE & TECHNOLOGY CO LTD filed Critical CHEZHI HULIAN (BEIJING) SCIENCE & TECHNOLOGY CO LTD
Priority to CN202010658727.4A priority Critical patent/CN111931032A/en
Publication of CN111931032A publication Critical patent/CN111931032A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9532Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a public sentiment event discovery method, which is suitable for being executed in computing equipment, wherein the computing equipment is connected with a data storage device, the data storage device is suitable for storing a plurality of document contents, and each document content is associated with browsing times, and the method comprises the following steps: clustering a plurality of document contents in a period with a preset time length to generate at least one document category, wherein all the document contents in each document category correspond to the same event; acquiring the browsing times corresponding to a target event based on the browsing times associated with the document content, wherein the target event is an event corresponding to any document category in at least one document category; carrying out growth curve fitting on the historical browsing times of the target event; and when the first characteristic time point of the growth curve obtained by fitting is larger than zero, identifying the target event as a public sentiment event, and determining that the public sentiment event enters an outbreak period. The invention also discloses corresponding computing equipment and a storage medium.

Description

Public opinion event discovery method and device and computing equipment
Technical Field
The invention relates to the technical field of information processing, in particular to a public sentiment event discovery method, a public sentiment event discovery device and computing equipment.
Background
With the rapid growth in the number of internet users and the spread of devices that combine photographing, video shooting and internet access, and given the real-time, interactive and random nature of the internet, information about an emergency in any industry can rapidly trigger online public opinion once it is exposed. How to monitor public opinion events automatically is therefore an important problem for maintaining the stability of an industry. At present there are two main research approaches to public opinion events.
The first method judges the public sentiment evolution stage based on the factors that influence public sentiment evolution. It uses historical data to backtrack the development trend of the public sentiment data indexes corresponding to public sentiment events of different event types, thereby providing an empirical basis for judging future public sentiment events of the same type. However, the method is highly subjective, and different researchers may obtain different evolution stages. In addition, for types of public sentiment events that have not appeared in history, it is difficult to find historical data for backtracking, so the development and evolution of such events cannot be reasonably predicted. The applicable scenarios are therefore limited, and the method struggles to cope with the diversity of real internet public sentiment.
The other method is based on a cellular automaton simulation model. It uses a dynamic model to simulate how the number of individuals and the number of posts involved change over time while a given network public sentiment forms. However, in actual service this simulation modeling approach requires a separate simulation and prediction for each public sentiment event, so it is costly and not timely.
Disclosure of Invention
To this end, the present invention provides a public opinion event discovery method, a computing device and a storage medium in an effort to solve or at least alleviate at least one of the problems presented above.
According to an aspect of the present invention, there is provided a public opinion event discovery method, the method being adapted to be executed in a computing device, the computing device being connected to a data storage, the data storage being adapted to store a plurality of document contents, each document content being associated with a browsing number, the method comprising the steps of: clustering a plurality of document contents in a period with a preset time length to generate at least one document category, wherein all the document contents in each document category correspond to the same event; acquiring the browsing times corresponding to a target event based on the browsing times associated with the document content, wherein the target event is an event corresponding to any document category in at least one document category; carrying out growth curve fitting on the historical browsing times of the target event; and when the first characteristic time point of the growth curve obtained by fitting is larger than zero, identifying the target event as a public sentiment event, and determining that the public sentiment event enters an outbreak period.
Optionally, in the public opinion event discovery method according to the present invention, when the second characteristic time points of the fitted growth curve are all greater than a preset value for a predetermined number of consecutive cycles, it is determined that the public opinion event enters a calm period.
Optionally, in the public opinion event discovery method according to the present invention, before the plurality of document contents are clustered in the period with the predetermined time length, the method further comprises the step of: filtering the plurality of documents to remove documents that are not negative content.
Optionally, in the public opinion event discovery method according to the present invention, all document contents under each document category correspond to the same event, the document contents further include a document title, and after at least one document category is generated, the method further includes: and acquiring the associated browsing times of all the document contents in each document category, and taking the title of the document content with the highest associated browsing time as the corresponding event topic name.
Optionally, in the public opinion event discovery method according to the present invention, the obtaining of the browsing times corresponding to the target event based on the browsing times associated with the document content includes: and acquiring the associated browsing times of all the document contents in each document type, and adding to obtain the browsing times corresponding to the target event.
Optionally, in the public opinion event discovery method according to the present invention, the public opinion event entering the outbreak period means that the event moves from the stage in which it has just occurred and the number of browsing people changes slowly from zero to the stage in which the number of browsing people increases rapidly.
Optionally, in the public opinion event discovery method according to the present invention, the public opinion event entering the calm period means that, after the outbreak, the number of browsing people moves from the stage of rapid increase to a stage of slow increase until it remains nearly unchanged.
Optionally, in the public opinion event discovery method according to the present invention, the growth curve is a Gompertz curve.
Optionally, in the public opinion event discovery method according to the present invention, clustering a plurality of document contents for a predetermined period of time to generate at least one document category includes the steps of: for each document content in the document contents, performing word segmentation processing on the document content to obtain a plurality of words, and calculating a word vector of each word; calculating the weight of each participle, and determining a keyword based on the weight of the participle; acquiring a text vector of the document content according to the word vector and the weight of the keyword; and clustering according to the text vector of each document content so as to divide the document contents with high text vector similarity into the same document category.
Optionally, in the public opinion event discovery method according to the present invention, the calculating the weight of each participle includes the steps of: and evaluating the importance degree of each participle to the document content by using a TF-IDF statistical method, wherein the higher the importance degree is, the higher the weight of the participle is.
Optionally, in the public opinion event discovery method according to the present invention, after clustering a plurality of document contents in a predetermined time period to generate at least one document category, the method further includes: when the next period starts, acquiring a plurality of newly added document contents, and judging whether the similarity of the text vector of each newly added document content and the text vector of the document contents under the current existing document category is smaller than a preset threshold value or not; if the document content is smaller than the preset threshold value, the document content is divided into the existing document categories.
According to still another aspect of the present invention, there is provided a public opinion event discovery apparatus, the apparatus being connected to a data storage device, the data storage device being adapted to store a plurality of document contents, each document content being associated with a browsing number, the apparatus comprising: the clustering module is used for clustering a plurality of document contents in a period with preset time length to generate at least one document category, wherein all the document contents in each document category correspond to the same event; the browsing frequency acquisition module is used for acquiring the browsing frequency corresponding to a target event according to the browsing frequency associated with the document content, wherein the target event is an event corresponding to any one of at least one document category; the growth curve fitting module is used for performing growth curve fitting on the historical browsing times of the target event; and the public sentiment event identification module is used for identifying the target event as a public sentiment event when the first characteristic time point of the growth curve obtained by fitting is greater than zero, and determining that the public sentiment event enters an outbreak period.
According to yet another aspect of the invention, there is provided a computing device comprising at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be suitable for execution by the at least one processor, the program instructions comprising instructions for performing the public opinion event discovery method according to the present invention.
According to still another aspect of the present invention, there is provided a readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the public opinion event discovery method of the present invention.
According to the technical scheme of the invention, a large amount of user-generated content on the Internet is analyzed, where each piece of user-generated content is one document content. A plurality of document contents are clustered within one period to obtain at least one document category, and all document contents under each document category correspond to the same event, which supports the discovery of public sentiment events on different themes across multiple service fields. The browsing times corresponding to the target event are then acquired from the browsing times associated with the document contents under the document category corresponding to the target event, and a growth curve is fitted to the historical browsing times of the target event. The growth curve is established according to the rule of public sentiment propagation and comprises an outbreak period and a calm period. Whether the event has entered the outbreak period is judged according to the first characteristic time point of the fitted growth curve, and events that have not entered the outbreak period need no processing, so that public sentiment events are supervised automatically.
Here, the public sentiment event entering the outbreak period means that the event moves from the stage in which it has just occurred and the number of browsing people changes slowly from zero to the stage in which the number of browsing people increases rapidly; the public sentiment event entering the calm period means that, after the outbreak, the number of browsing people moves from the stage of rapid increase to a stage of slow increase until it remains nearly unchanged. Because the basis for discovering public sentiment events and predicting their evolution is quantified uniformly, the time nodes of the different stages can be predicted without building a separate simulation model for each public sentiment event, each industry can establish different response processes for different stages, and dynamic, rapid decision support is provided.
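As a rough illustration of how characteristic time points can be read off a fitted growth curve, the sketch below assumes a Gompertz curve in the form y(t) = K·exp(-b·exp(-c·t)) and assumes that the characteristic time points are the points of maximum acceleration, inflection, and maximum deceleration of that curve. Neither the parameterization nor this definition of the characteristic points is stated in this part of the text, and the function names are hypothetical. The fitting itself (estimating K, b and c from historical browsing times) is left out and could be done with any nonlinear least-squares routine.

```python
import math

# Assumed parameterization (not given in the text):
#   y(t) = K * exp(-b * exp(-c * t)),  with K > 0, b > 0, c > 0
def gompertz(t, K, b, c):
    """Value of the assumed Gompertz growth curve at time t."""
    return K * math.exp(-b * math.exp(-c * t))

# ln((3 + sqrt(5)) / 2): offset between the inflection point and the points
# where the third derivative of the assumed curve vanishes.
PHI = math.log((3 + math.sqrt(5)) / 2)

def characteristic_points(b, c):
    """Return (t1, t2, t3) for the assumed curve:
    t1 = maximum acceleration, t2 = inflection, t3 = maximum deceleration."""
    t1 = (math.log(b) - PHI) / c
    t2 = math.log(b) / c
    t3 = (math.log(b) + PHI) / c
    return t1, t2, t3

def has_entered_outbreak(b, c):
    """Hypothetical check: first characteristic time point greater than zero."""
    t1, _, _ = characteristic_points(b, c)
    return t1 > 0

if __name__ == "__main__":
    print(characteristic_points(math.e ** 2, 0.5))
    print(has_entered_outbreak(math.e ** 2, 0.5))
```

Under these assumptions the check reduces to ln(b) > ln((3 + √5)/2), so it depends only on the fitted shape parameter b and not on the saturation level K.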
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
Fig. 1 shows a schematic diagram of a computing device 100 according to an embodiment of the invention;
Fig. 2 illustrates a flowchart of a public opinion event discovery method 200 according to an embodiment of the present invention;
Fig. 3 illustrates a schematic diagram of a public opinion event discovery apparatus 300 according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (µP), a microcontroller (µC), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, application 122 may be arranged to operate with program data 124 on an operating system. In some embodiments, the computing device 100 is configured to perform a public opinion event discovery method, and the program data 124 includes instructions for performing the method 200.
With the rapid growth in the number of internet users and the spread of devices that combine photographing, video shooting and internet access, and given the real-time, interactive and random nature of the internet, information about an emergency in any industry can rapidly trigger online public opinion once it is exposed, so automatically monitoring public opinion events is an important problem for maintaining the stability of an industry. A public sentiment event is a local or accidental problem that becomes a subject of large-scale discussion when it touches users' interests or stirs public sentiment; at that point, network rumors and irrational voices easily arouse public antagonism and become the fuse that intensifies conflict and causes harm. Automatic supervision of public sentiment events not only protects a product's word-of-mouth reputation among users but also helps build a harmonious environment for online speech. The invention analyzes a large amount of user-generated content on the Internet, where each piece of user-generated content is one document content. According to an embodiment of the present invention, when the public opinion event discovery method is executed, the computing device 100 is further connected to a data storage device (not shown in the figure). The data storage device is adapted to store a plurality of document contents, each associated with browsing times, and usually resides in a server of the enterprise's user-facing platform. The server is connected with a plurality of clients, collects the content produced by client users on the platform website together with related information, and stores the collected information in the data storage device.
The document contents stored in the data storage device differ according to the form of the interactive platform through which each enterprise faces internet users. For example, if interaction takes place on an industry forum website, the users' comments on the forum website are obtained. It should be noted that, regardless of the specific format of the document content, each document content is associated with browsing times.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, image input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media. In some embodiments, one or more programs are stored in a computer readable medium, the one or more programs including instructions for performing certain methods, such as the computing device 100 executing the public opinion event discovery method 200 according to the present invention through the instructions according to the embodiments of the present invention.
The computing device 100 is installed with a mobile app or client application that supports network file transmission and storage, including a native application, a browser such as IE, Chrome or Firefox, or an application such as WeChat or QQ, and stores various files locally, such as photos, audio, video and documents (e.g., documents in Word, PDF and other formats). The application client may run on an operating system such as Windows, macOS or Android. Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device such as a cellular telephone, a digital camera, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations.
Fig. 2 illustrates a flowchart of a public opinion event discovery method 200 according to an embodiment of the present invention. The method 200 is suitable for execution in a computing device (such as the computing device 100 described above) that is coupled to a data storage device that is adapted to store a plurality of document contents, each document content being associated with a number of views. As shown in fig. 2, the public opinion event discovery method starts in step S210.
In step S210, a plurality of document contents in a period with a predetermined time length are clustered to generate at least one document category, wherein all document contents under each document category correspond to the same event. Because document content is user-generated and accumulates as time advances, the analysis of public sentiment events is likewise based on how document content changes over time. The method is therefore run once per set period, analyzing all existing document contents in each period. The predetermined time length of the period can be set according to the service scenario; for example, if users' evaluation feedback on a given product does not change much, the period can be set to one day.
Take a user-oriented forum in the automobile industry as an example, with the period set to one day. The data produced by users before the period starts is shown in Table 1 and includes the release date, id, content and browsing times of each document content. It should be noted that the user data may be classified by the clustering method of this scheme, or the raw data may first be obtained under an existing classification of the interactive platform before clustering. For example, user-generated data on an automobile industry forum is classified by car series, so applying the scheme supervises all events under a given car series and helps service personnel obtain information in a targeted manner.
TABLE 1
Date      | id   | Document content                                  | Browsing times
2019/7/10 | 0001 | Skylight problem …                                | 4
2019/7/15 | 0002 | Whole system water leakage problem … of skylight  | 12
2019/7/15 | 0003 | es vehicle with software upgrade …                | 4
2019/7/16 | 0004 | Does es train also need to be added with price …  | 2
2019/7/17 | 0005 | Post-evacuation problem …                         | 2
2019/7/18 | 0006 | Paint with flaw …                                 | 8
User-generated content is divided into positive content and negative content, and positive content needs no supervision. Therefore, according to an embodiment of the invention, before the plurality of document contents are clustered in the period with the predetermined time length, the method further comprises: filtering the plurality of documents to remove documents that are not negative content. Removal of non-negative documents may use event model recognition and filtering methods that are well established in the art and are not described in detail here. The data after filtering is shown in Table 2.
TABLE 2
Date      | id   | Document content                                  | Browsing times
2019/7/10 | 0001 | Skylight problem …                                | 4
2019/7/15 | 0002 | Whole system water leakage problem … of skylight  | 12
2019/7/17 | 0005 | Post-evacuation problem …                         | 2
2019/7/18 | 0006 | Paint with flaw …                                 | 8
The clustering process is performed on the contents of a plurality of documents, and various existing clustering algorithms can be adopted. According to one embodiment of the invention, text vectors for document contents can be generated first, and then clustering is performed based on the similarity of the text vectors. For example, for each document content, performing word segmentation processing on the document content to obtain a plurality of segmented words, calculating a word vector and a weight of each segmented word, determining a keyword in the plurality of segmented words according to the weight, and obtaining a text vector of the document content according to the word vector and the keyword weight; and clustering according to the text vector of each document content so as to divide the document contents with high text vector similarity into the same document category.
Specifically, there are currently three main word segmentation methods in the field of NLP (Natural Language Processing): methods based on character string matching, methods based on understanding, and methods based on statistics. Word vectors are mainly trained with the continuous bag-of-words (CBOW) model or the skip-gram model among language models; for example, the CBOW model predicts the probability of a word from its context and thereby obtains the vector representation of the word.
Word weights can be extracted with the TF-IDF (Term Frequency-Inverse Document Frequency) method. TF-IDF is a statistical method that evaluates how important a word is to a document. TF (term frequency) means that the importance of a word increases in proportion to the number of times it appears in a paragraph or sentence. IDF (inverse document frequency) means that words appearing commonly across all paragraphs, such as common words and pronouns, are penalized, so that their importance decreases in inverse proportion to how frequently they appear in other paragraphs or sentences. TF-IDF is calculated as follows:
[Equation: definition of TF (image not available)]
[Equation: definition of IDF (image not available)]
TF-IDF = TF × IDF
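For reference, a commonly used formulation of the two factors is given below. It is offered as an assumption, not as the exact definition used in this document; in particular, the "+1" smoothing in the IDF denominator is only one of several conventions.

```latex
\mathrm{TF}(w, d) = \frac{n_{w,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{IDF}(w) = \log \frac{N}{1 + \left|\{\, d : w \in d \,\}\right|}, \qquad
\text{TF-IDF}(w, d) = \mathrm{TF}(w, d) \cdot \mathrm{IDF}(w)
```

Here n_{w,d} is the number of times word w appears in document content d, the sum in the TF denominator runs over all words k in d, N is the total number of document contents, and the set in the IDF denominator contains the document contents in which w appears.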
The TF-IDF score is used as the weight of each participle, a predetermined number of participles are selected as keywords in descending order of weight, and the text vector of the document content is calculated as:
$$\text{Document Vector} = \frac{1}{n}\sum_{i=1}^{n} \text{keyword}_i \cdot \text{vector}_i$$
wherein Document Vector represents the text vector of the document content, n represents the number of keywords in the document content, keyword_i represents the weight of keyword i in the document content, and vector_i represents the word vector of that keyword; the average of the products of each keyword's weight and word vector is the text vector of the document content. The plurality of document contents are clustered based on the text vectors so that document contents with high text vector similarity fall into the same document category; the invention does not limit the clustering method. Many prior-art mathematical modeling methods for research on public opinion crisis early-warning mechanisms consider only the iteration and improvement of mathematical models and parameters and do not bring the text information of public opinion content into the scope of study. Here, word vectors are trained and text vectors are calculated from the public sentiment content, so the textual features of that content are taken into account, which makes it possible to discover public sentiment events on different themes.
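The keyword weighting and text vector calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are already segmented into word lists, word vectors are supplied as a plain dictionary (in practice they would come from a trained CBOW or skip-gram model), and IDF uses the common log(N / (1 + df)) form, which the text does not confirm. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, corpus_tokens):
    """TF-IDF weight of each distinct word in doc_tokens.
    Assumes IDF = log(N / (1 + df)); other smoothing conventions exist."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus_tokens)
    weights = {}
    for word, count in counts.items():
        df = sum(1 for tokens in corpus_tokens if word in tokens)
        weights[word] = (count / total) * math.log(n_docs / (1 + df))
    return weights

def document_vector(doc_tokens, corpus_tokens, word_vectors, n_keywords):
    """Average of weight * word vector over the top n_keywords words."""
    weights = tf_idf_weights(doc_tokens, corpus_tokens)
    # Rank words that have a known vector by weight, largest first.
    ranked = sorted((w for w in weights if w in word_vectors),
                    key=lambda w: weights[w], reverse=True)
    keywords = ranked[:n_keywords]
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for word in keywords:
        for j, x in enumerate(word_vectors[word]):
            vec[j] += weights[word] * x
    n = len(keywords)
    return [v / n for v in vec] if n else vec

if __name__ == "__main__":
    # Toy corpus and two-dimensional word vectors, for illustration only.
    corpus = [["skylight", "leak", "problem"],
              ["paint", "flaw", "problem"],
              ["skylight", "noise", "problem"],
              ["engine", "problem"]]
    vectors = {"leak": [1.0, 0.0], "skylight": [0.0, 1.0], "problem": [1.0, 1.0]}
    print(document_vector(corpus[0], corpus, vectors, n_keywords=2))
```

In the toy corpus the word "problem" appears in every document, so its weight is the lowest and it is not selected as a keyword.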
According to another embodiment of the present invention, the document content further includes a document title, and after at least one document category is generated, the method further includes the step of: acquiring the associated browsing times of all the document contents in each document category, and taking the title of the document content with the highest associated browsing times as the corresponding event topic name.
Further, when the next period starts, a plurality of newly added document contents are obtained, and for each newly added document content it is judged whether the similarity between its text vector and the text vector of a document content under a currently existing document category is greater than a preset threshold. If the similarity is greater than the preset threshold, the newly added document content is assigned to that existing document category, which avoids recalculating the text similarity of all document contents at the start of every period. The judgment logic is as follows:
For an arbitrarily selected document content A of each existing document category:
    If the similarity between newly added document content B and content A is greater than a threshold a:
        put content B into the same document category set as content A
    Else:
        put content B into the non-clustered set U
For the contents in the non-clustered set U: the clustering step is performed recursively until nothing remains in U.
The threshold a can be obtained by repeated trial calculation, taking the value that gives the best division result.
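The judgment logic above can be sketched as runnable Python. The sketch makes several assumptions that the source does not fix: cosine similarity is used as the similarity measure, the "arbitrary" content A is simply the first member of each category, and the repeated step on the non-clustered set U is implemented as a greedy pass in which the first leftover document seeds a new category. The names `cosine_similarity` and `assign_new_documents` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def assign_new_documents(categories, new_vectors, threshold):
    """Assign each new text vector to an existing category or start a new one.

    categories: list of lists of text vectors, one inner list per category.
    """
    unclustered = []
    for vec in new_vectors:
        for members in categories:
            representative = members[0]  # content A: here simply the first member
            if cosine_similarity(vec, representative) > threshold:
                members.append(vec)
                break
        else:
            unclustered.append(vec)  # the non-clustered set U
    # Repeat the step on U until it is empty: the first leftover seeds a new
    # category, and the remaining leftovers are compared against that seed.
    while unclustered:
        seed, rest = unclustered[0], unclustered[1:]
        new_category = [seed]
        unclustered = []
        for vec in rest:
            if cosine_similarity(vec, seed) > threshold:
                new_category.append(vec)
            else:
                unclustered.append(vec)
        categories.append(new_category)
    return categories

# Hypothetical 2-dimensional text vectors.
existing = [[[1.0, 0.0]]]
result = assign_new_documents(existing, [[0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], 0.8)
```

Because only the new documents are compared against one representative per category, the cost per period grows with the number of new documents rather than with the whole corpus.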
Next, in step S220, based on the browsing times associated with the document content, the browsing times corresponding to the target event are obtained, where the target event is an event corresponding to any document category in the at least one document category.
The associated browsing times of all document contents in each document category are obtained and added together to give the browsing times corresponding to the target event. For example, performing a K-Means-like clustering operation on the text vectors of the document contents yields the event results shown in Table 3, where each column lists the browsing times associated with the document contents under that document category.
TABLE 3
Figure BDA0002577697230000101
Figure BDA0002577697230000111
The browsing times corresponding to each target event, accumulated up to the start of the current processing period, are shown in Table 4.
TABLE 4
Document category 1 | Document category 2 | Document category 3 | Document category 4 | Document category 5
190 | 124 | 130 | 41 | 156
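As an illustration of how the per-event browsing times and the event topic name can be derived from clustered documents, the following Python sketch sums the views in each document category and picks the title of the most-viewed document. The record layout (`category`, `title`, `views`), the function name `summarize_events`, and the sample data are hypothetical and are not taken from Table 3.

```python
from collections import defaultdict

def summarize_events(documents):
    """Return {category: (total_views, title_of_most_viewed_document)}."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc["category"]].append(doc)
    summary = {}
    for category, docs in grouped.items():
        # Browsing times of the event: sum over all documents in the category.
        total_views = sum(doc["views"] for doc in docs)
        # Event topic name: title of the document with the most views.
        top_doc = max(docs, key=lambda doc: doc["views"])
        summary[category] = (total_views, top_doc["title"])
    return summary

# Hypothetical documents that have already been assigned to categories.
documents = [
    {"category": 1, "title": "Brake noise after service", "views": 120},
    {"category": 1, "title": "Brakes squeal", "views": 70},
    {"category": 2, "title": "Paint peeling", "views": 41},
]
summary = summarize_events(documents)
```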
Subsequently, in step S230, growth curve fitting is performed on the historical browsing times of the target event. The growth curve is established according to the rules of public sentiment propagation and includes an outbreak period and a quiet period. The start of the outbreak period is the transition from the stage in which the public sentiment event has just occurred and the number of viewers rises slowly from zero to the stage in which the number of viewers increases rapidly. The start of the quiet period is the transition, after the event has broken out, from the stage of rapid increase to the stage in which the number of viewers increases slowly until it is nearly constant. Because the evolution of every public sentiment event is described through its browsing times, there is no need to build a separate simulation model for each public sentiment event.
Further, the growth curve is a Gompertz curve. The Gompertz curve fits the growth of a biological population well and essentially describes how a life cycle develops: in the initial stage the quantity increases slowly; in the development stage growth gradually accelerates; after the mature stage is reached, growth slows, gradually approaches a limit value, and finally stops. This pattern matches the way a public sentiment event spreads quickly on the Internet after an emergency occurs. If the responsible business personnel do not take effective measures at this time, the event can quickly break out into a network hotspot and its negative effect increases further.
Specifically, the Gompertz curve is fitted to the number of viewers as follows:
(1) df/dt = k(f) * f, where f is a continuous differentiable function of the number of viewers over time t, and k(f) is the growth rate of the function f.
The number of views corresponding to the target event is equivalent to the number of people involved in the event. Since the number of people involved in an event changes with time as a continuous trend, it is reasonable to assume that f is continuous and differentiable. The rate of change of the number of viewers, df/dt, is positively related to both the number of viewers f and the growth rate k(f): the larger f and k(f) are, the larger df/dt is.
(2) k(f) = r(1 - f/K), where K is the upper limit of the number of viewers of the public sentiment event and r is the maximum growth rate of the number of viewers; k(f) is proportional to the remaining fraction of people not yet aware of the event, (1 - f/K).
(3) Combining (1) and (2) to solve a differential equation to obtain
Figure BDA0002577697230000121
where f_0 is the initial number of viewers.
(4) When f_0 << K, we obtain
Figure BDA0002577697230000122
where
Figure BDA0002577697230000123
b = e^(-rt)
(5) Taking derivatives of f(t) up to the third order gives the characteristic points of the Gompertz curve:
Figure BDA0002577697230000124
Figure BDA0002577697230000125
Figure BDA0002577697230000126
Setting
Figure BDA0002577697230000127
the first characteristic time point is obtained as
Figure BDA0002577697230000128
and the second characteristic time point as
Figure BDA0002577697230000129
Here t_1 is the time at which the increase in the rate of change of the number of viewers of the target event reaches its maximum, and t_2 is the time at which the decrease in that rate of change reaches its maximum.
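For readers who want to follow steps (1) to (5) symbolically, the derivation can be sketched as below. It is a reconstruction under stated assumptions, not a transcription of the formula images: it assumes the model in (1) and (2) reads df/dt = r f (1 - f/K), which integrates to a curve of logistic form, and it writes a = K/f_0 - 1 and u = a e^{-rt} as shorthand chosen here. The resulting expressions, with ln(2 + sqrt(3)) approximately 1.317, agree with the values listed in Table 5; for example, r = 0.551 and a = 0.551 give t_1 of about -3.47.

```latex
\begin{aligned}
\frac{df}{dt} &= r f\left(1-\frac{f}{K}\right), \quad f(0)=f_0
  \;\Longrightarrow\; f(t)=\frac{K}{1+a e^{-rt}}, \quad a=\frac{K}{f_0}-1,\\
u &:= a e^{-rt}, \qquad
f'(t)=\frac{K r u}{(1+u)^2}, \qquad
f''(t)=\frac{K r^2 u (u-1)}{(1+u)^3}, \qquad
f'''(t)=\frac{K r^3 u (u^2-4u+1)}{(1+u)^4},\\
f'''(t)&=0 \iff u = 2\pm\sqrt{3}
  \;\Longrightarrow\;
  t_1=\frac{\ln a-\ln(2+\sqrt{3})}{r}, \qquad
  t_2=\frac{\ln a+\ln(2+\sqrt{3})}{r}.
\end{aligned}
```

The root u = 2 + sqrt(3) is reached first and is where f'' attains its maximum, and u = 2 - sqrt(3) is where f'' attains its minimum, which matches the interpretation of t_1 and t_2 given above; the second expression uses ln(2 - sqrt(3)) = -ln(2 + sqrt(3)).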
Finally, in step S240, when the first characteristic time point of the fitted growth curve is greater than zero, the target event is identified as a public sentiment event and is determined to have entered the outbreak period; at this point the increase in the rate of change of the number of viewers of the target event reaches its maximum.
Suppose the historical browsing times of the target event are as shown in Table 5, the growth curve is fitted as a Gompertz curve, and the predetermined period length is 1 day, so that dt = 1 day.
(1) The growth rate r of the event is obtained by comparing the accumulated browsing times of the target event on two consecutive days; if the growth rate is greater than 1, r is capped at the upper limit of 1;
(2) The value a is obtained by comparing the current accumulated browsing times of the target event with the initial browsing times: a = current number of viewers / initial number of viewers - 1.
(3) The first characteristic time point is calculated according to the formula
t_1 = (ln a - ln(2 + √3)) / r ≈ (ln a - 1.317) / r
TABLE 5
Date | Number of views | r | a | First characteristic time point t_1
2019/7/9 | 366 | 0.551 | 0.551 | -3.473
2019/7/10 | 564 | 0.541 | 1.390 | -1.826
2019/7/11 | 580 | 0.028 | 1.458 | -33.140
2019/7/12 | 664 | 0.145 | 1.814 | -4.983
2019/7/13 | 690 | 0.039 | 1.924 | -16.924
2019/7/14 | 706 | 0.023 | 1.992 | -27.085
2019/7/15 | 1548 | 1.000 | 5.559 | 0.334
2019/7/16 | 2932 | 0.894 | 11.424 | 1.251
It can be seen that the first characteristic time point on 2019/7/15 is greater than zero, indicating that the event meets the outbreak-onset characteristic of the fitted growth curve; the event is therefore identified as a public sentiment event that has entered the outbreak period. Specifically, a public sentiment event entering the outbreak period means the transition from the stage in which the event has just occurred and the number of viewers rises slowly from zero to the stage in which the number of viewers increases rapidly. This time node is the best moment for the platform and users to compete for the initiative and the right to speak: if the platform seizes the initiative over online public sentiment during this period, the later development of public sentiment is easy to control; conversely, if users seize the initiative, public sentiment spreads severely. According to an embodiment of the invention, in an actual business scenario, when an event enters the outbreak period, a public sentiment early-warning short message and email are sent to the relevant personnel to remind them to pay attention to the event, and they can review all related contents under the event to decide how to handle it. The relevant personnel may be the public opinion handling group of the platform whose data is analyzed; alternatively, the event may be classified by its keywords under one of the platform's departments and pushed only to the personnel of that department, so that public sentiment events are handled more accurately and efficiently.
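The worked example in Table 5 can be reproduced with a short Python sketch. Several points are assumptions: the initial view count of 236 is not stated in the source and is inferred from the first table row (a = 0.551 at 366 views), the first characteristic time point is taken as t_1 = (ln a - ln(2 + sqrt(3))) / r, and the function name `first_characteristic_points` is hypothetical. The sketch also assumes that the cumulative view counts are strictly increasing and exceed the initial count, so that r and a stay positive.

```python
import math

LN_2_PLUS_SQRT3 = math.log(2 + math.sqrt(3))  # about 1.317

def first_characteristic_points(initial_views, daily_views, r_cap=1.0):
    """Return (r, a, t1) for each day of cumulative view counts."""
    rows = []
    previous = initial_views
    for views in daily_views:
        # Day-over-day growth rate, capped at the upper limit.
        r = min((views - previous) / previous, r_cap)
        # Ratio of current to initial views, minus one.
        a = views / initial_views - 1
        t1 = (math.log(a) - LN_2_PLUS_SQRT3) / r
        rows.append((r, a, t1))
        previous = views
    return rows

# Cumulative views from Table 5, 2019/7/9 to 2019/7/16.
views = [366, 564, 580, 664, 690, 706, 1548, 2932]
rows = first_characteristic_points(236, views)  # 236 is an inferred assumption
# Index of the first day on which t1 > 0, i.e. the detected outbreak onset.
outbreak_day = next(i for i, (_, _, t1) in enumerate(rows) if t1 > 0)
print(outbreak_day)  # 6, i.e. 2019/7/15
```

With r capped at 1, this sketch yields t_1 of about 0.40 for 2019/7/15, whereas the table lists 0.334, which appears to correspond to the uncapped ratio of about 1.19. The sign of t_1, and therefore the detected onset, is the same in both cases.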
According to another embodiment of the present invention, when the second characteristic time points of the fitted growth curve are all greater than a preset value over a predetermined number of consecutive periods, it is determined that the public sentiment event has entered a quiet period; at this point the decrease in the rate of change of the number of viewers of the target event reaches its maximum.
Further, if the historical browsing times of the target event are as shown in Table 6, the second characteristic time point is calculated according to the formula
t_2 = (ln a + ln(2 + √3)) / r ≈ (ln a + 1.317) / r
For example, with a predetermined period of 1 day, if the ending time of the public sentiment event predicted in each of 7 consecutive periods (i.e. 7 days) is greater than the preset value of 100 days, it is determined that the public sentiment event has entered the quiet period. It should be noted that the predetermined number of periods and the preset value can be adjusted according to repeated trial calculations or the experience of business personnel.
TABLE 6
Date | Number of views | r | a | Second characteristic time point t_2
2019/7/9 | 366 | 0.551 | 0.551 | -3.473
2019/8/8 | 1301214 | 0.125 | 5512.619 | 79.698
2019/8/9 | 1396164 | 0.073 | 5914.949 | 137.072
2019/8/10 | 1444104 | 0.034 | 6118.085 | 292.279
2019/8/11 | 1486116 | 0.029 | 6296.102 | 345.958
2019/8/12 | 1530206 | 0.030 | 6482.924 | 340.229
2019/8/13 | 1572780 | 0.028 | 6663.322 | 363.784
2019/8/14 | 1615456 | 0.027 | 6844.153 | 373.998
2019/8/15 | 1650404 | 0.022 | 6992.237 | 470.081
It can be seen that over the 7 consecutive days up to 2019/8/15, the predicted ending time of the public sentiment event is greater than the preset value of 100 days on every day, so the public sentiment event has entered the quiet period. At this point, short messages and emails can be sent to remind the relevant personnel that the public sentiment event has entered a quiet period and that they may stop following it or take other corresponding measures; the invention does not limit the handling measures.
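The quiet-period check on the Table 6 data can be sketched in the same way. The assumptions are the same as for the outbreak check: the initial view count of 236 is inferred from the 2019/7/9 row rather than stated, the second characteristic time point is taken as t_2 = (ln a + ln(2 + sqrt(3))) / r, and the names `second_characteristic_point` and `entered_quiet_period` are hypothetical. The preset value of 100 days and the 7 consecutive periods come from the example in the text.

```python
import math

LN_2_PLUS_SQRT3 = math.log(2 + math.sqrt(3))  # about 1.317

def second_characteristic_point(initial_views, previous_views, views):
    """t2 for one period, from cumulative views of two consecutive periods."""
    r = (views - previous_views) / previous_views
    a = views / initial_views - 1
    return (math.log(a) + LN_2_PLUS_SQRT3) / r

def entered_quiet_period(t2_values, preset=100.0, periods=7):
    """True if the most recent `periods` values of t2 all exceed `preset`."""
    recent = t2_values[-periods:]
    return len(recent) == periods and all(t2 > preset for t2 in recent)

# Cumulative views from Table 6, 2019/8/8 to 2019/8/15.
views = [1301214, 1396164, 1444104, 1486116, 1530206, 1572780, 1615456, 1650404]
# t2 for 2019/8/9 to 2019/8/15; 236 is the inferred initial view count.
t2_values = [
    second_characteristic_point(236, prev, cur)
    for prev, cur in zip(views, views[1:])
]
quiet = entered_quiet_period(t2_values)
```

Requiring several consecutive periods above the preset value, rather than a single one, keeps a one-day dip in growth from being reported as the end of the event.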
According to the technical scheme of the invention, a large amount of user-produced content on the Internet is analyzed, where one piece of produced content is one document content. A plurality of document contents are clustered within one period to obtain at least one document category, and all document contents under each document category correspond to the same event, which supports the discovery of public sentiment with different themes across multiple business fields. The browsing times corresponding to a target event are then obtained from the browsing times associated with the document contents under the document category corresponding to that event, and a growth curve, established according to the rules of public sentiment propagation and including an outbreak period and a quiet period, is fitted to the historical browsing times of the target event. Whether the event has entered the outbreak period is judged from the first characteristic time point of the fitted growth curve; events that have not entered the outbreak period need no handling, so public sentiment events are supervised automatically.
Here, a public sentiment event entering the outbreak period means the transition from the stage in which the event has just occurred and the number of viewers rises slowly from zero to the stage in which the number of viewers increases rapidly; a public sentiment event entering the quiet period means the transition, after the event has broken out, from the stage of rapid increase to the stage in which the number of viewers increases slowly until it is nearly constant. Because the basis for discovering public sentiment events and predicting their evolution is quantified uniformly, the time nodes of the different stages can be predicted without building a separate simulation model for each event, each industry can establish different response procedures for the different stages, and dynamic, rapid decision support is provided.
Fig. 3 illustrates a schematic diagram of a public opinion event discovery apparatus 300 according to an embodiment of the present invention. As shown in fig. 3, the public sentiment event discovering apparatus includes a clustering module 310, a browsing times obtaining module 320, a growth curve fitting module 330, and a public sentiment event identifying module 340. The public opinion event discovery device is connected with a data storage device, and the data storage device is suitable for storing a plurality of document contents, and each document content is associated with browsing times. The data storage device typically resides in an enterprise user-oriented platform server. The server is connected with a plurality of clients, collects the content and related information produced by the client users on the platform website, and stores the collected information in the data storage device. The public opinion event discovery device analyzes a large amount of user production contents on the Internet, and one production content is a document content.
The clustering module 310 is configured to perform clustering processing on a plurality of document contents at periods of a predetermined length to generate at least one document category, where all document contents in each document category correspond to the same event. Since document content is user-produced content that accumulates as time advances, the analysis of public sentiment events is likewise based on how document content changes over time. The analysis is therefore carried out period by period, with all existing document contents analyzed in each period. The predetermined length of the period can be set according to the business scenario; for example, if users' evaluation feedback on a certain product does not change much, the period length can be set to one day.
The user-produced content is divided into positive content and negative content. Positive content does not need to be supervised, so the clustering module 310 is further configured to filter the plurality of document contents and remove those with non-negative content before clustering them at periods of a predetermined length. Various existing clustering algorithms can be used to cluster the document contents.
The document content further includes a document title. After at least one document category is generated, the clustering module 310 is further configured to obtain the associated browsing times of all document contents in each document category and to use the title of the document content with the highest associated browsing times as the corresponding event topic name, which helps the relevant personnel quickly grasp the key point of a public opinion event.
The browsing frequency obtaining module 320 is configured to obtain, according to the browsing frequency associated with the document content, a browsing frequency corresponding to a target event, where the target event is an event corresponding to any document category of the at least one document category. Specifically, the associated browsing times of all document contents in each document category are obtained and added to obtain the browsing times corresponding to the target event.
The growth curve fitting module 330 is used for performing growth curve fitting on the historical browsing times of the target event. The growth curve is established according to the rules of public sentiment propagation and includes an outbreak period and a quiet period. The start of the outbreak period is the transition from the stage in which the public sentiment event has just occurred and the number of viewers rises slowly from zero to the stage in which the number of viewers increases rapidly. The start of the quiet period is the transition, after the event has broken out, from the stage of rapid increase to the stage in which the number of viewers increases slowly until it is nearly constant. Because the evolution of every public sentiment event is described through its browsing times, there is no need to build a separate simulation model for each public sentiment event.
The public sentiment event recognition module 340 is configured to identify the target event as a public sentiment event and determine that it has entered the outbreak period when the first characteristic time point of the fitted growth curve is greater than zero. The module 340 is further configured to determine that the public sentiment event has entered the quiet period when the second characteristic time points of the fitted growth curve are all greater than a preset value over a predetermined number of consecutive periods; at this point the decrease in the rate of change of the number of viewers of the target event reaches its maximum.
A8, the method of any one of A1-A7, wherein the growth curve is a Gompertz curve.
A9, the method according to any one of A1-A8, wherein the clustering the plurality of document contents for a predetermined period of time to generate at least one document category comprises:
for each document content in the plurality of document contents, performing word segmentation processing on the document content to obtain a plurality of words, and calculating a word vector of each word;
calculating the weight of each participle, and determining a keyword based on the weight of the participle;
acquiring a text vector of the document content according to the word vector and the weight of the keyword;
and clustering according to the text vector of each document content so as to divide the document contents with high text vector similarity into the same document category.
A10, the method as in A9, wherein the calculating the weight of each participle comprises the steps of:
and evaluating the importance degree of each participle to the document content by using a TF-IDF statistical method, wherein the higher the importance degree is, the higher the weight of the participle is.
A11, the method as in A9 or A10, wherein the clustering process is performed on the document contents for a predetermined period of time to generate at least one document category, further comprising:
when the next period starts, acquiring a plurality of newly added document contents, and judging whether the similarity between the text vector of each newly added document content and the text vector of a document content under a currently existing document category is greater than a preset threshold value;
if the similarity is greater than the preset threshold value, the newly added document content is divided into the existing document category.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (10)

1. A public opinion event discovery method, the method being adapted to be executed in a computing device, the computing device being connected to a data storage adapted to store a plurality of document contents, each document content being associated with a number of views, the method comprising the steps of:
clustering the plurality of document contents in a period with preset time to generate at least one document category, wherein all the document contents in each document category correspond to the same event;
acquiring browsing times corresponding to a target event based on browsing times associated with document contents, wherein the target event is an event corresponding to any document category in the at least one document category;
carrying out growth curve fitting on the historical browsing times of the target event;
and when the first characteristic time point of the growth curve obtained by fitting is greater than zero, identifying the target event as a public sentiment event, and determining that the public sentiment event enters an outbreak period.
2. The method of claim 1, further comprising:
and when second characteristic time points of the fitted growth curve are all larger than a preset value in a preset number of continuous periods, determining that the public sentiment event enters a calm period.
3. The method of claim 1, wherein before said clustering said plurality of documents at predetermined time intervals, further comprising the steps of:
filtering the plurality of documents to remove documents with non-negative content.
4. The method according to any one of claims 1-3, wherein all document contents under each document category correspond to the same event, the document contents further include document titles, and after generating at least one document category, the method further comprises the steps of:
and acquiring the associated browsing times of all the document contents in each document category, and taking the title of the document content with the highest associated browsing time as the corresponding event topic name.
5. The method according to any one of claims 1-4, wherein the obtaining of the browsing times corresponding to the target event based on the browsing times associated with the document content comprises:
and acquiring the associated browsing times of all the document contents in each document type, and adding to obtain the browsing times corresponding to the target event.
6. The method of any one of claims 1 to 5, wherein the public sentiment event entering the outbreak period means that the public sentiment event enters a period from a period when a number of browsed persons is slowly changed from zero to a period when the number of browsed persons is rapidly increased.
7. The method of any one of claims 1-6, wherein the public sentiment event entering the calm period means that the number of browsing people is from a period of rapid increase after the public sentiment event outbreak to a period of slow increase until the number of browsing people approaches invariable.
8. A public opinion event discovery apparatus, the apparatus being connected to a data storage device, the data storage device being adapted to store a plurality of document contents, each document content being associated with a number of views, the apparatus comprising:
the clustering module is used for clustering the plurality of document contents in a period with preset time length to generate at least one document category, wherein all the document contents in each document category correspond to the same event;
the browsing frequency acquisition module is used for acquiring the browsing frequency corresponding to a target event according to the browsing frequency associated with the document content, wherein the target event is an event corresponding to any one of the at least one document category;
the growth curve fitting module is used for performing growth curve fitting on the historical browsing times of the target event;
and the public sentiment event identification module is used for identifying the target event as a public sentiment event when the first characteristic time point of the growth curve obtained by fitting is greater than zero, and determining that the public sentiment event enters an outbreak period.
9. A computing device, comprising:
at least one processor; and
a memory storing program instructions, wherein the program instructions are configured to be adapted to be executed by the at least one processor, the program instructions comprising instructions for performing the public opinion event discovery method of claims 1-7.
10. A readable storage medium storing program instructions which, when read and executed by a client, cause the client to perform the method of any one of claims 1-7.
Application CN202010658727.4A, priority date 2020-07-09, filing date 2020-07-09: Public opinion event discovery method and device and computing equipment (pending), published as CN111931032A (en)
Publication number CN111931032A (en), publication date 2020-11-13


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857869A (en) * 2019-01-26 2019-06-07 北京工业大学 A kind of hot topic prediction technique based on Ap increment cluster and network primitive
CN110399478A (en) * 2018-04-19 2019-11-01 清华大学 Event finds method and apparatus
CN110750636A (en) * 2018-07-04 2020-02-04 百度在线网络技术(北京)有限公司 Network public opinion information processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LAN Yuexin et al.: "Research on the heat model of network public opinion in public crisis events", Information Science (情报科学), vol. 34, no. 02, pages 32-36 *

Similar Documents

Publication Publication Date Title
US10977447B2 (en) Method and device for identifying a user interest, and computer-readable storage medium
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN108595660A (en) Method, device, storage medium and equipment for generating label information of multimedia resources
US20140095150A1 (en) Emotion identification system and method
CN110929145B (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
KR20170034206A (en) Apparatus and Method for Topic Category Classification of Social Media Text based on Cross-Media Analysis
CN107797982A (en) Method, apparatus and equipment for identifying text type
WO2017198031A1 (en) Semantic parsing method and apparatus
CN106294330B (en) Scientific and technological text selection method and device
CN110502742B (en) Complex entity extraction method, device, medium and system
JP2021501402A (en) Ranking of documents based on semantic abundance
US20200278976A1 (en) Method and device for evaluating comment quality, and computer readable storage medium
CN109871433B (en) Method, device, equipment and medium for calculating relevance between document and topic
CN111538931A (en) Big data-based public opinion monitoring method and device, computer equipment and medium
CN109344246B (en) Electronic questionnaire generating method, computer readable storage medium and terminal device
CN108763202B (en) Method, device and equipment for identifying sensitive text and readable storage medium
WO2019085332A1 (en) Financial data analysis method, application server, and computer readable storage medium
CN117271736A (en) Question-answer pair generation method and system, electronic equipment and storage medium
CN113806486B (en) Method and device for calculating long text similarity, storage medium and electronic device
CN110019556B (en) Topic news acquisition method, device and equipment thereof
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN107092679B (en) Feature word vector obtaining method and text classification method and device
CN111931032A (en) Public opinion event discovery method and device and computing equipment
CN111178038B (en) Document similarity recognition method and device based on latent semantic analysis
CN114117057A (en) Keyword extraction method of product feedback information and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination