WO2019136920A1 - Presentation method for visualization of topic evolution, application server, and computer readable storage medium - Google Patents

Publication number
WO2019136920A1
Authority
WO
WIPO (PCT)
Prior art keywords
topics
topic
cluster
keywords
stream
Application number
PCT/CN2018/090694
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
吴天博
黄章成
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2019136920A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • the present application relates to the field of image processing technologies, and in particular to a visual presentation method for topic evolution, an application server, and a computer readable storage medium.
  • the present application proposes a visual presentation method for topic evolution, an application server, and a computer readable storage medium, which can visually display the evolution of an event, so that the user can quickly understand and analyze the evolution of the entire event.
  • the present application provides an application server that includes a memory and a processor, where the memory stores a visual presentation system for topic evolution that can be run on the processor, and the visual presentation system for topic evolution implements the following steps when executed by the processor:
  • the present application further provides a visual presentation method for topic evolution, which is applied to an application server, and the method includes:
  • the present application further provides a computer readable storage medium storing a visual presentation system for topic evolution, the visual presentation system for topic evolution being executable by at least one processor to cause the at least one processor to perform the steps of the visual presentation method for topic evolution described above.
  • the visual presentation method for topic evolution, the application server, and the computer readable storage medium proposed by the present application first extract the topics of a plurality of text materials related to the same event and determine the association relationship between the topics to establish a topic stream; second, select a plurality of first topics containing important events from the topics; next, extract keywords of each of the first topics and determine the association relationship of those keywords; and finally, add the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the text materials.
  • in this way, topics can be mined from time-ordered social events, and the evolution trend of an event can be visualized as a topic stream over time, giving users a better understanding of how the topics evolve and of the major events. This avoids topic drift caused by topic association, and helps users understand the deeper meaning of the topics and avoid misunderstandings or wrong decisions.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the application server of the present application;
  • FIG. 2 is a schematic diagram of the program modules of a first embodiment of the visual presentation system for topic evolution of the present application;
  • FIG. 3 is a schematic diagram of the program modules of a second embodiment of the visual presentation system for topic evolution of the present application;
  • FIG. 4 is a schematic flowchart of a first embodiment of the visual presentation method for topic evolution of the present application;
  • FIG. 5 is a schematic flowchart of a second embodiment of the visual presentation method for topic evolution of the present application.
  • first extraction module 101; screening module 102
  • second extraction module 103; generation module 104; labeling module 105
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the application server 2 of the present application.
  • the application server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to each other through a system bus. Note that FIG. 1 only shows the application server 2 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
  • the application server 2 may be a computing device such as a rack server, a blade server, or a tower server.
  • the application server 2 may be a stand-alone server or a server cluster composed of multiple servers.
  • the memory 11 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read only memory (ROM), an electrically erasable programmable read only memory (EEPROM), a programmable read only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be an internal storage unit of the application server 2, such as a hard disk or memory of the application server 2.
  • the memory 11 may also be an external storage device of the application server 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the application server 2.
  • the memory 11 can also include both the internal storage unit of the application server 2 and its external storage device.
  • the memory 11 is generally used to store the operating system installed on the application server 2 and various types of application software, such as the program code of the visual presentation system 100 for topic evolution. Further, the memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 12 is typically used to control the overall operation of the application server 2, such as performing control and processing related to data interaction or communication with the terminal device 1.
  • the processor 12 is configured to run program code or process data stored in the memory 11, for example to run the visual presentation system 100 for topic evolution.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the application server 2 and other electronic devices.
  • the present application proposes a visual presentation system 100 for topic evolution.
  • FIG. 2 is a program module diagram of the first embodiment of the visual presentation system 100 for topic evolution of the present application.
  • the visual presentation system 100 for topic evolution includes a series of computer program instructions stored in the memory 11; when the computer program instructions are executed by the processor 12, the visual presentation operations for topic evolution of the embodiments of the present application can be implemented.
  • the visual presentation system 100 for topic evolution may be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the visual presentation system 100 for topic evolution is divided into a first extraction module 101, a screening module 102, a second extraction module 103, and a generation module 104. Specifically:
  • the first extraction module 101 is configured to extract the topics of a plurality of text materials related to the same event, and determine the association relationship between the topics to establish a topic stream.
  • the text material may be online news text.
  • the first extraction module 101 may extract a plurality of news texts related to the same event by accessing the network.
  • specifically, a plurality of news texts related to the event may be searched for and extracted from the network by inputting keywords of the event (for example, the place where the event occurs, the main persons involved, the event itself, etc.), and the topic of each news text is then extracted.
  • the first extraction module 101 may acquire elements such as the persons, places, and events of the current news text, and generate an event summary from those elements as the topic of the news text.
  • the first extraction module 101 is further configured to preprocess the extracted plurality of text materials.
  • the pre-processing may include: word segmentation, simplification, replacing ambiguous words, and removing stop words, low-frequency words, numbers, punctuation marks, and the like.
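  • The pre-processing step can be illustrated with a minimal sketch. It assumes English text split by a regular expression, whereas the source most likely refers to Chinese word segmentation; the stop-word list, the `min_count` threshold, and the function name `preprocess` are hypothetical, and simplification and ambiguous-word replacement are not covered.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller, language-specific one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def preprocess(text, min_count=1, stop_words=STOP_WORDS):
    """Tokenize text and drop stop words, numbers, punctuation and low-frequency words."""
    # Keeping only alphabetic runs removes digits and punctuation in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    # Words seen fewer than min_count times are treated as low-frequency words.
    return [t for t in tokens if counts[t] >= min_count]
```

  • With `min_count=1` no word is dropped for being rare; raising the threshold trades recall for less noisy topics.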
  • the first extraction module 101 may model each topic by a hierarchical Dirichlet process (HDP), recording the i-th text item that arrives at time t together with the cluster in which it is located. If the cluster labels of an item at two time points are inconsistent, the topic of that item can be considered to have changed, so that two quantities can be calculated to derive the splitting and merging of topics.
  • the two quantities are ratios of the elements from cluster s found in cluster r between time t-1 and time t.
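  • The formulas for the two quantities are not available in this text, so the following sketch rests on an assumed definition: for each pair of clusters (s at time t-1, r at time t), one ratio is the share of r that came from s (evidence of merging) and the other is the share of s that went to r (evidence of splitting). The function name `flow_ratios` and the dictionary inputs are hypothetical.

```python
from collections import Counter

def flow_ratios(prev_labels, curr_labels):
    """prev_labels / curr_labels map a document id to its cluster at t-1 and at t."""
    shared = prev_labels.keys() & curr_labels.keys()
    # Count how many documents moved from cluster s (at t-1) to cluster r (at t).
    flow = Counter((prev_labels[d], curr_labels[d]) for d in shared)
    out_total, in_total = Counter(), Counter()
    for (s, r), n in flow.items():
        out_total[s] += n
        in_total[r] += n
    # Assumed definitions: share of r coming from s, and share of s going to r.
    merge_ratio = {(s, r): n / in_total[r] for (s, r), n in flow.items()}
    split_ratio = {(s, r): n / out_total[s] for (s, r), n in flow.items()}
    return merge_ratio, split_ratio
```

  • A cluster r that receives sizeable shares from several clusters suggests a merge; a cluster s that sends sizeable shares to several clusters suggests a split.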
  • the generation and end of topics can be detected by using a hash table.
  • each topic has a unique storage location in the hash table, which is used to detect the generation and end of that topic.
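  • One way to read the hash-table idea is sketched below, using a Python dictionary as the hash table. Each topic id owns one entry holding the first and last time step at which it was observed. The input layout and the names `track_lifetimes` and `births_and_ends` are assumptions for illustration.

```python
def track_lifetimes(topics_by_time):
    """topics_by_time: list of sets of topic ids, one set per time step."""
    lifetimes = {}  # topic id -> [first_seen, last_seen]
    for t, topics in enumerate(topics_by_time):
        for topic in topics:
            if topic in lifetimes:
                lifetimes[topic][1] = t
            else:
                lifetimes[topic] = [t, t]
    return lifetimes

def births_and_ends(topics_by_time):
    """Return the generation step of every topic and the end step of finished topics."""
    lifetimes = track_lifetimes(topics_by_time)
    last_step = len(topics_by_time) - 1
    births = {topic: first for topic, (first, _) in lifetimes.items()}
    # A topic still present at the final step has not ended yet.
    ends = {topic: last for topic, (_, last) in lifetimes.items() if last < last_step}
    return births, ends
```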
  • the first extraction module 101 may sort the topics of each text material according to the posting time of each text material.
  • the topic stream established by the first extraction module 101 represents the evolution of a plurality of topics over time, and the height of the topic stream at a given time may represent the number of documents belonging to the topic.
  • a topic in the topic stream can also split into several branches, and several branches can also merge into one topic.
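  • As a minimal sketch of how stream heights could be derived, the snippet below counts documents per topic at each time step. The `(time_step, topic)` input pairs and the function name `stream_heights` are hypothetical; rendering of the stream itself is not shown.

```python
from collections import Counter

def stream_heights(docs):
    """docs: iterable of (time_step, topic) pairs, one pair per document."""
    heights = {}
    for t, topic in docs:
        # The height of a topic at time t is the number of its documents at t.
        heights.setdefault(t, Counter())[topic] += 1
    return {t: dict(counts) for t, counts in sorted(heights.items())}
```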
  • the screening module 102 is configured to select a plurality of first topics containing important events from the topics.
  • the first topics are preferably topics that have split or merged.
  • the splitting and merging of topics can be quantified by scores.
  • an information entropy algorithm can be used to calculate the scores.
  • the score of a merged topic can be calculated by a formula in which:
  • R(r, t) is the ranking score of cluster r at time t;
  • N_r is the number of elements flowing into cluster r.
  • the score of a split topic can be calculated by a formula in which:
  • R(s, t) is the ranking score of cluster s at time t;
  • N_s is the number of elements flowing into cluster r.
  • the screening module 102 may sort the topics by their calculated scores (for example, in descending order) and select the top-ranked topics as the first topics containing important events. For example, the screening module 102 selects the ten highest-scoring topics as the first topics.
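  • The scoring formulas are not available in this text. As an assumption, the sketch below scores a topic by the Shannon entropy of its inflow counts (for a merge) or outflow counts (for a split), then keeps the k highest-scoring topics. The names `entropy_score` and `top_topics` are hypothetical and this is not presented as the patent's own formula.

```python
import math

def entropy_score(counts):
    """Shannon entropy (in bits) of a list of element counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def top_topics(flows, k=10):
    """flows: topic -> counts of elements per source (merge) or target (split) cluster."""
    scored = {topic: entropy_score(counts) for topic, counts in flows.items()}
    # Sort in descending order of score and keep the first k topics.
    return sorted(scored, key=lambda topic: scored[topic], reverse=True)[:k]
```

  • Under this assumption a topic fed evenly by many clusters scores higher than one fed by a single cluster, which matches the preference for topics that have split or merged.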
  • the second extraction module 103 is configured to extract keywords of each of the first topics, and determine the association relationship of the keywords of each of the first topics.
  • the second extraction module 103 may extract the keywords of each of the first topics using a TF-IDF algorithm.
  • the TF-IDF algorithm can be used to assess how important a word is to a topic text. The importance of a word increases in proportion to the number of times it appears in the text, and decreases with how often it appears across the other texts.
  • the TF-IDF value of a word is obtained from its term frequency (TF) and inverse document frequency (IDF); the more important the word is to the topic text, the higher its TF-IDF value.
  • the second extraction module 103 may take the words with the highest TF-IDF values as the keywords of the topic text. For example, the words whose TF-IDF values rank in the top five are used as the keywords of the first topic.
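  • The keyword selection can be sketched in plain Python as follows. It assumes each first topic is represented by one list of tokens and uses the common unsmoothed form tf × log(N / df); the source does not say which TF-IDF variant is used, and the function name `tfidf_keywords` is hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(topic_docs, topic, k=5):
    """topic_docs: topic -> list of tokens. Returns the k highest TF-IDF words of `topic`."""
    tokens = topic_docs[topic]
    tf = Counter(tokens)
    n_docs = len(topic_docs)
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of topic texts that contain the word.
        df = sum(1 for toks in topic_docs.values() if word in toks)
        idf = math.log(n_docs / df)
        scores[word] = (count / len(tokens)) * idf
    # Highest score first; ties are broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

  • A word that occurs in every topic text gets an IDF of zero and therefore ranks last, regardless of how often it appears.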
  • the second extraction module 103 may determine the association relationship of the keywords of each of the first topics by a hierarchical Dirichlet process (HDP). The second extraction module 103 may further determine the association relationship of the keywords of each of the first topics in combination with the node location of each first topic in the topic stream.
  • the generating module 104 is configured to add a keyword of each of the first topics and an association relationship thereof to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials.
  • the generating module 104 may visualize keywords of each of the first topics and their associated relationships as word clouds overlapping on the topic stream.
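  • A possible data structure for the resulting context map is sketched below: each stream node of a first topic is extended with its keywords and with the keyword associations that fall inside that topic. The dictionary layout and the name `build_context_map` are assumptions, and drawing the word clouds is left to a plotting library.

```python
def build_context_map(stream_nodes, keywords, associations):
    """Attach keywords and keyword associations to the stream nodes of the first topics.

    stream_nodes: list of dicts with at least a "topic" key (for example time, height).
    keywords: topic -> list of keywords.
    associations: list of (keyword, keyword) pairs.
    """
    context_map = []
    for node in stream_nodes:
        entry = dict(node)  # copy so the input nodes are left unchanged
        entry["keywords"] = list(keywords.get(node["topic"], []))
        own = set(entry["keywords"])
        # Keep only associations whose two keywords both belong to this topic.
        entry["links"] = [(a, b) for a, b in associations if a in own and b in own]
        context_map.append(entry)
    return context_map
```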
  • the topic evolution context map can be displayed by a display module (not shown).
  • the visual presentation system 100 for topic evolution proposed by the present application first extracts the topics of a plurality of text materials related to the same event and determines the association relationship between the topics to establish a topic stream; second, selects a plurality of first topics containing important events from the topics; next, extracts keywords of each of the first topics and determines the association relationship of those keywords; and finally, adds the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the text materials.
  • in this way, topics can be mined from time-ordered social events, and the evolution trend of an event can be visualized as a topic stream over time, giving users a better understanding of how the topics evolve and of the major events. This avoids topic drift caused by topic association, and helps users understand the deeper meaning of the topics and avoid misunderstandings or wrong decisions.
  • the visual presentation system 100 for topic evolution includes a series of computer program instructions stored in the memory 11; when the computer program instructions are executed by the processor 12, the visual presentation operations for topic evolution of the embodiments of the present application can be implemented.
  • the visual presentation system 100 for topic evolution may be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 3, the visual presentation system 100 for topic evolution is divided into a first extraction module 101, a screening module 102, a second extraction module 103, a generation module 104, and a labeling module 105.
  • the program modules 101-104 are the same as in the first embodiment of the visual presentation system 100 for topic evolution of the present application, and the labeling module 105 is added to them. Specifically:
  • the first extraction module 101 is configured to extract the topics of a plurality of text materials related to the same event, and determine the association relationship between the topics to establish a topic stream.
  • the text material may be online news text.
  • the first extraction module 101 may extract a plurality of news texts related to the same event by accessing the network. Specifically, a plurality of news texts related to the event may be searched for and extracted from the network by inputting keywords of the event (for example, the place where the event occurs, the main persons involved, the event itself, etc.), and the topic of each news text is then extracted.
  • the first extraction module 101 may acquire elements such as the persons, places, and events of the current news text, and generate an event summary from those elements as the topic of the news text.
  • the first extraction module 101 is further configured to preprocess the extracted plurality of text materials.
  • the pre-processing may include: word segmentation, simplification, replacing ambiguous words, and removing stop words, low-frequency words, numbers, punctuation marks, and the like.
  • the first extraction module 101 may model each topic by a hierarchical Dirichlet process (HDP), recording the i-th text item that arrives at time t together with the cluster in which it is located. If the cluster labels of an item at two time points are inconsistent, the topic of that item can be considered to have changed, so that two quantities can be calculated to derive the splitting and merging of topics.
  • the two quantities are ratios of the elements from cluster s found in cluster r between time t-1 and time t.
  • the generation and end of topics can be detected by using a hash table.
  • each topic has a unique storage location in the hash table, which is used to detect the generation and end of that topic.
  • the first extraction module 101 may sort the topics of each text material according to the posting time of each text material.
  • the topic stream established by the first extraction module 101 represents the evolution of a plurality of topics over time, and the height of the topic stream at a given time may represent the number of documents belonging to the topic.
  • a topic in the topic stream can also split into several branches, and several branches can also merge into one topic.
  • the labeling module 105 is configured to identify the node locations at which each of the topics is generated, split, merged, and ended in the topic stream, and to mark those node locations with different marker symbols. For example, a solid circle represents the generation of a topic, an open circle represents the end of a topic, and three-pronged marks at different angles represent the splitting and merging of a topic, respectively.
  • the labeling module 105 can use a hash table and a hierarchical Dirichlet process (HDP) to identify the node locations at which each of the topics is generated, split, merged, and ended in the topic stream.
  • the node locations at which each of the topics is generated, split, merged, and ended are marked with different preset markers.
  • the labeling module 105 can also apply the marks in a color similar to that of the original topic.
  • the screening module 102 is configured to select a plurality of first topics containing important events from the topics.
  • the first topics are preferably topics that have split or merged.
  • the splitting and merging of topics can be quantified by scores.
  • an information entropy algorithm can be used to calculate the scores.
  • the score of a merged topic can be calculated by a formula in which:
  • R(r, t) is the ranking score of cluster r at time t;
  • N_r is the number of elements flowing into cluster r.
  • the score of a split topic can be calculated by a formula in which:
  • R(s, t) is the ranking score of cluster s at time t;
  • N_s is the number of elements flowing into cluster r.
  • the screening module 102 may sort the topics by their calculated scores (for example, in descending order) and select the top-ranked topics as the first topics containing important events. For example, the screening module 102 selects the ten highest-scoring topics as the first topics.
  • the first topics may also be marked with a particular color or symbol on the topic stream.
  • the second extraction module 103 is configured to extract keywords of each of the first topics, and determine the association relationship of the keywords of each of the first topics.
  • the second extraction module 103 may extract the keywords of each of the first topics using a TF-IDF algorithm.
  • the TF-IDF algorithm can be used to assess how important a word is to a topic text. The importance of a word increases in proportion to the number of times it appears in the text, and decreases with how often it appears across the other texts.
  • the TF-IDF value of a word is obtained from its term frequency (TF) and inverse document frequency (IDF); the more important the word is to the topic text, the higher its TF-IDF value.
  • the second extraction module 103 may take the words with the highest TF-IDF values as the keywords of the topic text. For example, the words whose TF-IDF values rank in the top five are used as the keywords of the first topic.
  • the second extraction module 103 may determine the association relationship of the keywords of each of the first topics by a hierarchical Dirichlet process (HDP). The second extraction module 103 may further determine the association relationship of the keywords of each of the first topics in combination with the node location of each first topic in the topic stream.
  • the generating module 104 is configured to add a keyword of each of the first topics and an association relationship thereof to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials.
  • the generating module 104 may visualize keywords of each of the first topics and their associated relationships as word clouds overlapping on the topic stream.
  • the topic evolution context map can be displayed by a display module (eg, a projection screen, a display, etc.).
  • the visual presentation system 100 for topic evolution proposed by the present application first extracts the topics of a plurality of text materials related to the same event and determines the association relationship between the topics to establish a topic stream; second, identifies the node locations at which each of the topics is generated, split, merged, and ended in the topic stream, and marks those node locations with different marker symbols; then selects a plurality of first topics containing important events from the topics; next, extracts keywords of each of the first topics and determines the association relationship of those keywords; and finally, adds the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the text materials.
  • in this way, topics can be mined from time-ordered social events, and the evolution trend of an event can be visualized as a topic stream over time, giving users a better understanding of how the topics evolve and of the major events. This avoids topic drift caused by topic association, and helps users understand the deeper meaning of the topics and avoid misunderstandings or wrong decisions.
  • the present application also proposes a visual display method for topic evolution.
  • FIG. 4 is a schematic flowchart of the first embodiment of the visual presentation method for topic evolution of the present application.
  • the order of execution of the steps in the flowchart shown in FIG. 4 may be changed according to different requirements, and some steps may be omitted.
  • Step S500: extract the topics of a plurality of text materials related to the same event, and determine the association relationship between the topics to establish a topic stream.
  • the text material may be online news text, and multiple news texts related to the same event may be extracted by accessing the network.
  • specifically, a plurality of news texts related to the event may be searched for and extracted from the network by inputting keywords of the event (for example, the place where the event occurs, the main persons involved, the event itself, etc.), and the topic of each news text is then extracted.
  • an event summary may be generated as the topic of a news text from elements such as the persons, places, and events of that news text.
  • the extracted text materials may be pre-processed before their topics are extracted.
  • the pre-processing may include: word segmentation, simplification, replacing ambiguous words, and removing stop words, low-frequency words, numbers, punctuation marks, and the like.
  • each topic can be modeled by a hierarchical Dirichlet process (HDP), and the i-th text item arriving at time t is recorded together with the cluster in which it is located. If the cluster labels of an item at two time points are inconsistent, the topic of that item can be considered to have changed, so that two quantities can be calculated to derive the splitting and merging of topics.
  • the two quantities are ratios of the elements from cluster s found in cluster r between time t-1 and time t.
  • the generation and end of topics can be detected by using a hash table.
  • each topic has a unique storage location in the hash table, which is used to detect the generation and end of that topic.
  • the topics of each text material may be ordered according to the posting time of each text material.
  • the created topic stream can represent the evolution of multiple topics over time, and the height of the topic stream can represent the number of documents belonging to that topic.
  • a topic in the topic stream can also split into several branches, and several branches can also merge into one topic.
  • Step S502: select a plurality of first topics containing important events from the topics.
  • the first topics are preferably topics that have split or merged.
  • the splitting and merging of topics can be quantified by scores.
  • an information entropy algorithm can be used to calculate the scores.
  • the score of a merged topic can be calculated by a formula in which:
  • R(r, t) is the ranking score of cluster r at time t;
  • N_r is the number of elements flowing into cluster r.
  • the score of a split topic can be calculated by a formula in which:
  • R(s, t) is the ranking score of cluster s at time t;
  • N_s is the number of elements flowing into cluster r.
  • the top-ranked topics in the score ordering may be selected as the first topics containing important events according to the calculated score of each topic. For example, the ten highest-scoring topics are taken as the first topics.
  • Step S504: extract keywords of each of the first topics, and determine the association relationship of the keywords of each of the first topics.
  • a TF-IDF algorithm may be used to extract the keywords of each of the first topics.
  • the TF-IDF algorithm can be used to assess how important a word is to a topic text. The importance of a word increases in proportion to the number of times it appears in the text, and decreases with how often it appears across the other texts.
  • the TF-IDF value of a word is obtained from its term frequency (TF) and inverse document frequency (IDF); the more important the word is to the topic text, the higher its TF-IDF value.
  • the words with the highest TF-IDF values can be used as the keywords of the topic text. For example, the words whose TF-IDF values rank in the top five are used as the keywords of the first topic.
  • the association relationship of the keywords of each of the first topics may also be determined by a hierarchical Dirichlet process (HDP).
  • the association relationship of the keywords of each of the first topics may be further determined in combination with the node location of each first topic in the topic stream.
  • Step S506: add the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials.
  • the keywords of each of the first topics and their associated relationships may be visualized as word clouds overlapping on the topic stream.
  • the topic evolution map can be displayed through projection screens, displays, and other devices.
  • the visual presentation method for topic evolution proposed by the present application first extracts the topics of a plurality of text materials related to the same event and determines the association relationship between the topics to establish a topic stream; second, selects a plurality of first topics containing important events from the topics; next, extracts keywords of each of the first topics and determines the association relationship of those keywords; and finally, adds the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the text materials.
  • in this way, topics can be mined from time-ordered social events, and the evolution trend of an event can be visualized as a topic stream over time, giving users a better understanding of how the topics evolve and of the major events. This avoids topic drift caused by topic association, and helps users understand the deeper meaning of the topics and avoid misunderstandings or wrong decisions.
  • FIG. 5 is a schematic flowchart of the second embodiment of the visual presentation method for topic evolution of the present application.
  • the order of execution of the steps in the flowchart shown in FIG. 5 may be changed according to different requirements, and some steps may be omitted.
  • Step S500: extracting topics of a plurality of text materials related to the same event, and determining an association relationship between the topics to establish a topic stream.
  • the text material may be online news text, and multiple news texts related to the same event may be retrieved over the network.
  • a plurality of news texts related to the event may be searched for and retrieved from the network by inputting keywords of the event (for example, the place where the event occurred, the main persons involved, or the cause of the event), and the topics are then extracted from the retrieved news texts.
  • an event summary may be generated as the topic of a news text from elements of that text such as the persons, places, and events it mentions.
  • the extracted plurality of text materials may be pre-processed before their topics are extracted.
  • the pre-processing may include: word segmentation, conversion between traditional and simplified Chinese, replacement of ambiguous words, and removal of stop words, low-frequency words, digits, and punctuation marks.
  • each topic can be modeled by a hierarchical Dirichlet process; the i-th text item arriving at time t and the cluster in which it is located are each denoted by a symbol (symbols not reproduced in this extraction). If the item carries different cluster labels at two points in time, its topic can be considered to have changed, and from this two quantities can be calculated to derive the splitting and merging of topics.
  • the two quantities are, from time t-1 to time t, the proportion of cluster r that comes from cluster s, and the proportion of cluster s that flows to cluster r (formula images not reproduced in this extraction).
  • the generation and termination of the subject matter can be detected by applying a hash table.
  • in the hash table, each topic corresponds to a unique storage location, so the generation and end of a topic can be detected through the hash table.
  • the topics of each text material may be ordered according to the posting time of each text material.
  • the created topic stream can represent the evolution of multiple topics over time, and the height of the topic stream can represent the number of documents belonging to that topic.
  • the topic stream can also divide into several branches, and several branches can also merge into one topic.
  • Step S508: identifying the node positions in the topic stream at which each of the topics is generated, split, merged, and ended, and marking those node positions with different marker symbols.
  • for example, a solid circle may represent the generation of a topic, an open circle may represent the end of a topic, and three-pronged marks at different angles may represent the splitting and the merging of a topic, respectively.
  • the hash table and the hierarchical Dirichlet process may be used to identify the generation, splitting, merging, and ending of each of the topics in the topic stream, and the node positions at which each topic is generated, split, merged, and ended are then marked with different preset markers. For topics produced by splitting or merging, a color similar to that of the original topic may also be chosen.
  • Step S502 selecting a plurality of first topics including important events from the plurality of the topics.
  • the plurality of first topics are preferably topics that have undergone splitting or merging.
  • the splitting and merging of topics can be represented by scores.
  • an information entropy algorithm can be used to calculate the score.
  • the score of a topic with merging can be calculated by the following formula (formula image not reproduced in this extraction):
  • where R(r,t) is the ranking score of cluster r at time t
  • and N_r is the number of elements flowing into cluster r
  • the score of a topic with splitting can be calculated by the following formula (formula image not reproduced in this extraction):
  • where R(s,t) is the ranking score of cluster s at time t
  • and N_s is the number of elements flowing into cluster r.
  • according to the calculated score of each topic, the topics ranked highest by score (with scores sorted in descending order) may be selected as the first topics containing the important events. For example, the ten topics with the highest scores are selected as the first topics.
  • the first topics may also be marked with a particular color or symbol on the topic stream.
  • Step S504 Extract keywords of each of the first topics, and determine an association relationship of keywords of each of the first topics.
  • a TF-IDF algorithm may be used to extract keywords for each of the first topics.
  • the TF-IDF algorithm can be used to assess how important a word is to a topic text. The importance of a word increases in proportion to the number of times it appears in the text.
  • the TF-IDF value of a word is obtained from its term frequency (TF) and its inverse document frequency (IDF); the more important the word is to the topic text, the larger its TF-IDF value.
  • the words with the highest TF-IDF values can be used as the keywords of the topic text. For example, the words whose TF-IDF values rank in the top five are used as the keywords of the first topic.
  • the association relationship of the keywords of each of the first topics may also be determined by a hierarchical Dirichlet process.
  • the association relationship of the keywords of each of the first topics may be further determined in combination with the node position of each of the first topics in the topic stream.
  • Step S506 adding keywords of each of the first topics and their associated relationships to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials.
  • the keywords of each of the first topics and their associated relationships may be visualized as word clouds overlapping on the topic stream.
  • the topic evolution map can be displayed through projection screens, displays, and other devices.
  • the visual presentation method for topic evolution proposed by the present application first extracts the topics of multiple text materials related to the same event and determines the association relationship between the topics to establish a topic stream; second, it identifies the node positions in the topic stream at which each of the topics is generated, split, merged, and ended, and marks those node positions with different marker symbols; third, it filters a plurality of first topics containing important events from the topics; fourth, it extracts the keywords of each of the first topics and determines the association relationship of those keywords; finally, it adds the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials.
  • in this way, topics can be mined from time-ordered social events, and the evolution trend of an event can be shown visually through a topic stream that changes over time, giving users a better understanding of how the topic evolved and of the major events within it, avoiding topic drift caused by topic association, and helping users grasp the deeper meaning of the topic and avoid wrong conclusions or decisions.
  • the methods of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation.
  • based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.

Abstract

Disclosed is a visual presentation method for topic evolution, comprising: extracting topics of a plurality of text materials relating to a same event, and determining an association between the topics so as to establish a topic stream; screening, from the plurality of topics, a plurality of first topics that contain important events; extracting keywords of each of the first topics, and determining an association between the keywords of each of the first topics; and adding the keywords of each of the first topics and the associations thereof to the topic stream, so as to generate a topic evolution context map corresponding to the plurality of text materials. Further provided in the present application are an application server and a computer readable storage medium. The visual presentation method for topic evolution, the application server, and the computer readable storage medium provided in the present application can visually display the topic evolution process of an event, so that users can quickly understand and analyze the evolution process of the whole event.

Description

Visual presentation method for topic evolution, application server, and computer-readable storage medium
This application claims priority to Chinese Patent Application No. 201810031859.7, filed with the Chinese Patent Office on January 12, 2018 and entitled "Visual presentation method for topic evolution, application server, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of image processing technologies, and in particular to a visual presentation method for topic evolution, an application server, and a computer-readable storage medium.
Background
In the era of information explosion, people can freely read and download all kinds of news reports on a news topic from the Internet. Because the number of news articles on the Internet about a given news topic (especially a trending one) is very large, it is difficult to learn the development trend and evolution of a target news topic efficiently and quickly from the many related reports. Yet understanding how certain topics on social media evolve matters greatly to investors, managers, and others: once they understand the deeper meaning of a topic, they can make sound judgments and act on them. With the prior art, however, analyzing how a topic evolves over time is difficult; each topic, and the major events and evolution context it contains, cannot be quickly detected and distinguished, and there is no effective mechanism for identifying the generation, end, splitting, and merging of topics.
Summary
In view of this, the present application proposes a visual presentation method for topic evolution, an application server, and a computer-readable storage medium, which can display the topic evolution process of an event visually, so that users can quickly understand and analyze the evolution of the whole event.
First, to achieve the above object, the present application provides an application server comprising a memory and a processor, the memory storing a visual presentation system for topic evolution that is executable on the processor, the visual presentation system for topic evolution implementing the following steps when executed by the processor:
extracting topics of a plurality of text materials related to the same event, and determining an association relationship between the topics to establish a topic stream;
filtering a plurality of first topics containing important events from the plurality of topics;
extracting keywords of each of the first topics, and determining an association relationship of the keywords of each of the first topics; and
adding the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials.
In addition, to achieve the above object, the present application further provides a visual presentation method for topic evolution, applied to an application server, the method comprising:
extracting topics of a plurality of text materials related to the same event, and determining an association relationship between the topics to establish a topic stream;
filtering a plurality of first topics containing important events from the plurality of topics;
extracting keywords of each of the first topics, and determining an association relationship of the keywords of each of the first topics; and
adding the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials.
Further, to achieve the above object, the present application also provides a computer-readable storage medium storing a visual presentation system for topic evolution, the system being executable by at least one processor to cause the at least one processor to perform the steps of the visual presentation method for topic evolution described above.
Compared with the prior art, the visual presentation method for topic evolution, the application server, and the computer-readable storage medium proposed by the present application first extract the topics of multiple text materials related to the same event and determine the association relationship between the topics to establish a topic stream; second, filter a plurality of first topics containing important events from the topics; third, extract the keywords of each of the first topics and determine the association relationship of those keywords; and finally, add the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials. In this way, topics can be mined from time-ordered social events, and the evolution trend of an event can be shown visually through a topic stream that changes over time, giving users a better understanding of how the topic evolved and of the major events within it, avoiding topic drift caused by topic association, and helping users grasp the deeper meaning of the topic and avoid wrong conclusions or decisions.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the application server of the present application;
FIG. 2 is a schematic diagram of the program modules of a first embodiment of the visual presentation system for topic evolution of the present application;
FIG. 3 is a schematic diagram of the program modules of a second embodiment of the visual presentation system for topic evolution of the present application;
FIG. 4 is a schematic flowchart of the implementation of a first embodiment of the visual presentation method for topic evolution of the present application;
FIG. 5 is a schematic flowchart of the implementation of a second embodiment of the visual presentation method for topic evolution of the present application.
Reference numerals:
Application server: 2
Memory: 11
Processor: 12
Network interface: 13
Visual presentation system for topic evolution: 100
First extraction module: 101
Screening module: 102
Second extraction module: 103
Generation module: 104
Marking module: 105
The implementation, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely explain the present application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can implement the combination; when a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and not to fall within the scope of protection claimed by the present application.
Referring to FIG. 1, there is shown a schematic diagram of an optional hardware architecture of the application server 2 of the present application.
In this embodiment, the application server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another through a system bus. It should be noted that FIG. 1 shows only the application server 2 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The application server 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server. The application server 2 may be a stand-alone server or a server cluster composed of multiple servers.
The memory 11 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the application server 2, such as a hard disk or main memory of the application server 2. In other embodiments, the memory 11 may be an external storage device of the application server 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the application server 2. The memory 11 may of course include both an internal storage unit of the application server 2 and an external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the application server 2, such as the program code of the visual presentation system 100 for topic evolution. The memory 11 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the application server 2, for example to perform control and processing related to data interaction or communication with the terminal device 1. In this embodiment, the processor 12 is used to run the program code or process the data stored in the memory 11, for example to run the visual presentation system 100 for topic evolution.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the application server 2 and other electronic devices.
The hardware structure and functions of the devices related to the present application have now been described in detail. Embodiments of the present application are presented below on the basis of that description.
First, the present application proposes a visual presentation system 100 for topic evolution.
Referring to FIG. 2, there is shown a program module diagram of the first embodiment of the visual presentation system 100 for topic evolution of the present application.
In this embodiment, the visual presentation system 100 for topic evolution includes a series of computer program instructions stored in the memory 11; when these instructions are executed by the processor 12, the visual presentation operations for topic evolution of the embodiments of the present application can be implemented. In some embodiments, the visual presentation system 100 for topic evolution may be divided into one or more modules based on the specific operations implemented by the respective parts of the computer program instructions. For example, in FIG. 2, the visual presentation system 100 for topic evolution is divided into a first extraction module 101, a screening module 102, a second extraction module 103, and a generation module 104. Specifically:
The first extraction module 101 is configured to extract topics of a plurality of text materials related to the same event, and to determine an association relationship between the topics to establish a topic stream.
In an embodiment, the text material may be online news text, and the first extraction module 101 may retrieve multiple news texts related to the same event over the network. Specifically, a plurality of news texts related to an event may be searched for and retrieved from the network by inputting keywords of the event (for example, the place where the event occurred, the main persons involved, or the cause of the event), and the topics are then extracted from the retrieved news texts. The first extraction module 101 may obtain elements of the current news text such as its persons, places, and events, and generate an event summary from those elements as the topic of the news text.
In an embodiment, the first extraction module 101 is further configured to pre-process the extracted plurality of text materials. The pre-processing may include: word segmentation, conversion between traditional and simplified Chinese, replacement of ambiguous words, and removal of stop words, low-frequency words, digits, and punctuation marks.
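The removal steps of the pre-processing could be sketched as follows. This is only a minimal illustration under stated assumptions: the input is assumed to be whitespace-delimited text, and the stop-word list, the `min_count` threshold, and the function name `preprocess` are hypothetical and not specified by the application. Word segmentation of Chinese text and traditional/simplified conversion would need a dedicated library and are omitted here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the application does not specify one.
STOP_WORDS = {"the", "a", "of", "in", "and"}

def preprocess(docs, min_count=2):
    """Tokenize, then drop stop words, digits, punctuation and low-frequency words."""
    tokenized = []
    for doc in docs:
        # Keep runs of letters only, which removes digits and punctuation.
        tokens = re.findall(r"[^\W\d_]+", doc.lower())
        tokenized.append([t for t in tokens if t not in STOP_WORDS])
    # Count each word across the whole corpus to find low-frequency words.
    counts = Counter(t for tokens in tokenized for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in tokenized]
```

With two short documents about the same flood, only the words that occur at least twice across the corpus survive, which is the intended effect of the low-frequency filter.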
In an embodiment, the first extraction module 101 may model each topic with a hierarchical Dirichlet process. The i-th text item arriving at time t and the cluster in which it is located are each denoted by a symbol (formula images PCTCN2018090694-appb-000001 and PCTCN2018090694-appb-000002, not reproduced here). If the item carries different cluster labels at two points in time, that is, if the two labels (formula images PCTCN2018090694-appb-000004 and PCTCN2018090694-appb-000005) are inconsistent, the topic of the item can be considered to have changed. From this, two quantities can be calculated to derive the splitting and merging of topics. The first is the proportion of cluster r that comes from cluster s, from time t-1 to time t:
[Formula image PCTCN2018090694-appb-000007 not reproduced]
The second is the proportion of cluster s that flows to cluster r, from time t-1 to time t:
[Formula image PCTCN2018090694-appb-000008 not reproduced]
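Because the formula images are not available in this text, the following is only a sketch of how the two proportions described in words above might be computed; it is not the application's own formula. It assumes that the cluster label of every document is known at time t-1 and at time t; the function name `flow_ratios` and the input format are hypothetical.

```python
from collections import Counter

def flow_ratios(labels_prev, labels_curr):
    """Return (inflow, outflow) proportion tables between consecutive time steps.

    labels_prev[i] and labels_curr[i] are the cluster labels of document i
    at time t-1 and at time t (an assumed input format).
    """
    pairs = Counter(zip(labels_prev, labels_curr))  # (s, r) -> number of documents
    size_prev = Counter(labels_prev)                # documents in s at time t-1
    size_curr = Counter(labels_curr)                # documents in r at time t
    # Proportion of cluster r (at t) that came from cluster s (at t-1).
    inflow = {(s, r): n / size_curr[r] for (s, r), n in pairs.items()}
    # Proportion of cluster s (at t-1) that flowed to cluster r (at t).
    outflow = {(s, r): n / size_prev[s] for (s, r), n in pairs.items()}
    return inflow, outflow
```

In this sketch, a cluster whose outflow is spread over several destinations would be a candidate split, and a cluster whose inflow comes from several sources would be a candidate merge.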
In an embodiment, the generation and end of a topic can be detected by using a hash table. In the hash table, each topic corresponds to a unique storage location, so the generation and end of a topic can be detected through the hash table.
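One way to read the hash-table idea is sketched below, using a Python `dict` as the hash table. The input format (one set of topic ids per time step) and the function name `topic_births_and_deaths` are assumptions made for illustration; the application does not describe the table layout.

```python
def topic_births_and_deaths(topics_by_time):
    """Detect when each topic first appears and when it stops appearing.

    topics_by_time is a list of sets of topic ids, one set per time step
    (an assumed input format). A dict serves as the hash table, giving
    each topic id a unique entry.
    """
    table = {}  # topic id -> [first_seen, last_seen]
    for t, topics in enumerate(topics_by_time):
        for topic in topics:
            if topic in table:
                table[topic][1] = t    # topic still present: update last_seen
            else:
                table[topic] = [t, t]  # new key: the topic is generated at t
    last_step = len(topics_by_time) - 1
    births = {topic: first for topic, (first, _) in table.items()}
    # A topic is treated as ended if it is absent from the final time step.
    deaths = {topic: last for topic, (_, last) in table.items() if last < last_step}
    return births, deaths
```

A lookup miss marks the generation of a topic, and an entry that stops being updated marks its end.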
In an embodiment, the first extraction module 101 may order the topics of the text materials according to the publication time of each text material. The topic stream established by the first extraction module 101 represents the evolution of multiple topics over time, and the height of the topic stream may represent the number of documents belonging to a topic. The topic stream can also divide into several branches, and several branches can also merge into one topic.
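The data behind such a stream could be prepared roughly as follows. This is a sketch assuming each document has already been assigned a publication time and a topic id; the `(time, topic)` pair format and the name `stream_heights` are hypothetical.

```python
from collections import Counter, defaultdict

def stream_heights(docs):
    """Build per-time-step document counts for each topic.

    docs is an iterable of (publish_time, topic_id) pairs (assumed format).
    Returns {topic_id: [(time, count), ...]} sorted by time, where count
    would be the height of that topic's band in the stream at that time.
    """
    counts = Counter(docs)  # (time, topic) -> number of documents
    heights = defaultdict(list)
    for (time, topic), n in sorted(counts.items()):
        heights[topic].append((time, n))
    return dict(heights)
```

The resulting series could be passed to any stacked-area or streamgraph plotting routine; the choice of plotting library is left open here.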
The screening module 102 is configured to filter a plurality of first topics containing important events from the plurality of topics.
In an embodiment, the plurality of first topics are preferably topics that have undergone splitting or merging. The splitting and merging of topics can be expressed as scores. Specifically, an information entropy algorithm can be used to calculate the scores. The score of a topic with merging can be calculated by the following formula:
[Formula image PCTCN2018090694-appb-000009 not reproduced]
where R(r,t) is the ranking score of cluster r at time t and N_r is the number of elements flowing into cluster r. The score of a topic with splitting can be calculated by the following formula:
[Formula image PCTCN2018090694-appb-000010 not reproduced]
where R(s,t) is the ranking score of cluster s at time t and N_s is the number of elements flowing into cluster r.
According to the calculated score of each topic, the screening module 102 may select the topics ranked highest by score (with scores sorted in descending order) as the first topics containing the important events. For example, the screening module 102 selects the ten topics with the highest scores as the first topics.
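The scoring formulas themselves are not recoverable from this text, so the sketch below substitutes a plain Shannon entropy over flow proportions as a stand-in score; that substitution is an assumption, not the application's formula. The names `entropy_score` and `top_topics` are hypothetical. Only the final step, selecting the highest-scoring topics, follows directly from the description.

```python
import math

def entropy_score(flow_counts):
    """Shannon entropy of the flow proportions (an assumed stand-in score).

    flow_counts maps a source (or destination) cluster to the number of
    documents exchanged with the cluster being scored.
    """
    total = sum(flow_counts.values())
    if total == 0:
        return 0.0
    probs = [n / total for n in flow_counts.values() if n > 0]
    return sum(-p * math.log2(p) for p in probs)

def top_topics(scores, k=10):
    """Return the k topic ids with the highest scores, in descending order."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Under this stand-in, a cluster fed evenly by two clusters scores higher than one fed by a single cluster, so merges and splits rise to the top of the ranking.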
The second extraction module 103 is configured to extract the keywords of each of the first topics, and to determine the association relationship of the keywords of each of the first topics.
In an embodiment, the second extraction module 103 may use the TF-IDF algorithm to extract the keywords of each of the first topics. The TF-IDF algorithm can be used to assess how important a word is to a topic text. The importance of a word increases in proportion to the number of times it appears in the text. In the TF-IDF calculation, the TF-IDF value of a word is obtained from its term frequency (TF) and its inverse document frequency (IDF); the more important the word is to the topic text, the larger its TF-IDF value. The second extraction module 103 can therefore take the words with the highest TF-IDF values as the keywords of the topic text. For example, the words whose TF-IDF values rank in the top five are used as the keywords of the first topic.
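A minimal TF-IDF keyword extraction could look like the following. It is a sketch that assumes each first topic is represented by a single list of tokens and uses the common unsmoothed definition (term frequency times the natural logarithm of the inverse document frequency); the application does not state which TF-IDF variant it uses, and the name `tfidf_keywords` is hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(topic_docs, top_n=5):
    """Return the top_n TF-IDF words for each topic.

    topic_docs maps a topic id to its list of tokens (assumed format);
    each topic's text is treated as one document when computing IDF.
    """
    n_docs = len(topic_docs)
    # Document frequency: in how many topic texts each word occurs.
    df = Counter(w for tokens in topic_docs.values() for w in set(tokens))
    keywords = {}
    for topic, tokens in topic_docs.items():
        tf = Counter(tokens)
        scores = {
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[topic] = ranked[:top_n]
    return keywords
```

Note that in this variant a word occurring in every topic text receives a score of zero, so words shared by all topics drop to the bottom of each ranking.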
在一实施方式中,所述第二提取模块103可以通过分层狄利克雷过程确定每一所述第一主题的关键字的关联关系。所述第二提取模块103还可以进一步结合每一所述第一主题在主题流的节点位置来确定每一所述第一主题的关键字的关联关系。In an embodiment, the second extraction module 103 may determine an association relationship of keywords of each of the first topics by a layered Dirichlet process. The second extraction module 103 may further determine the association relationship of the keywords of each of the first topics in combination with each of the first topics at a node location of the topic stream.
所述生成模块104用于将每一所述第一主题的关键字及其关联关系添加至所述主题流,以生成与所述多个文本资料对应的话题演变脉络图。The generating module 104 is configured to add a keyword of each of the first topics and an association relationship thereof to the topic stream to generate a topic evolution context map corresponding to the plurality of text materials.
在一实施方式中,所述生成模块104可以将每一所述第一主题的关键字及其关联关系可视化为词云交叠在所述主题流上。话题演变脉络图可以通过显示模块(图未示)进行显示。In an embodiment, the generating module 104 may visualize keywords of each of the first topics and their associated relationships as word clouds overlapping on the topic stream. The topic evolution context map can be displayed by a display module (not shown).
Through the above program modules 101-104, the topic evolution visualization system 100 proposed by the present application first extracts the topics of a plurality of text materials relating to the same event and determines the association relationships among the topics to establish a topic stream; second, it screens out, from the plurality of topics, a plurality of first topics containing important events; third, it extracts the keywords of each of the first topics and determines the association relationships among those keywords; finally, it adds the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution map corresponding to the plurality of text materials. In this way, the topics of time-sequenced social events can be mined, and the evolution trend of an event can be presented visually as a topic stream that changes over time. Users can thus better understand how a topic evolves and which major events occur in it, topic drift caused by topic association is avoided, and users are helped to grasp the deeper meaning of the topic and to avoid mistaken perceptions or decisions.
Referring to FIG. 3, a program module diagram of the second embodiment of the topic evolution visualization system 100 of the present application is shown. In this embodiment, the topic evolution visualization system 100 includes a series of computer program instructions stored in the memory 11; when these computer program instructions are executed by the processor 12, the topic evolution visualization operations of the embodiments of the present application can be implemented. In some embodiments, the topic evolution visualization system 100 may be divided into one or more modules based on the specific operations implemented by the respective parts of the computer program instructions. For example, in FIG. 3, the topic evolution visualization system 100 may be divided into a first extraction module 101, a screening module 102, a second extraction module 103, a generation module 104, and a labeling module 105. The program modules 101-104 are the same as in the first embodiment of the topic evolution visualization system 100, and the labeling module 105 is added on that basis. Specifically:
The first extraction module 101 is configured to extract the topics of a plurality of text materials relating to the same event and to determine the association relationships among the topics, so as to establish a topic stream.
In an embodiment, the text materials may be online news texts, and the first extraction module 101 may extract a plurality of news texts relating to the same event by accessing a network. Specifically, keywords of an event (for example, the place where the event occurred, the main persons involved, or the cause) may be entered to search the network for and extract a plurality of news texts relating to the event, and the topics are then extracted from the extracted news texts. The first extraction module 101 may obtain elements of the current news text such as persons, places, and events, and generate an event summary based on these elements as the topic of the news text.
In an embodiment, the first extraction module 101 is further configured to preprocess the extracted text materials. The preprocessing may include: segmenting the text materials, converting between traditional and simplified Chinese, replacing ambiguous words, and removing stop words, low-frequency words, numbers, punctuation marks, and the like.
In an embodiment, the first extraction module 101 may model each topic by a hierarchical Dirichlet process. The i-th text material arriving at time t is denoted as
Figure PCTCN2018090694-appb-000011
and the cluster to which it belongs is denoted as
Figure PCTCN2018090694-appb-000012
If, at two points in time, the cluster labels of
Figure PCTCN2018090694-appb-000013
differ, that is,
Figure PCTCN2018090694-appb-000014
and
Figure PCTCN2018090694-appb-000015
are inconsistent, then the topic of
Figure PCTCN2018090694-appb-000016
can be considered to have changed. On this basis, two quantities can be calculated to derive the splitting and merging of topics. The first is the proportion of cluster r that comes from cluster s from time t-1 to time t:
Figure PCTCN2018090694-appb-000017
The second is the proportion of cluster s that flows to cluster r from time t-1 to time t:
Figure PCTCN2018090694-appb-000018
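The two proportions can be sketched as follows. Because the formulas themselves are available only as figures, this is one plausible reading based on the verbal definitions: count the text materials that move from cluster s at time t-1 to cluster r at time t, then divide by the size of r (first quantity) or the size of s (second quantity). The function name `flow_ratios`, the dictionary inputs, and the choice to count only materials labeled at both times are assumptions.

```python
from collections import Counter

def flow_ratios(prev_labels, curr_labels):
    """Compute cluster flow proportions between two time steps.

    prev_labels / curr_labels map a document id to its cluster label at
    time t-1 and time t. Only documents labeled at both times are counted
    (an assumption; the source formulas are not reproduced here).
    Returns (into_r, out_of_s), both keyed by (s, r).
    """
    shared = prev_labels.keys() & curr_labels.keys()
    # Number of documents moving from cluster s to cluster r.
    pair = Counter((prev_labels[d], curr_labels[d]) for d in shared)
    size_prev = Counter(prev_labels[d] for d in shared)
    size_curr = Counter(curr_labels[d] for d in shared)
    # Share of cluster r (at t) that came from cluster s (at t-1).
    into_r = {(s, r): n / size_curr[r] for (s, r), n in pair.items()}
    # Share of cluster s (at t-1) that flowed to cluster r (at t).
    out_of_s = {(s, r): n / size_prev[s] for (s, r), n in pair.items()}
    return into_r, out_of_s
```

Under this reading, a cluster r that receives sizeable shares from several clusters s indicates a merge, while a cluster s whose members spread over several clusters r indicates a split.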
In an embodiment, the emergence and end of a topic may be detected by using a hash table. In the hash table, each topic corresponds to a unique storage location, so the emergence and end of a topic can be detected through the hash table.
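As a rough illustration of the hash-table approach, the sketch below uses a Python dict (a hash table) keyed by topic id: the first time a topic is seen marks its emergence, and the last time it is seen marks its end. The function name `topic_lifespans` and the input layout are assumptions; the application does not specify how the table is populated.

```python
def topic_lifespans(active_by_time):
    """Map each topic id to (emergence_time, end_time) using a dict.

    active_by_time is a list of (time, iterable of active topic ids),
    assumed to be sorted by time.
    """
    spans = {}
    for t, topics in active_by_time:
        for topic in topics:
            if topic in spans:
                spans[topic][1] = t    # topic still active: extend last-seen time
            else:
                spans[topic] = [t, t]  # first sighting: the topic emerges here
    return {topic: (first, last) for topic, (first, last) in spans.items()}
```

Whether a topic that is still active at the final time step should count as ended is left open here; a caller could treat such topics as ongoing.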
In an embodiment, the first extraction module 101 may sort the topics of the text materials according to the publication time of each text material. The topic stream established by the first extraction module 101 represents the evolution of a plurality of topics over time, and the height of the topic stream may represent the number of documents belonging to the topic. A topic stream may also divide into several branches, and several branches may merge into one topic.
The labeling module 105 is configured to identify the node positions in the topic stream at which each topic emerges, splits, merges, and ends, and to mark these node positions with different marker symbols. For example, a solid circle represents the emergence of a topic, a hollow circle represents the end of a topic, and three-pronged markers at different angles represent the splitting and merging of topics, respectively.
In an embodiment, the labeling module 105 may use the hash table and the hierarchical Dirichlet process to identify the node positions in the topic stream at which each topic emerges, splits, merges, and ends, and may then mark these node positions with different preset marker symbols. For split and merged topics, the labeling module 105 may also use a color similar to that of the original topic.
The screening module 102 is configured to screen out, from the plurality of topics, a plurality of first topics containing important events.
In an embodiment, the first topics are preferably topics that undergo splitting or merging. The splitting and merging of topics may be expressed as scores. Specifically, an information entropy algorithm may be used to calculate the scores. The score of a merged topic may be calculated by the following formula:
Figure PCTCN2018090694-appb-000019
where R(r, t) is the ranking score of cluster r at time t and N_r is the number of elements flowing into cluster r. The score of a split topic may be calculated by the following formula:
Figure PCTCN2018090694-appb-000020
where R(s, t) is the ranking score of cluster s at time t and N_s is the number of elements flowing into cluster r.
The screening module 102 may, according to the calculated score of each topic, select the top-ranked topics (with the scores sorted in descending order) as the first topics containing the important events. For example, the screening module 102 selects the ten highest-scoring topics as the first topics. The first topics may also be marked on the topic stream with a specific color or marker symbol.
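The score-then-select step can be sketched as below. The scoring formulas are available only as figures, so this sketch substitutes plain Shannon entropy over a topic's flow proportions and does not reproduce any weighting by N_r or N_s; that substitution, and the names `entropy` and `top_topics`, are assumptions made for illustration only.

```python
import math

def entropy(proportions):
    """Shannon entropy (natural log) of a list of proportions."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

def top_topics(flows, n=10):
    """Rank topics by the entropy of their flow proportions, highest first.

    flows maps a topic id to the proportions of its inflow (merge) or
    outflow (split). A topic fed by a single source scores 0.
    Ties are broken by topic id.
    """
    scores = {topic: entropy(ps) for topic, ps in flows.items()}
    return sorted(scores, key=lambda topic: (-scores[topic], topic))[:n]
```

Under this stand-in score, a topic whose inflow or outflow is spread evenly over several clusters ranks above one dominated by a single cluster, which matches the stated preference for topics that split or merge.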
The second extraction module 103 is configured to extract the keywords of each of the first topics and to determine the association relationships among the keywords of each of the first topics.
In an embodiment, the second extraction module 103 may use the TF-IDF algorithm to extract the keywords of each of the first topics. The TF-IDF algorithm evaluates how important a word is to a topic text. The importance of a word increases in proportion to the number of times it appears in the text. In the TF-IDF calculation, the TF-IDF value of a word is obtained from its term frequency (TF) and inverse document frequency (IDF); the more important the word is to the topic text, the larger its TF-IDF value. The second extraction module 103 may therefore take the words with the highest TF-IDF values as the keywords of the topic text. For example, the five words with the highest TF-IDF values are used as the keywords of the first topic.
In an embodiment, the second extraction module 103 may determine the association relationships among the keywords of each of the first topics by a hierarchical Dirichlet process. The second extraction module 103 may further take into account the node position of each of the first topics in the topic stream when determining the association relationships among its keywords.
The generation module 104 is configured to add the keywords of each of the first topics and their association relationships to the topic stream, so as to generate a topic evolution map corresponding to the plurality of text materials.
In an embodiment, the generation module 104 may visualize the keywords of each of the first topics and their association relationships as word clouds overlaid on the topic stream. The topic evolution map may be displayed by a display module (for example, a projection screen or a display).
Through the above program modules 101-105, the topic evolution visualization system 100 proposed by the present application first extracts the topics of a plurality of text materials relating to the same event and determines the association relationships among the topics to establish a topic stream; second, it identifies the node positions in the topic stream at which each topic emerges, splits, merges, and ends, and marks these node positions with different marker symbols; third, it screens out, from the plurality of topics, a plurality of first topics containing important events; fourth, it extracts the keywords of each of the first topics and determines the association relationships among those keywords; finally, it adds the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution map corresponding to the plurality of text materials. In this way, the topics of time-sequenced social events can be mined, and the evolution trend of an event can be presented visually as a topic stream that changes over time. Users can thus better understand how a topic evolves and which major events occur in it, topic drift caused by topic association is avoided, and users are helped to grasp the deeper meaning of the topic and to avoid mistaken perceptions or decisions.
In addition, the present application also proposes a topic evolution visualization method.
Referring to FIG. 4, a flowchart of the first embodiment of the topic evolution visualization method of the present application is shown. In this embodiment, depending on requirements, the order of execution of the steps in the flowchart shown in FIG. 4 may be changed, and some steps may be omitted.
Step S500: extract the topics of a plurality of text materials relating to the same event, and determine the association relationships among the topics, so as to establish a topic stream.
In an embodiment, the text materials may be online news texts, and a plurality of news texts relating to the same event may be extracted by accessing a network. Specifically, keywords of an event (for example, the place where the event occurred, the main persons involved, or the cause) may be entered to search the network for and extract a plurality of news texts relating to the event, and the topics are then extracted from the extracted news texts.
In an embodiment, elements of the current news text such as persons, places, and events may be obtained, and an event summary may be generated based on these elements as the topic of the news text.
In an embodiment, the extracted text materials may be preprocessed before their topics are extracted. The preprocessing may include: segmenting the text materials, converting between traditional and simplified Chinese, replacing ambiguous words, and removing stop words, low-frequency words, numbers, punctuation marks, and the like.
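Part of the preprocessing listed above can be sketched in a few lines. This simplified example assumes whitespace-delimited text, so word segmentation reduces to a regular expression; Chinese word segmentation, traditional/simplified conversion, and ambiguous-word replacement would need dedicated tools and are omitted. The function name `preprocess`, the stop-word set, and the `min_count` threshold are illustrative assumptions.

```python
import re
from collections import Counter

def preprocess(docs, stop_words, min_count=2):
    """Tokenize documents and drop stop words, numbers, punctuation and rare words."""
    # \w+ keeps runs of word characters and discards punctuation.
    tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]
    # Drop purely numeric tokens and stop words.
    tokenized = [
        [tok for tok in toks if not tok.isdigit() and tok not in stop_words]
        for toks in tokenized
    ]
    # Count over the whole corpus, then drop low-frequency words.
    freq = Counter(tok for toks in tokenized for tok in toks)
    return [[tok for tok in toks if freq[tok] >= min_count] for toks in tokenized]
```

The low-frequency filter is applied last so that its counts are not inflated by stop words or numbers that would be removed anyway.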
In an embodiment, each topic may be modeled by a hierarchical Dirichlet process. The i-th text material arriving at time t is denoted as
Figure PCTCN2018090694-appb-000021
and the cluster to which it belongs is denoted as
Figure PCTCN2018090694-appb-000022
If, at two points in time, the cluster labels of
Figure PCTCN2018090694-appb-000023
differ, that is,
Figure PCTCN2018090694-appb-000024
and
Figure PCTCN2018090694-appb-000025
are inconsistent, then the topic of
Figure PCTCN2018090694-appb-000026
can be considered to have changed. On this basis, two quantities can be calculated to derive the splitting and merging of topics. The first is the proportion of cluster r that comes from cluster s from time t-1 to time t:
Figure PCTCN2018090694-appb-000027
The second is the proportion of cluster s that flows to cluster r from time t-1 to time t:
Figure PCTCN2018090694-appb-000028
In an embodiment, the emergence and end of a topic may be detected by using a hash table. In the hash table, each topic corresponds to a unique storage location, so the emergence and end of a topic can be detected through the hash table.
In an embodiment, the topics of the text materials may be sorted according to the publication time of each text material. The established topic stream may represent the evolution of a plurality of topics over time, and the height of the topic stream may represent the number of documents belonging to the topic. A topic stream may also divide into several branches, and several branches may merge into one topic.
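The relationship between publication time, topic, and stream height can be illustrated with a short sketch: documents are grouped by (time, topic), and the count in each group gives the height of that topic's stream at that time. The function name `stream_heights` and the (publication_time, topic_id) input pairs are assumptions for illustration.

```python
from collections import Counter, defaultdict

def stream_heights(docs):
    """Return {topic: [(time, document_count), ...]} ordered by time.

    docs is an iterable of (publication_time, topic_id) pairs; times and
    topic ids are assumed to be sortable.
    """
    counts = Counter(docs)  # (time, topic) -> number of documents
    heights = defaultdict(list)
    for (time, topic), n in sorted(counts.items()):
        heights[topic].append((time, n))
    return dict(heights)
```

The resulting per-topic series could be handed to any stacked-area or stream-graph renderer; the rendering itself is outside the scope of this sketch.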
Step S502: screen out, from the plurality of topics, a plurality of first topics containing important events.
In an embodiment, the first topics are preferably topics that undergo splitting or merging. The splitting and merging of topics may be expressed as scores. Specifically, an information entropy algorithm may be used to calculate the scores. The score of a merged topic may be calculated by the following formula:
Figure PCTCN2018090694-appb-000029
where R(r, t) is the ranking score of cluster r at time t and N_r is the number of elements flowing into cluster r. The score of a split topic may be calculated by the following formula:
Figure PCTCN2018090694-appb-000030
where R(s, t) is the ranking score of cluster s at time t and N_s is the number of elements flowing into cluster r.
In an embodiment, the top-ranked topics (with the scores sorted in descending order) may be selected, according to the calculated score of each topic, as the first topics containing the important events. For example, the ten highest-scoring topics are selected as the first topics.
Step S504: extract the keywords of each of the first topics, and determine the association relationships among the keywords of each of the first topics.
In an embodiment, the TF-IDF algorithm may be used to extract the keywords of each of the first topics. The TF-IDF algorithm evaluates how important a word is to a topic text. The importance of a word increases in proportion to the number of times it appears in the text. In the TF-IDF calculation, the TF-IDF value of a word is obtained from its term frequency (TF) and inverse document frequency (IDF); the more important the word is to the topic text, the larger its TF-IDF value. The words with the highest TF-IDF values may be taken as the keywords of the topic text. For example, the five words with the highest TF-IDF values are used as the keywords of the first topic.
In an embodiment, the association relationships among the keywords of each of the first topics may also be determined by a hierarchical Dirichlet process.
In an embodiment, the node position of each of the first topics in the topic stream may further be taken into account when determining the association relationships among its keywords.
Step S506: add the keywords of each of the first topics and their association relationships to the topic stream, so as to generate a topic evolution map corresponding to the plurality of text materials.
In an embodiment, the keywords of each of the first topics and their association relationships may be visualized as word clouds overlaid on the topic stream. The topic evolution map may be displayed by a device such as a projection screen or a display.
Through the above steps S500-S506, the topic evolution visualization method proposed by the present application first extracts the topics of a plurality of text materials relating to the same event and determines the association relationships among the topics to establish a topic stream; second, it screens out, from the plurality of topics, a plurality of first topics containing important events; third, it extracts the keywords of each of the first topics and determines the association relationships among those keywords; finally, it adds the keywords of each of the first topics and their association relationships to the topic stream to generate a topic evolution map corresponding to the plurality of text materials. In this way, the topics of time-sequenced social events can be mined, and the evolution trend of an event can be presented visually as a topic stream that changes over time. Users can thus better understand how a topic evolves and which major events occur in it, topic drift caused by topic association is avoided, and users are helped to grasp the deeper meaning of the topic and to avoid mistaken perceptions or decisions.
Referring to FIG. 5, a flowchart of the second embodiment of the topic evolution visualization method of the present application is shown. In this embodiment, depending on requirements, the order of execution of the steps in the flowchart shown in FIG. 5 may be changed, and some steps may be omitted.
Step S500: extract the topics of a plurality of text materials relating to the same event, and determine the association relationships among the topics, so as to establish a topic stream.
In an embodiment, the text materials may be online news texts, and a plurality of news texts relating to the same event may be extracted by accessing a network. Specifically, keywords of an event (for example, the place where the event occurred, the main persons involved, or the cause) may be entered to search the network for and extract a plurality of news texts relating to the event, and the topics are then extracted from the extracted news texts.
In an embodiment, elements of the current news text such as persons, places, and events may be obtained, and an event summary may be generated based on these elements as the topic of the news text.
In an embodiment, the extracted text materials may be preprocessed before their topics are extracted. The preprocessing may include: segmenting the text materials, converting between traditional and simplified Chinese, replacing ambiguous words, and removing stop words, low-frequency words, numbers, punctuation marks, and the like.
In an embodiment, each topic may be modeled by a hierarchical Dirichlet process. The i-th text material arriving at time t is denoted as
Figure PCTCN2018090694-appb-000031
and the cluster to which it belongs is denoted as
Figure PCTCN2018090694-appb-000032
If, at two points in time, the cluster labels of
Figure PCTCN2018090694-appb-000033
differ, that is,
Figure PCTCN2018090694-appb-000034
and
Figure PCTCN2018090694-appb-000035
are inconsistent, then the topic of
Figure PCTCN2018090694-appb-000036
can be considered to have changed. On this basis, two quantities can be calculated to derive the splitting and merging of topics. The first is the proportion of cluster r that comes from cluster s from time t-1 to time t:
Figure PCTCN2018090694-appb-000037
The second is the proportion of cluster s that flows to cluster r from time t-1 to time t:
Figure PCTCN2018090694-appb-000038
In an embodiment, the emergence and end of a topic may be detected by using a hash table. In the hash table, each topic corresponds to a unique storage location, so the emergence and end of a topic can be detected through the hash table.
In an embodiment, the topics of the text materials may be sorted according to the publication time of each text material. The established topic stream may represent the evolution of a plurality of topics over time, and the height of the topic stream may represent the number of documents belonging to the topic. A topic stream may also divide into several branches, and several branches may merge into one topic.
Step S508: identify the node positions in the topic stream at which each topic emerges, splits, merges, and ends, and mark these node positions with different marker symbols. For example, a solid circle represents the emergence of a topic, a hollow circle represents the end of a topic, and three-pronged markers at different angles represent the splitting and merging of topics, respectively.
In an embodiment, the hash table and the hierarchical Dirichlet process may be used to identify the node positions in the topic stream at which each topic emerges, splits, merges, and ends, and these node positions may then be marked with different preset marker symbols. For split and merged topics, a color similar to that of the original topic may also be used.
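One way to picture the node identification is the sketch below, which classifies topics at a single time step from the set of observed cluster-to-cluster transitions and then looks up a marker for each node type. The classification rules (no incoming flow means emergence, no outgoing flow means end, more than one target means split, more than one source means merge), the function name `classify_nodes`, and the marker names in `MARKERS` are all assumptions; the application describes only the marker shapes.

```python
from collections import defaultdict

# Hypothetical symbol names; the text describes only the shapes.
MARKERS = {
    "birth": "solid_circle",
    "end": "hollow_circle",
    "split": "trident_split",
    "merge": "trident_merge",
}

def classify_nodes(prev_topics, curr_topics, transitions):
    """Group topics at one time step into birth, end, split and merge nodes.

    transitions is a set of (s, r) pairs meaning that documents moved from
    topic s at time t-1 to topic r at time t.
    """
    targets = defaultdict(set)  # topic at t-1 -> topics it flows into
    sources = defaultdict(set)  # topic at t   -> topics it comes from
    for s, r in transitions:
        targets[s].add(r)
        sources[r].add(s)
    return {
        "birth": sorted(r for r in curr_topics if not sources.get(r)),
        "end": sorted(s for s in prev_topics if not targets.get(s)),
        "split": sorted(s for s in targets if len(targets[s]) > 1),
        "merge": sorted(r for r in sources if len(sources[r]) > 1),
    }
```

A caller could then pair each node with its symbol, for example `[(topic, MARKERS[kind]) for kind, topics in nodes.items() for topic in topics]`, before drawing it on the topic stream.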
Step S502: screen out, from the plurality of topics, a plurality of first topics containing important events.
In an embodiment, the first topics are preferably topics that undergo splitting or merging. The splitting and merging of topics may be expressed as scores. Specifically, an information entropy algorithm may be used to calculate the scores. The score of a merged topic may be calculated by the following formula:
Figure PCTCN2018090694-appb-000039
where R(r, t) is the ranking score of cluster r at time t and N_r is the number of elements flowing into cluster r. The score of a split topic may be calculated by the following formula:
Figure PCTCN2018090694-appb-000040
where R(s, t) is the ranking score of cluster s at time t and N_s is the number of elements flowing into cluster r.
In an embodiment, the top-ranked topics (with the scores sorted in descending order) may be selected, according to the calculated score of each topic, as the first topics containing the important events. For example, the ten highest-scoring topics are selected as the first topics. The first topics may also be marked on the topic stream with a specific color or marker symbol.
Step S504: extract the keywords of each of the first topics, and determine the association relationships among the keywords of each of the first topics.
In an embodiment, the TF-IDF algorithm may be used to extract the keywords of each of the first topics. The TF-IDF algorithm evaluates how important a word is to a topic text. The importance of a word increases in proportion to the number of times it appears in the text. In the TF-IDF calculation, the TF-IDF value of a word is obtained from its term frequency (TF) and inverse document frequency (IDF); the more important the word is to the topic text, the larger its TF-IDF value. The words with the highest TF-IDF values may be taken as the keywords of the topic text. For example, the five words with the highest TF-IDF values are used as the keywords of the first topic.
In an embodiment, the association relationships among the keywords of each of the first topics may also be determined by a hierarchical Dirichlet process.
In an embodiment, the node position of each of the first topics in the topic stream may further be taken into account when determining the association relationships among its keywords.
Step S506: add the keywords of each of the first topics and their association relationships to the topic stream, so as to generate a topic evolution map corresponding to the plurality of text materials.
在一实施方式中,可以将每一所述第一主题的关键字及其关联关系可视化为词云交叠在所述主题流上。话题演变脉络图可以通过投影屏、显示器等设备进行显示。In an embodiment, the keywords of each of the first topics and their associated relationships may be visualized as word clouds overlapping on the topic stream. The topic evolution map can be displayed through projection screens, displays, and other devices.
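One way to picture step S506 is as annotating the nodes of the topic stream with the data a renderer would need to draw each word cloud. The sketch below shows only that data step; the `TopicNode` class, its fields, and `annotate_stream` are hypothetical names, and the actual drawing of the stream and word clouds is left out.

```python
from dataclasses import dataclass, field


@dataclass
class TopicNode:
    """A hypothetical node of the topic stream: one topic at one time step."""
    topic_id: str
    time_step: int
    keywords: list = field(default_factory=list)
    keyword_links: list = field(default_factory=list)  # (word_a, word_b, weight)


def annotate_stream(stream, first_topic_keywords, keyword_links):
    """Attach keywords and keyword links to the nodes of the first topics.

    Nodes whose topic is not a first topic are left unchanged.
    """
    for node in stream:
        if node.topic_id in first_topic_keywords:
            node.keywords = list(first_topic_keywords[node.topic_id])
            node.keyword_links = list(keyword_links.get(node.topic_id, []))
    return stream


stream = [TopicNode("t1", 0), TopicNode("t2", 1)]
annotate_stream(
    stream,
    {"t2": ["rescue", "city"]},
    {"t2": [("rescue", "city", 2)]},
)
print(stream[1].keywords)  # ['rescue', 'city']
```

A renderer could then walk the annotated nodes in time order and overlay a word cloud wherever `keywords` is non-empty.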
Through the above steps S500-S508, the visual presentation method for topic evolution proposed in the present application first extracts the topics of a plurality of text materials relating to the same event and determines the association relationships between the topics, so as to establish a topic stream; second, it identifies the node positions in the topic stream at which each topic is generated, split, merged, and ended, and marks these node positions with different marker symbols; next, it filters out, from the plurality of topics, a plurality of first topics containing important events; then, it extracts the keywords of each first topic and determines the association relationships among the keywords of each first topic; finally, it adds the keywords of each first topic and their association relationships to the topic stream, to generate a topic evolution context map corresponding to the plurality of text materials. In this way, topics can be mined from time-ordered social events, and the evolution of an event can be presented visually as a topic stream that changes over time. This gives users a better understanding of how the topic evolved and of the major events within it, avoids topic drift caused by topic association, helps users grasp the deeper meaning of the topic, and helps them avoid mistaken perceptions or decisions.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A visual presentation method for topic evolution, applied to an application server, characterized in that the method comprises:
    extracting the topics of a plurality of text materials relating to the same event, and determining the association relationships between the topics, so as to establish a topic stream;
    filtering out, from the plurality of topics, a plurality of first topics containing important events;
    extracting the keywords of each first topic, and determining the association relationships among the keywords of each first topic; and
    adding the keywords of each first topic and their association relationships to the topic stream, to generate a topic evolution context map corresponding to the plurality of text materials.
  2. The visual presentation method according to claim 1, characterized in that the visual presentation method further comprises:
    preprocessing the plurality of text materials, the preprocessing comprising: segmenting the text materials, converting between traditional and simplified Chinese, replacing ambiguous words, and removing stop words, low-frequency words, digits, and punctuation marks.
  3. The visual presentation method according to claim 1, characterized in that, after the step of establishing the topic stream, the method further comprises:
    identifying the node positions in the topic stream at which each topic is generated, split, merged, and ended; and
    marking the node positions at which each topic is generated, split, merged, and ended with different marker symbols.
  4. The visual presentation method according to any one of claims 1-3, characterized in that the step of determining the association relationships between the topics so as to establish a topic stream comprises:
    determining the association relationships between the topics by a hierarchical Dirichlet process, so as to establish the topic stream;
    wherein the hierarchical Dirichlet process comprises calculating, from time t-1 to time t, the proportion of cluster r that comes from cluster s, and, from time t-1 to time t, the proportion of cluster s that flows to cluster r, so as to determine the association relationships between the topics; the i-th material arriving at time t is denoted as
    Figure PCTCN2018090694-appb-100001
    and the cluster in which it is located is denoted as
    Figure PCTCN2018090694-appb-100002
    the proportion of cluster r that comes from cluster s is calculated by the following formula:
    Figure PCTCN2018090694-appb-100003
    the proportion of cluster s that flows to cluster r is calculated by the following formula:
    Figure PCTCN2018090694-appb-100004
  5. The visual presentation method according to any one of claims 1-3, characterized in that the step of filtering out, from the plurality of topics, a plurality of first topics containing important events comprises:
    calculating a score for each topic using an information entropy algorithm; and
    filtering out, from the plurality of topics, the plurality of first topics containing important events according to the calculated scores;
    wherein the calculation formula of the information entropy algorithm is:
    Figure PCTCN2018090694-appb-100005
    where R(r,t) is the ranking score of cluster r at time t, and N_r is the number of elements flowing into cluster r.
  6. The visual presentation method according to claim 4, characterized in that the step of extracting the keywords of each first topic and determining the association relationships among the keywords of each first topic comprises:
    extracting the keywords of each first topic using the TF-IDF algorithm; and
    determining the association relationships among the keywords of each first topic by a hierarchical Dirichlet process.
  7. The visual presentation method according to claim 5, characterized in that the step of extracting the keywords of each first topic and determining the association relationships among the keywords of each first topic comprises:
    extracting the keywords of each first topic using the TF-IDF algorithm; and
    determining the association relationships among the keywords of each first topic by a hierarchical Dirichlet process.
  8. An application server, characterized in that the application server comprises a memory and a processor, the memory storing a visual presentation system for topic evolution that is executable on the processor, wherein the visual presentation system for topic evolution, when executed by the processor, implements the following steps:
    extracting the topics of a plurality of text materials relating to the same event, and determining the association relationships between the topics, so as to establish a topic stream;
    filtering out, from the plurality of topics, a plurality of first topics containing important events;
    extracting the keywords of each first topic, and determining the association relationships among the keywords of each first topic; and
    adding the keywords of each first topic and their association relationships to the topic stream, to generate a topic evolution context map corresponding to the plurality of text materials.
  9. The application server according to claim 8, characterized in that the visual presentation system for topic evolution, when executed by the processor, further implements the following step:
    preprocessing the plurality of text materials, the preprocessing comprising: segmenting the text materials, converting between traditional and simplified Chinese, replacing ambiguous words, and removing stop words, low-frequency words, digits, and punctuation marks.
  10. The application server according to claim 8, characterized in that, after the step of establishing the topic stream, the steps further comprise:
    identifying the node positions in the topic stream at which each topic is generated, split, merged, and ended; and
    marking the node positions at which each topic is generated, split, merged, and ended with different marker symbols.
  11. The application server according to any one of claims 8-10, characterized in that the step of determining the association relationships between the topics so as to establish a topic stream comprises:
    determining the association relationships between the topics by a hierarchical Dirichlet process, so as to establish the topic stream;
    wherein the hierarchical Dirichlet process comprises calculating, from time t-1 to time t, the proportion of cluster r that comes from cluster s, and, from time t-1 to time t, the proportion of cluster s that flows to cluster r, so as to determine the association relationships between the topics; the i-th material arriving at time t is denoted as
    Figure PCTCN2018090694-appb-100006
    and the cluster in which it is located is denoted as
    Figure PCTCN2018090694-appb-100007
    the proportion of cluster r that comes from cluster s is calculated by the following formula:
    Figure PCTCN2018090694-appb-100008
    the proportion of cluster s that flows to cluster r is calculated by the following formula:
    Figure PCTCN2018090694-appb-100009
  12. The application server according to any one of claims 8-10, characterized in that the step of filtering out, from the plurality of topics, a plurality of first topics containing important events comprises:
    calculating a score for each topic using an information entropy algorithm; and
    filtering out, from the plurality of topics, the plurality of first topics containing important events according to the calculated scores;
    wherein the calculation formula of the information entropy algorithm is:
    Figure PCTCN2018090694-appb-100010
    where R(r,t) is the ranking score of cluster r at time t, and N_r is the number of elements flowing into cluster r.
  13. The application server according to claim 11, characterized in that the step of extracting the keywords of each first topic and determining the association relationships among the keywords of each first topic comprises:
    extracting the keywords of each first topic using the TF-IDF algorithm; and
    determining the association relationships among the keywords of each first topic by a hierarchical Dirichlet process.
  14. The application server according to claim 12, characterized in that the step of extracting the keywords of each first topic and determining the association relationships among the keywords of each first topic comprises:
    extracting the keywords of each first topic using the TF-IDF algorithm; and
    determining the association relationships among the keywords of each first topic by a hierarchical Dirichlet process.
  15. A computer readable storage medium, the computer readable storage medium storing a visual presentation system for topic evolution, the visual presentation system for topic evolution being executable by at least one processor to cause the at least one processor to perform the following steps:
    extracting the topics of a plurality of text materials relating to the same event, and determining the association relationships between the topics, so as to establish a topic stream;
    filtering out, from the plurality of topics, a plurality of first topics containing important events;
    extracting the keywords of each first topic, and determining the association relationships among the keywords of each first topic; and
    adding the keywords of each first topic and their association relationships to the topic stream, to generate a topic evolution context map corresponding to the plurality of text materials.
  16. The computer readable storage medium according to claim 15, characterized in that the visual presentation system for topic evolution, when executed by the processor, further implements the following step:
    preprocessing the plurality of text materials, the preprocessing comprising: segmenting the text materials, converting between traditional and simplified Chinese, replacing ambiguous words, and removing stop words, low-frequency words, digits, and punctuation marks.
  17. The computer readable storage medium according to claim 15, characterized in that, after the step of establishing the topic stream, the steps further comprise:
    identifying the node positions in the topic stream at which each topic is generated, split, merged, and ended; and
    marking the node positions at which each topic is generated, split, merged, and ended with different marker symbols.
  18. The computer readable storage medium according to any one of claims 15-17, characterized in that the step of determining the association relationships between the topics so as to establish a topic stream comprises:
    determining the association relationships between the topics by a hierarchical Dirichlet process, so as to establish the topic stream;
    wherein the hierarchical Dirichlet process comprises calculating, from time t-1 to time t, the proportion of cluster r that comes from cluster s, and, from time t-1 to time t, the proportion of cluster s that flows to cluster r, so as to determine the association relationships between the topics; the i-th material arriving at time t is denoted as
    Figure PCTCN2018090694-appb-100011
    and the cluster in which it is located is denoted as
    Figure PCTCN2018090694-appb-100012
    the proportion of cluster r that comes from cluster s is calculated by the following formula:
    Figure PCTCN2018090694-appb-100013
    the proportion of cluster s that flows to cluster r is calculated by the following formula:
    Figure PCTCN2018090694-appb-100014
  19. The computer readable storage medium according to any one of claims 15-17, characterized in that the step of filtering out, from the plurality of topics, a plurality of first topics containing important events comprises:
    calculating a score for each topic using an information entropy algorithm; and
    filtering out, from the plurality of topics, the plurality of first topics containing important events according to the calculated scores;
    wherein the calculation formula of the information entropy algorithm is:
    Figure PCTCN2018090694-appb-100015
    where R(r,t) is the ranking score of cluster r at time t, and N_r is the number of elements flowing into cluster r.
  20. The computer readable storage medium according to claim 19, characterized in that the step of extracting the keywords of each first topic and determining the association relationships among the keywords of each first topic comprises:
    extracting the keywords of each first topic using the TF-IDF algorithm; and
    determining the association relationships among the keywords of each first topic by a hierarchical Dirichlet process.
PCT/CN2018/090694 2018-01-12 2018-06-11 Presentation method for visualization of topic evolution, application server, and computer readable storage medium WO2019136920A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810031859.7A CN108170838B (en) 2018-01-12 2018-01-12 Topic evolution visualization display method, application server and computer readable storage medium
CN201810031859.7 2018-01-12

Publications (1)

Publication Number Publication Date
WO2019136920A1 true WO2019136920A1 (en) 2019-07-18

Family

ID=62514662

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/090694 WO2019136920A1 (en) 2018-01-12 2018-06-11 Presentation method for visualization of topic evolution, application server, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108170838B (en)
WO (1) WO2019136920A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328747A (en) * 2020-11-06 2021-02-05 平安科技(深圳)有限公司 Event context generation method and device, terminal equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287172A (en) * 2020-10-29 2021-01-29 药渡经纬信息科技(北京)有限公司 Video album generating method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150186532A1 (en) * 2013-12-31 2015-07-02 Google Inc. Generating a News Timeline
CN104915446A (en) * 2015-06-29 2015-09-16 华南理工大学 Automatic extracting method and system of event evolving relationship based on news
CN106951554A (en) * 2017-03-29 2017-07-14 浙江大学 A kind of stratification hot news and its excavation and the method for visualizing of evolution
CN107315807A (en) * 2017-06-26 2017-11-03 三螺旋大数据科技(昆山)有限公司 Talent recommendation method and apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231640B (en) * 2007-01-22 2010-09-22 北大方正集团有限公司 Method and system for automatically computing subject evolution trend in the internet
CN103177024A (en) * 2011-12-23 2013-06-26 微梦创科网络科技(中国)有限公司 Method and device of topic information show
CN103473263B (en) * 2013-07-18 2017-02-08 大连理工大学 News event development process-oriented visual display method
JP6270216B2 (en) * 2014-09-25 2018-01-31 Kddi株式会社 Clustering apparatus, method and program
CN106649726A (en) * 2016-12-23 2017-05-10 中山大学 Association-topic evolution mining method in social network

Also Published As

Publication number Publication date
CN108170838A (en) 2018-06-15
CN108170838B (en) 2022-07-08

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18900400

Country of ref document: EP

Kind code of ref document: A1