CN111324801B - Hot event discovery method in judicial field based on hot words - Google Patents

Publication number
CN111324801B
Authority
CN
China
Prior art keywords
hot
news
words
public
public opinion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010096023.2A
Other languages
Chinese (zh)
Other versions
CN111324801A (en
Inventor
余正涛
梁昊远
毛存礼
郭军军
黄于欣
张勇丙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202010096023.2A
Publication of CN111324801A
Application granted
Publication of CN111324801B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a hot-word-based method for discovering hot events in the judicial field, belonging to the field of natural language processing. The method processes crawled judicial public opinion news, segments the text with the HanLP tool, extracts the public opinion elements in the news, performs word frequency statistics on those elements to obtain a hot word set, establishes the correspondence between hot words and public opinion news, and evaluates the public opinion news pairwise with a similarity calculation system to determine whether the items should be merged. The similarity calculation system comprises three subsystems: a text similarity calculation system based on public opinion news text element statistics, a BERT-based public opinion title similarity calculation system, and a tf-idf-based public opinion title similarity calculation system. Each subsystem applies its own threshold to decide whether the two public opinion texts belong to the same hot event, and the final decision on whether the two items belong to the same hot event is made from the subsystem results.

Description

Hot event discovery method in judicial field based on hot words
Technical Field
The invention relates to a hot event discovery method in the judicial field based on hot words, belonging to the technical field of natural language processing.
Background
With the rapid development of the internet, social networks have become important channels for obtaining and sharing news. A large volume of public opinion about the judicial field also appears on the internet. Because this public opinion concentrates on high-profile cases, public attention easily focuses on sensitive events, and public opinion tends to spread in a viral outbreak. How judicial authorities can discover these hot events quickly and efficiently is therefore an important question.
Data on the internet is highly time-sensitive, and the volume of public opinion about a hot case can grow explosively within a short time. To discover hot events in judicial public opinion, internet data must therefore be crawled regularly, at intervals that are not too long. As a result, judicial-domain data has the following characteristics: (1) the number of topics in the data is difficult to predict; (2) the distribution of public opinion across topics is unbalanced; (3) the data contains a large amount of noise. Because of these characteristics, a conventional topic model cannot obtain accurate topics.
In summary, a hot event discovery method for the judicial field is needed that increases the speed of hot event discovery and improves its accuracy.
Disclosure of Invention
To solve the above problems, the invention provides a hot event discovery method in the judicial field based on hot words.
The technical scheme of the invention is as follows: the hot-word-based method for discovering hot events in the judicial field comprises the following specific steps:
Step1: crawl judicial public opinion news with a crawler, preprocess the data, segment the text with an open source tool, extract the elements in the Chinese public opinion news to obtain an element set, and perform word frequency statistics on the elements;
Step2: if the database contains no hot event, define each element whose word frequency in Step1 is greater than or equal to the threshold as a hot word, calculate the similarity between the public opinion news items corresponding to each hot word through a similarity calculation system, and decide whether to merge them according to the returned result; if the similarity is greater than or equal to the threshold, merge the public opinion news and the corresponding hot words into a hot event, and discard public opinion news whose similarity is below the threshold;
Step3: if the database already contains hot events, define each element whose word frequency in Step1 is greater than or equal to the threshold as a hot word and add it to a hot word set; compare each element whose word frequency is below the threshold with the hot words under the original hot events, and if the element appears in the hot word set under an original hot event, define it as a hot word and add it to the hot word set, otherwise discard it;
Step4: through the similarity calculation system, calculate the similarity between the public opinion news corresponding to the hot words in the hot word set obtained in Step3 and the public opinion news under the original hot events, and decide from the result whether the public opinion news belongs to an original hot event or to a new hot event; if the similarity is greater than or equal to the threshold, assign the hot words and the corresponding public opinion news to the original hot event, otherwise create a new hot event from the public opinion news.
Further, the specific steps of Step1 are as follows:
crawling judicial public opinion news from Sina Weibo and news websites with a crawler, and preprocessing the crawled data to obtain news data;
segmenting the obtained judicial public opinion data with the open source tool HanLP and extracting public opinion elements to obtain an element set;
and performing word frequency statistics on the public opinion element set.
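The word frequency statistics in Step1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the elements of each news item have already been extracted (for example by HanLP), and it counts every listed occurrence; the function name and sample data are hypothetical.

```python
from collections import Counter

def count_element_frequencies(news_elements):
    """Count how often each public opinion element occurs across all news items.

    news_elements: list of lists; each inner list holds the elements
    (e.g. person and organization names) already extracted from one news item.
    """
    counts = Counter()
    for elements in news_elements:
        counts.update(elements)
    return counts

# Toy input: elements assumed to be pre-extracted by a segmenter.
sample = [
    ["Court A", "Zhang San"],
    ["Zhang San", "Police B"],
    ["Zhang San", "Court A"],
]
freq = count_element_frequencies(sample)
print(freq.most_common(2))  # [('Zhang San', 3), ('Court A', 2)]
```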
Further, the specific steps of Step2 are as follows:
if the database contains no hot event, defining the elements whose word frequency is greater than or equal to the threshold as hot words, and establishing the correspondence between each hot word and the public opinion news it comes from;
and calculating the similarity between the public opinion news items corresponding to each hot word through the similarity calculation system, and merging the public opinion news for which the similarity calculation system returns "True", together with the corresponding hot words, into a hot event.
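Selecting hot words and recording which news items they come from might look like the sketch below. The function name, the dictionary layout, and the sample data are assumptions made for illustration; the default threshold of 10 is the value the embodiment states for the word frequency threshold.

```python
from collections import Counter, defaultdict

def select_hot_words(news_elements, threshold=10):
    """Return {hot_word: [news ids]} for elements whose frequency >= threshold.

    news_elements: dict mapping a news id to the list of elements
    extracted from that news item.
    """
    counts = Counter()
    sources = defaultdict(list)
    for news_id, elements in news_elements.items():
        counts.update(elements)
        for element in set(elements):
            sources[element].append(news_id)
    return {w: sorted(sources[w]) for w, c in counts.items() if c >= threshold}

news = {
    "d1": ["Zhang San", "Court A"],
    "d2": ["Zhang San"],
    "d3": ["Zhang San", "Police B"],
}
# A low threshold is used here only so the toy data produces a hot word.
hot = select_hot_words(news, threshold=3)
print(hot)  # {'Zhang San': ['d1', 'd2', 'd3']}
```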
Further, the specific steps of Step4 are as follows:
establishing the correspondence between the hot words in the hot word set obtained in Step3 and the public opinion news they come from, to obtain a hot word-public opinion news set;
calculating, through the similarity calculation system, the similarity between all public opinion news in the hot word-public opinion news set and the public opinion news under the original hot events;
if the final returned result is "True", merging the public opinion news and its corresponding hot words with the public opinion news and hot words under the original hot event;
and if the final returned result is "False", creating a new hot event from the public opinion news and its corresponding hot words.
Further, the similarity calculation system includes the following subsystems:
a text similarity calculation system based on public opinion news text element statistics:
the public opinion elements of each public opinion text are counted, and the similarity between two public opinion texts is calculated from the co-occurrence of their element words; if the similarity is greater than or equal to a threshold, "True" is returned, otherwise "False" is returned;
a BERT-based public opinion title similarity calculation system:
each public opinion title is represented with word vectors pre-trained by BERT, and the similarity between two titles is calculated from the Euclidean distance between the two title texts; if the similarity is greater than or equal to a threshold, "True" is returned, otherwise "False" is returned;
a tf-idf-based public opinion title similarity calculation system:
each public opinion title is represented with tf-idf, and the similarity between two titles is calculated as the cosine similarity between the two title texts; if the similarity is greater than or equal to a threshold, "True" is returned, otherwise "False" is returned;
finally, a decision is made from the results of the three subsystems: if two or more of the three results are "True", the two public opinion texts are considered similar and belong to the same hot event.
According to the concept of the present invention, the present invention also provides a device for discovering public opinion hot events in the judicial field based on hot words, as shown in FIG. 4, the device comprising:
a data acquisition module, which acquires judicial public opinion news from the network using web crawler technology and preprocesses and segments the data;
an element extraction module, which extracts public opinion elements from the obtained judicial public opinion data using the open source tool HanLP (https://github.com/hankcs/HanLP);
a word frequency statistics module, which performs word frequency statistics on the public opinion elements obtained by the element extraction module;
a similarity calculation module, which calculates the similarity between public opinion news items using the elements of the public opinion news, tf-idf, and word vectors pre-trained by BERT;
and a merging module, which decides whether to merge according to the result of the similarity calculation module.
The beneficial effects of the invention are:
The method discovers hot events in the judicial field by defining hot words, so that a text can be represented by a few simple words and a small number of words is prevented from interfering with the representation of the text. The text similarity calculation system combines word frequency statistics, BERT pre-trained word vectors, and tf-idf to represent texts, and calculates text similarity with the Euclidean distance and cosine similarity, thereby improving the accuracy of hot event discovery.
Compared with a traditional topic model, the method and device provided by the invention have a simpler structure and achieve higher efficiency and accuracy when the data volume is small, the distribution is unbalanced, and the noise is heavy.
Drawings
FIG. 1 is a schematic diagram of the process steps of the present invention;
FIG. 2 is a schematic flow diagram of a process of the present invention;
FIG. 3 is a schematic flow chart of a similarity calculation system according to the present invention;
FIG. 4 is a schematic structural diagram of the device of the present invention.
Detailed Description
Example 1: FIG. 1 shows the steps of the hot-word-based hot event discovery method in the judicial field, FIG. 2 is a schematic flow chart of the method of the present invention, FIG. 3 is a schematic diagram of the similarity calculation system of the present invention, and FIG. 4 is a schematic structural diagram of the device of the present invention.
The method comprises the following specific steps:
Step A: crawl judicial public opinion news with a crawler, preprocess the data, segment the text with the open source tool HanLP (https://github.com/hankcs/HanLP), extract the elements in the Chinese public opinion news to obtain an element set, and perform word frequency statistics on the elements;
Step B: if the database contains no hot event, define the elements whose word frequency in step A is greater than or equal to the threshold as hot words and add them to a hot word set, calculate the similarity between the public opinion news items corresponding to the hot words through a similarity calculation system, and decide whether to merge them according to the returned result; if the similarity is greater than or equal to the threshold, merge the public opinion news and the corresponding hot words into a hot event, and discard public opinion news whose similarity is below the threshold;
Step C: if the database already contains hot events, define the elements whose word frequency in step A is greater than or equal to the threshold as hot words and add them to a hot word set; compare each element whose word frequency is below the threshold with the hot words under the original hot events, and if the element appears in the hot word set under an original hot event, define it as a hot word and add it to the hot word set, otherwise discard it;
Step D: through the similarity calculation system, calculate the similarity between the public opinion news corresponding to the hot words in the hot word set obtained in step C and the public opinion news under the original hot events; if the similarity calculation system returns "True", assign the hot words and the corresponding public opinion news to the original hot event, otherwise combine the hot words and the public opinion news into a new hot event.
In step A, the data of the invention mainly comprises judicial public opinion from Sina Weibo and various news websites. After data preprocessing, the open source tool HanLP (https://github.com/hankcs/HanLP) is used for word segmentation and element extraction. In public opinion news, person names and organization names are the most distinguishing features, whereas place names occur too frequently to serve as public opinion elements. Suppose the crawled public opinion news data set is $D = \{d_1, d_2, \ldots, d_N\}$, where N is the total number of texts. The element set obtained from D after HanLP element extraction is expressed as $X = \{x_1, x_2, \ldots, x_M\}$, where M is the number of public opinion elements. After the element set is obtained, word frequency statistics are performed on its elements;
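The element filter described in step A (keep person and organization names, drop place names) can be sketched as a filter over tagged words. This is an illustration only: it assumes the segmenter returns (word, tag) pairs and labels person names `nr`, organization names `nt`, and place names `ns`; whether the patent's HanLP configuration emits exactly these tags is an assumption, and the function name and sample data are hypothetical.

```python
# Assumed part-of-speech tags: "nr" = person name, "nt" = organization name.
KEEP_TAGS = {"nr", "nt"}

def extract_elements(tagged_words, keep_tags=KEEP_TAGS):
    """Keep only words whose tag marks them as a public opinion element."""
    return [word for word, tag in tagged_words if tag in keep_tags]

# Toy segmenter output; "ns" (place name) and "n" (common noun) are dropped.
tagged = [
    ("Zhang San", "nr"),
    ("Kunming", "ns"),
    ("verdict", "n"),
    ("Kunming Intermediate People's Court", "nt"),
]
elements = extract_elements(tagged)
print(elements)  # ['Zhang San', "Kunming Intermediate People's Court"]
```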
In step B, elements of the element set X whose word frequency is greater than or equal to a threshold are defined as hot words, and a hot word set $R = \{r_1, r_2, \ldots, r_n\}$ is established, where n is the total number of hot words. The correspondence between hot words and public opinion texts is then established. For example, if hot word $r_1$ comes from public opinion texts $\{d_1, d_2, d_3\}$, this is expressed as $r_1 \to \{d_1, d_2, d_3\}$. Based on practical conditions, the invention sets the word frequency threshold to 10. The similarity between public opinion texts is then calculated with a similarity calculation system, which comprises the following subsystems:
first, a text similarity calculation system based on public opinion news text element statistics;
second, a BERT-based public opinion title similarity calculation system;
third, a tf-idf-based public opinion title similarity calculation system.
The following describes the specific embodiments of the three subsystems:
The text similarity calculation system based on public opinion news text element statistics mainly uses the elements in the public opinion news, and the calculation formula is as follows:
$$F = \frac{|x_i \cap x_j|}{|x|_{\max}}$$
where F represents the similarity of the two texts, $|x_i \cap x_j|$ represents the number of elements shared by the two public opinion texts, and $|x|_{\max}$ represents the larger of the element counts of the two public opinion texts.
If the similarity of the two texts is greater than or equal to the threshold, the two texts are considered to belong to the same hot event and "True" is returned; otherwise "False" is returned. The threshold is set to 0.4.
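The element-overlap similarity described above (shared elements divided by the larger element count, compared against 0.4) can be sketched as below. The function name is hypothetical, and treating each text's elements as a set and returning 0 for two empty element lists are assumptions the source does not address.

```python
def element_similarity(elements_a, elements_b, threshold=0.4):
    """Return (score, decision) for two texts given their extracted elements.

    score = number of shared elements / larger of the two element counts.
    """
    set_a, set_b = set(elements_a), set(elements_b)
    larger = max(len(set_a), len(set_b))
    if larger == 0:
        # Assumption: texts without any elements are not considered similar.
        return 0.0, False
    score = len(set_a & set_b) / larger
    return score, score >= threshold

score, same = element_similarity(
    ["Zhang San", "Court A", "Police B"],
    ["Court A", "Police B", "Li Si", "Bureau C"],
)
print(score, same)  # 0.5 True
```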
The BERT-based public opinion title similarity calculation system mainly uses Chinese word vectors pre-trained with the BERT model proposed by Google, together with the Euclidean distance, to calculate text similarity. The specific flow is as follows:
first, a vocabulary is constructed from the word segmentation results, in which each word has a corresponding ID;
then the text length L is set to 15: texts with more than L words are truncated, and texts with fewer than L words are padded with 0 at the end;
the text is represented with the open source tool BERT (https://github.com/ymcui/Chinese-BERT-wwm), each word being represented as a vector
$$w_i \in \mathbb{R}^{D}$$
where D is the dimension of the word vector. The text is then
$$T = [w_1, w_2, \ldots, w_L] \in \mathbb{R}^{L \times D}$$
The word vectors of each text are then added word by word, i.e.
$$v = \sum_{i=1}^{L} w_i$$
so that finally
$$v \in \mathbb{R}^{D}$$
The similarity of two texts is then measured with the Euclidean distance, the most common distance formula, which measures the absolute distance between points in a multi-dimensional space. For the vectors u and v of two titles, the calculation formula is as follows:
$$d(u, v) = \sqrt{\sum_{k=1}^{D} (u_k - v_k)^2}$$
If the similarity of the two texts is greater than or equal to the threshold, the two texts are considered to belong to the same hot event and "True" is returned; otherwise "False" is returned. The threshold is set to 0.5.
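The title-vector flow above (pad or truncate to 15 IDs, sum the word vectors, measure the Euclidean distance) can be sketched in plain Python. This is a sketch under stated assumptions: a small hypothetical lookup table stands in for the pre-trained BERT vectors, padding ID 0 is mapped to a zero vector, and the mapping from distance to similarity, 1 / (1 + distance), is an assumption because the source does not say how the distance is turned into the similarity that is compared with 0.5.

```python
import math

MAX_LEN = 15  # title length L stated in the text

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=0):
    """Cut titles longer than max_len and pad shorter ones with pad_id."""
    return (list(token_ids) + [pad_id] * max_len)[:max_len]

def title_vector(token_ids, embeddings, dim):
    """Sum the word vectors of a padded title; unknown or padding IDs add zeros."""
    total = [0.0] * dim
    for token_id in pad_or_truncate(token_ids):
        vec = embeddings.get(token_id, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return total

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_to_similarity(distance):
    # Assumption: this conversion is not given in the source.
    return 1.0 / (1.0 + distance)

# Hypothetical 2-dimensional vectors standing in for BERT word vectors.
toy_embeddings = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
va = title_vector([1, 2], toy_embeddings, dim=2)
vb = title_vector([3], toy_embeddings, dim=2)
vc = title_vector([1, 1, 1], toy_embeddings, dim=2)
print(distance_to_similarity(euclidean_distance(va, vb)) >= 0.5)  # True
print(distance_to_similarity(euclidean_distance(va, vc)) >= 0.5)  # False
```

With real BERT vectors the summed vectors have much larger norms, so a distance-based score would need a scale that suits the 0.5 threshold; the sketch does not attempt to reproduce that calibration.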
The tf-idf-based public opinion title similarity calculation system mainly uses tf-idf to represent texts and cosine similarity to calculate the similarity between them. The specific flow is as follows:
first, a vocabulary is constructed from the word segmentation results, in which each word has a corresponding ID;
then the text length L is set to 15: texts with more than L words are truncated, and texts with fewer than L words are padded with 0 at the end;
the weight of each word is calculated with the tf-idf formula, where tf-idf comprises tf (term frequency) and idf (inverse document frequency), calculated as follows:
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
where $n_{i,j}$ is the number of times word $t_i$ occurs in document $d_j$, and $\sum_k n_{k,j}$ is the total number of occurrences of all words in document $d_j$.
$$idf_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing word $t_i$; one is added to the denominator to avoid a denominator of 0. Finally, the text may be represented by:
$$s = tf_{i,j} \times idf_i$$
The cosine similarity of two public opinion titles can therefore be calculated by the following formula, where A and B are the tf-idf vectors of the two titles:
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_k A_k B_k}{\sqrt{\sum_k A_k^2} \, \sqrt{\sum_k B_k^2}}$$
If the similarity of the two texts is greater than or equal to the threshold, the two texts are considered to belong to the same hot event and "True" is returned; otherwise "False" is returned. The threshold is set to 0.75.
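The tf-idf weighting and cosine comparison can be sketched as follows, using the term frequency and the smoothed inverse document frequency (one added to the denominator) described above. As a simplification that is not in the source, each title is represented as a vector over the whole vocabulary instead of a length-15 padded sequence; function names and sample titles are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_titles):
    """Return one tf-idf vector per title, ordered by the sorted vocabulary."""
    n_docs = len(tokenized_titles)
    doc_freq = Counter()
    for tokens in tokenized_titles:
        doc_freq.update(set(tokens))
    vocab = sorted(doc_freq)
    vectors = []
    for tokens in tokenized_titles:
        counts = Counter(tokens)
        total = len(tokens)
        vec = []
        for term in vocab:
            tf = counts[term] / total if total else 0.0
            idf = math.log(n_docs / (1 + doc_freq[term]))
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

titles = [
    ["court", "verdict", "zhang"],
    ["court", "verdict", "zhang"],
    ["police", "arrest"],
    ["weather", "report"],
]
vecs = tfidf_vectors(titles)
print(cosine_similarity(vecs[0], vecs[1]) >= 0.75)  # True
print(cosine_similarity(vecs[0], vecs[2]) >= 0.75)  # False
```

Note that with this smoothed idf a term that occurs in every document receives a slightly negative weight, so very small corpora can behave unexpectedly.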
Finally, the final result is determined from the return values of the three subsystems by the following rule: if two or more of the three subsystems return "True", the two texts are considered to belong to the same hot event and "True" is returned; otherwise "False" is returned.
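The two-out-of-three rule is small enough to state directly in code. The function and parameter names below are hypothetical; each argument is the True/False result of one subsystem.

```python
def same_hot_event(element_vote, bert_vote, tfidf_vote):
    """Return True when at least two of the three subsystem results are True."""
    votes = [bool(element_vote), bool(bert_vote), bool(tfidf_vote)]
    return sum(votes) >= 2

print(same_hot_event(True, False, True))   # True
print(same_hot_event(False, False, True))  # False
```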
Step C comprises the following steps. Step C01: define the elements of the element set X whose word frequency is greater than or equal to a threshold as hot words and establish a hot word set. Step C02: for the elements of X whose word frequency is below the threshold, judge whether each is a hot word under an original hot event. Step C03: establish the correspondence between the obtained hot words and their source public opinion news.
In step C01, the elements of the element set $X = \{x_1, x_2, \ldots, x_M\}$ obtained in step A whose word frequency is greater than or equal to a threshold are defined as hot words, and a hot word set $R = \{r_1, r_2, \ldots, r_n\}$ is established, where n is the total number of hot words; the threshold here is also set to 10;
in step C02, the hot word set R' under the original hot events is obtained, and the elements of X whose word frequency is below the threshold are compared with the hot words in R'. If an element appears among the hot words under an original hot event, i.e. $x_i \in R'$, the element is still defined as a hot word and is included in the hot word set R. Because a judicial case involves several events, such as the occurrence of the case, the police arrest, the court trial, and the second-instance trial, this rule prevents a case from being defined as a new hot event merely because of the time interval between those events;
in step C03, the correspondence $r_i \to \{d_1, d_2, \ldots, d_n\}$ is established between the hot words in the obtained hot word set $R = \{r_1, r_2, \ldots, r_n\}$ and their source public opinion news. For example, if hot word $r_1 \in R$ is an element that appears in public opinion texts $\{d_1, d_2, d_3\}$ with a word frequency greater than the threshold, this relationship is expressed as $r_1 \to \{d_1, d_2, d_3\}$.
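Steps C01 and C02 amount to one membership rule: an element becomes a hot word if it is frequent enough now, or if it is already a hot word of an existing event. A minimal sketch, with hypothetical names and sample data, assuming the word frequencies and the existing hot word set R' are already available:

```python
def build_hot_word_set(frequencies, existing_hot_words, threshold=10):
    """Return the hot word set R for the current crawl.

    frequencies: dict mapping each element to its word frequency.
    existing_hot_words: hot words under the original hot events (R').
    """
    hot_words = set()
    for element, count in frequencies.items():
        if count >= threshold or element in existing_hot_words:
            hot_words.add(element)
    return hot_words

frequencies = {"Zhang San": 12, "Court A": 3, "Li Si": 2}
existing = {"Court A"}
hot_words = build_hot_word_set(frequencies, existing)
print(sorted(hot_words))  # ['Court A', 'Zhang San']
```

Here "Court A" is kept despite its low frequency because it belongs to an earlier event, which is the behavior the text motivates with the gaps between arrest, trial, and appeal.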
In step D, using the correspondences $r_i \to \{d_1, d_2, \ldots, d_n\}$ between all hot words and public opinion texts obtained in step C, the similarity calculation system of step B judges whether two texts belong to the same hot event. If the returned result is "True", the hot word $r_i$ and its corresponding public opinion texts $\{d_1, d_2, \ldots, d_n\}$ are merged into the original hot event they were compared with. Otherwise, the similarity between the remaining public opinion texts is calculated pairwise again: texts for which "True" is returned are merged, together with their hot words, into a new hot event, and each remaining text is added as a new hot event of its own.
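The assignment in step D can be sketched as a loop that either attaches a news item to a matching event or opens a new one. This is an illustration under assumptions: the event layout (a dict with `hot_words` and `news`), the function names, and the rule that a match with any news item of an event is enough are not specified in the source; `is_same_event` stands for the combined similarity decision.

```python
def assign_to_events(new_items, events, is_same_event):
    """Attach each (hot_word, news) pair to a matching event or create a new one.

    events: list of dicts {"hot_words": set, "news": list}; modified in place.
    is_same_event: callable(news_a, news_b) -> bool.
    """
    for hot_word, news in new_items:
        for event in events:
            if any(is_same_event(news, old) for old in event["news"]):
                event["hot_words"].add(hot_word)
                event["news"].append(news)
                break
        else:
            # No existing event matched, so this item starts a new hot event.
            events.append({"hot_words": {hot_word}, "news": [news]})
    return events

def same_case(a, b):
    # Toy stand-in for the similarity system: compare the first token.
    return a.split()[0] == b.split()[0]

events = [{"hot_words": {"Zhang San"}, "news": ["caseA trial opens"]}]
new_items = [
    ("Zhang San", "caseA verdict announced"),
    ("Li Si", "caseB suspect arrested"),
    ("Li Si", "caseB court hearing"),
]
assign_to_events(new_items, events, same_case)
print(len(events))  # 2
```

Because newly created events are appended to the same list, later items can join them, which approximates the second pairwise comparison among items that matched no original event.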
According to the concept of the present invention, the present invention also provides a device for discovering public opinion hot events in the judicial field based on hot words, as shown in FIG. 4, the device comprising:
a data acquisition module, which acquires judicial public opinion news from the network using web crawler technology and preprocesses and segments the data;
an element extraction module, which extracts public opinion elements from the obtained judicial public opinion data using the open source tool HanLP (https://github.com/hankcs/HanLP);
a word frequency statistics module, which performs word frequency statistics on the public opinion elements obtained by the element extraction module;
a similarity calculation module, which calculates the similarity between public opinion news items using the elements of the public opinion news, tf-idf, and word vectors pre-trained by BERT;
and a merging module, which decides whether to merge according to the result of the similarity calculation module.
While the present invention has been described in detail with reference to the embodiments, the present invention is not limited to the embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (5)

1. A hot event discovery method in the judicial field based on hot words, characterized in that:
the method comprises the following specific steps:
Step1: crawling judicial public opinion news with a crawler, preprocessing the data, segmenting the text with an open source tool, extracting the elements in the Chinese public opinion news to obtain an element set, and performing word frequency statistics on the elements;
Step2: if the database contains no hot event, defining each element whose word frequency in Step1 is greater than or equal to the threshold as a hot word, calculating the similarity between the public opinion news items corresponding to each hot word through a similarity calculation system, and deciding whether to merge them according to the returned result;
Step3: if the database already contains hot events, defining each element whose word frequency in Step1 is greater than or equal to the threshold as a hot word and adding it to a hot word set; comparing each element whose word frequency is below the threshold with the hot words under the original hot events, and if the element appears in the hot word set under an original hot event, defining it as a hot word and adding it to the hot word set, otherwise discarding it;
Step4: calculating, through the similarity calculation system, the similarity between the public opinion news corresponding to the hot words in the hot word set obtained in Step3 and the public opinion news under the original hot events, and deciding from the result whether the public opinion news belongs to an original hot event or to a new hot event.
2. The hot event discovery method in the judicial field based on hot words of claim 1, wherein the specific steps of Step1 are as follows:
crawling judicial public opinion news from Sina Weibo and news websites with a crawler, and preprocessing the crawled data to obtain news data;
segmenting the obtained judicial public opinion data with the open source tool HanLP and extracting public opinion elements to obtain an element set;
and performing word frequency statistics on the public opinion element set.
3. The hot event discovery method in the judicial field based on hot words of claim 1, wherein the specific steps of Step2 are as follows:
if the database contains no hot event, defining the elements whose word frequency is greater than or equal to the threshold as hot words, and establishing the correspondence between each hot word and the public opinion news it comes from;
and calculating the similarity between the public opinion news items corresponding to each hot word through the similarity calculation system, and merging the public opinion news for which the similarity calculation system returns "True", together with the corresponding hot words, into a hot event.
4. The hot event discovery method in the judicial field based on hot words of claim 1, wherein the specific steps of Step4 are as follows:
establishing the correspondence between the hot words in the hot word set obtained in Step3 and the public opinion news they come from, to obtain a hot word-public opinion news set;
calculating, through the similarity calculation system, the similarity between all public opinion news in the hot word-public opinion news set and the public opinion news under the original hot events;
if the final returned result is "True", merging the public opinion news and its corresponding hot words with the public opinion news and hot words under the original hot event;
and if the final returned result is "False", creating a new hot event from the public opinion news and its corresponding hot words.
5. The judicial domain hot event discovery method based on hot words of claim 1, wherein the similarity calculation system comprises the following subsystems:
a text similarity calculation system based on statistics of public opinion news text elements,
which counts the public opinion elements of each public opinion text and calculates the similarity between two public opinion texts from the co-occurrence of their element words, returning 'True' if the similarity is greater than or equal to a threshold and 'False' otherwise;
a BERT-based public opinion title similarity calculation system,
which represents each public opinion title with word vectors pre-trained by BERT and calculates the similarity between two titles from the Euclidean distance between the two title texts, returning 'True' if the similarity is greater than or equal to a threshold and 'False' otherwise;
a public opinion title similarity calculation system based on tf-idf,
which represents each public opinion title with tf-idf and calculates the similarity between two titles as the cosine similarity between the two title texts, returning 'True' if the similarity is greater than or equal to a threshold and 'False' otherwise;
and a final judgment is made from the results of the three subsystems: if two or more of the three results are 'True', the two public opinion texts are considered similar texts belonging to the same hot event.
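The two-out-of-three vote can be sketched as follows. Several details are assumptions rather than part of the claim: element co-occurrence is measured as Jaccard overlap, the Euclidean distance d is mapped to a similarity 1 / (1 + d), the idf term uses a smoothed logarithm, BERT title vectors are supplied precomputed (no model is loaded), all thresholds default to 0.5, and results are Python booleans rather than the strings 'True' and 'False'. All function and field names are hypothetical.

```python
import math
from collections import Counter


def element_similarity(elems_a, elems_b):
    """Element co-occurrence, measured here as Jaccard overlap (assumption)."""
    a, b = set(elems_a), set(elems_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def euclidean_similarity(vec_a, vec_b):
    """Map Euclidean distance d to a similarity 1 / (1 + d) (assumption)."""
    return 1.0 / (1.0 + math.dist(vec_a, vec_b))


def tfidf_cosine(tokens_a, tokens_b, corpus):
    """Cosine similarity of tf-idf vectors; corpus is a list of token lists."""
    n_docs = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))

    def vectorize(tokens):
        # Smoothed idf (assumption): log((1 + N) / (1 + df)) + 1.
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                for t, c in Counter(tokens).items()}

    va, vb = vectorize(tokens_a), vectorize(tokens_b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0


def is_similar(a, b, corpus, thresholds=(0.5, 0.5, 0.5)):
    """Return True when at least two of the three subsystems vote True."""
    votes = [
        element_similarity(a["elements"], b["elements"]) >= thresholds[0],
        euclidean_similarity(a["title_vec"], b["title_vec"]) >= thresholds[1],
        tfidf_cosine(a["title_tokens"], b["title_tokens"], corpus) >= thresholds[2],
    ]
    return sum(votes) >= 2


# Hypothetical toy items; title_vec stands in for a BERT title embedding.
a = {"elements": ["court", "verdict", "appeal"], "title_vec": [0.1, 0.2],
     "title_tokens": ["court", "verdict"]}
b = {"elements": ["court", "verdict"], "title_vec": [0.1, 0.3],
     "title_tokens": ["court", "verdict", "appeal"]}
c = {"elements": ["football"], "title_vec": [5.0, 5.0],
     "title_tokens": ["football", "match"]}
corpus = [a["title_tokens"], b["title_tokens"], c["title_tokens"]]
```

Using a majority vote means that a single subsystem misjudging a pair does not by itself merge or split an event.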
CN202010096023.2A 2020-02-17 2020-02-17 Hot event discovery method in judicial field based on hot words Active CN111324801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010096023.2A CN111324801B (en) 2020-02-17 2020-02-17 Hot event discovery method in judicial field based on hot words

Publications (2)

Publication Number Publication Date
CN111324801A (en) 2020-06-23
CN111324801B (en) 2022-06-21

Family

ID=71172718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010096023.2A Active CN111324801B (en) 2020-02-17 2020-02-17 Hot event discovery method in judicial field based on hot words

Country Status (1)

Country Link
CN (1) CN111324801B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881275B (en) * 2020-07-24 2024-02-13 新华智云科技有限公司 Efficient hot spot identification and matching method
CN111984787A (en) * 2020-08-17 2020-11-24 深圳新闻网传媒股份有限公司 Public opinion hotspot obtaining method and system based on internet data
CN113343118A (en) * 2021-04-23 2021-09-03 东南大学 Hot event discovery method under mixed new media
CN113378023B (en) * 2021-05-24 2023-05-23 华北科技学院(中国煤矿安全技术培训中心) Civil public opinion and news information mining comparison visualization system
CN113609298A (en) * 2021-08-23 2021-11-05 南京擎盾信息科技有限公司 Data processing method and device for court public opinion corpus extraction

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577501A (en) * 2012-08-10 2014-02-12 深圳市世纪光速信息技术有限公司 Hot topic searching system and hot topic searching method
CN103823792A (en) * 2014-03-07 2014-05-28 网易(杭州)网络有限公司 Method and equipment for detecting hotspot events from text document
CN103870474A (en) * 2012-12-11 2014-06-18 北京百度网讯科技有限公司 News topic organizing method and device
CN104159158A (en) * 2013-05-15 2014-11-19 中兴通讯股份有限公司 Hotspot playing method and device of video file
CN106844786A (en) * 2016-12-08 2017-06-13 中国电子科技网络信息安全有限公司 A kind of public sentiment region focus based on text similarity finds method
CN106951498A (en) * 2017-03-15 2017-07-14 国信优易数据有限公司 Text clustering method
CN108170692A (en) * 2016-12-07 2018-06-15 腾讯科技(深圳)有限公司 A kind of focus incident information processing method and device
WO2018160747A1 (en) * 2017-02-28 2018-09-07 Laserlike Inc. Enhanced search to generate a feed based on a user's interests
CN110399478A (en) * 2018-04-19 2019-11-01 清华大学 Event finds method and apparatus

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577501A (en) * 2012-08-10 2014-02-12 深圳市世纪光速信息技术有限公司 Hot topic searching system and hot topic searching method
CN103870474A (en) * 2012-12-11 2014-06-18 北京百度网讯科技有限公司 News topic organizing method and device
CN104159158A (en) * 2013-05-15 2014-11-19 中兴通讯股份有限公司 Hotspot playing method and device of video file
CN103823792A (en) * 2014-03-07 2014-05-28 网易(杭州)网络有限公司 Method and equipment for detecting hotspot events from text document
CN108170692A (en) * 2016-12-07 2018-06-15 腾讯科技(深圳)有限公司 A kind of focus incident information processing method and device
CN106844786A (en) * 2016-12-08 2017-06-13 中国电子科技网络信息安全有限公司 A kind of public sentiment region focus based on text similarity finds method
WO2018160747A1 (en) * 2017-02-28 2018-09-07 Laserlike Inc. Enhanced search to generate a feed based on a user's interests
CN106951498A (en) * 2017-03-15 2017-07-14 国信优易数据有限公司 Text clustering method
CN110399478A (en) * 2018-04-19 2019-11-01 清华大学 Event finds method and apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
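Yanming Feng et al. Micro-blog topic detection method based on BTM topic model and K-means clustering algorithm. Automatic Control and Computer Sciences, 2016, Vol. 50. *
Li Hua (李华) et al. Influence-based detection of emerging hot events on microblogs [基于影响力的微博新兴热点事件检测]. Computer Applications and Software (《计算机应用与软件》), 2016, Vol. 33, No. 5. *
Liu Di (柳笛). Research on hot topic discovery in online education news based on a distributed framework [基于分布式框架的网络教育新闻热点话题发现研究]. China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), 2019. *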

Also Published As

Publication number Publication date
CN111324801A (en) 2020-06-23

Similar Documents

Publication Title
CN111324801B (en) Hot event discovery method in judicial field based on hot words
US8630972B2 (en) Providing context for web articles
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN102662952B (en) Chinese text parallel data mining method based on hierarchy
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN110543595B (en) In-station searching system and method
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN106033445B (en) The method and apparatus for obtaining article degree of association data
US10387805B2 (en) System and method for ranking news feeds
Irena et al. Fake news (hoax) identification on social media twitter using decision tree c4. 5 method
CN103902597A (en) Method and device for determining search relevant categories corresponding to target keywords
CN109558587B (en) Method for classifying public opinion tendency recognition aiming at category distribution imbalance
CN109710825A (en) Webpage harmful information identification method based on machine learning
CN113032557A (en) Microblog hot topic discovery method based on frequent word set and BERT semantics
CN106446124A (en) Website classification method based on network relation graph
JP5527845B2 (en) Document classification program, server and method based on textual and external features of document information
CN111859079B (en) Information searching method, device, computer equipment and storage medium
Guha Related Fact Checks: a tool for combating fake news
CN111930949B (en) Search string processing method and device, computer readable medium and electronic equipment
Chen et al. Research on clustering analysis of Internet public opinion
Zhang et al. A hot spot clustering method based on improved kmeans algorithm
JP5364010B2 (en) Sentence search program, server and method using non-search keyword dictionary for search keyword dictionary
CN113705217B (en) Literature recommendation method and device for knowledge learning in electric power field
CN113111645B (en) Media text similarity detection method
Saravanan et al. Extraction of Core Web Content from Web Pages using Noise Elimination.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant