CN111324801B - Hot event discovery method in judicial field based on hot words

Info

Publication number: CN111324801B
Application number: CN202010096023.2A
Authority: CN
Inventors: 余正涛; 梁昊远; 毛存礼; 郭军军; 黄于欣; 张勇丙
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2020-02-17
Filing date: 2020-02-17
Publication date: 2022-06-21
Anticipated expiration: 2040-02-17
Also published as: CN111324801A

Abstract

The invention relates to a method for discovering hot events in the judicial field based on hot words, belonging to the field of natural language processing. The method processes crawled judicial public opinion news, segments the text with the HanLP tool, extracts the public opinion elements of the news, performs word frequency statistics on the elements to obtain a hot word set, establishes the correspondence between the hot words and the public opinion news, and evaluates the news items pairwise with a similarity calculation system to decide whether they are merged. The similarity calculation system comprises three subsystems: a text similarity calculation system based on public opinion news text element statistics, a BERT-based public opinion title similarity calculation system, and a tf-idf-based public opinion title similarity calculation system. Each subsystem applies its own threshold to decide whether two public opinion texts belong to the same hot event, and the final decision on whether the two texts are the same hot event is made from the combined results of the subsystems.

Description

Hot event discovery method in judicial field based on hot words

Technical Field

The invention relates to a hot event discovery method in the judicial field based on hot words, belonging to the technical field of natural language processing.

Background

With the rapid development of the Internet, social networks have become important channels for obtaining and sharing news. A great deal of public opinion in the judicial field also appears on the Internet. Such public opinion centers on high-profile cases, so public attention easily concentrates on a few sensitive events and spreads in a viral outbreak. How judicial authorities can discover these hot events quickly and efficiently is therefore an important question.

On the internet, the real-time performance of data is very strong, and the number of public opinions related to some hot cases is very easy to increase explosively in a short time, so that in the discovery of the hot events of public opinions in the judicial field, the data in the internet needs to be crawled regularly and the time interval cannot be too long. This has led to the following characteristics of judicial domain data: (1) the number of data topics is difficult to predict; (2) the public sentiment distribution of each theme in the data is unbalanced; (3) the data is heavily populated with noise. Due to these characteristics, the conventional topic model cannot obtain an accurate topic.

In summary, it is desirable to provide a hot event discovery method in the judicial field, which can increase the hot event discovery speed and improve the accuracy.

Disclosure of Invention

In order to solve the problems, the invention provides a hot event discovery method in the judicial field based on hot words.

The technical scheme of the invention is as follows: the judicial domain hot event discovery method based on the hot words comprises the following specific steps:

Step1, crawling judicial public opinion news with a crawler, preprocessing the data, segmenting the text with an open source tool, extracting the elements of the Chinese public opinion news to obtain an element set, and performing word frequency statistics on the elements;

Step2, if the database contains no hot event, defining the elements whose word frequency in Step1 is greater than or equal to the threshold as hot words, calculating the similarity between the public opinion news items corresponding to each hot word through a similarity calculation system, and determining from the returned result whether they are merged; if the similarity is greater than or equal to the threshold, the public opinion news and the corresponding hot words are merged into a hot event, and public opinion news whose similarity is smaller than the threshold is discarded;

Step3, if the database contains a hot event, defining the elements whose word frequency in Step1 is greater than or equal to the threshold as hot words and adding them to a hot word set; comparing the elements whose word frequency is smaller than the threshold with the hot words under the original hot events, and, if an element appears in the hot word set under an original hot event, defining it as a hot word and adding it to the hot word set, otherwise discarding it;

Step4, calculating, through the similarity calculation system, the similarity between the public opinion news corresponding to the hot words in the hot word set obtained in Step3 and the public opinion news under the original hot events, and determining from the result whether the public opinion news belongs to an original hot event or to a new hot event; if the similarity is greater than or equal to the threshold, the hot words and the corresponding public opinion news are assigned to the original hot event, otherwise the public opinion news is added as a new hot event.

Further, the specific steps of Step1 are as follows:

crawling judicial public opinion news from Sina Weibo and news websites with crawlers, and preprocessing the crawled data to obtain news data;

using the open source tool HanLP to segment the obtained judicial public opinion data and extract the public opinion elements, obtaining an element set;

and performing word frequency statistics on the public opinion element set.
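The word frequency statistics of Step1 can be sketched as follows. This is only an illustration: it assumes the elements have already been extracted per news item (the HanLP call itself is not shown), it counts every occurrence of an element across the whole crawl (the source does not say whether occurrences or documents are counted), and the document IDs and element names are made up.

```python
from collections import Counter

def element_frequencies(doc_elements):
    """Count how often each public opinion element occurs across all news items.

    doc_elements maps a news ID to the list of elements (for example person
    and organization names) already extracted from that news item.
    """
    counts = Counter()
    for elements in doc_elements.values():
        counts.update(elements)
    return counts

# Hypothetical extraction results for three crawled news items.
docs = {
    "d1": ["person_A", "court_B"],
    "d2": ["person_A", "court_B", "person_A"],
    "d3": ["person_C"],
}
freq = element_frequencies(docs)
print(freq.most_common())
```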

Further, the specific steps of Step2 are as follows:

if the database contains no hot event, defining the elements whose word frequency is greater than or equal to the threshold as hot words, and establishing the correspondence between each hot word and the public opinion news it comes from;

and calculating the similarity between the public opinion news items corresponding to each hot word through the similarity calculation system, and merging the public opinion news for which the similarity calculation system returns 'True', together with the corresponding hot words, into a hot event.
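The hot word selection and the hot word to news correspondence of Step2 might look like the sketch below. The function name, the data layout and the sample data are assumptions made for illustration; the default threshold of 10 is the value the embodiment reports, and the demo lowers it to 2 only so that the tiny sample produces hot words.

```python
from collections import Counter, defaultdict

def select_hot_words(doc_elements, threshold=10):
    """Return {hot_word: set of news IDs it comes from}.

    An element becomes a hot word when its total frequency reaches the
    threshold (counting occurrences is an assumption of this sketch).
    """
    counts = Counter()
    sources = defaultdict(set)
    for doc_id, elements in doc_elements.items():
        counts.update(elements)
        for element in elements:
            sources[element].add(doc_id)
    return {word: sources[word] for word, n in counts.items() if n >= threshold}

# Hypothetical extraction results.
docs = {
    "d1": ["person_A", "court_B"],
    "d2": ["person_A", "court_B"],
    "d3": ["person_C"],
}
hot = select_hot_words(docs, threshold=2)
print(hot)
```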

Further, the specific steps of Step4 are as follows:

establishing the correspondence between the hot words in the hot word set obtained in Step3 and the public opinion news they come from, obtaining a hot word-public opinion news set;

calculating, through the similarity calculation system, the similarity between all public opinion news in the hot word-public opinion news set and the public opinion news under the original hot events;

if the final returned result is 'True', merging the public opinion news and its corresponding hot words with the public opinion news and hot words under the original hot event;

and if the final returned result is 'False', creating a new hot event from the public opinion news and its corresponding hot words.

Further, the similarity calculation system includes the following subsystems:

a text similarity calculation system based on public opinion news text element statistics;

the public opinion elements of each public opinion text are counted, and the similarity between two public opinion texts is calculated from the co-occurrence of their element words; if the similarity is greater than or equal to a threshold, 'True' is returned, otherwise 'False' is returned;

a BERT-based public opinion title similarity calculation system;

each public opinion title is represented using word vectors pre-trained by BERT, and the similarity between two titles is calculated from the Euclidean distance between the two title texts; if the similarity is greater than or equal to a threshold, 'True' is returned, otherwise 'False' is returned;

a public opinion title similarity calculation system based on tf-idf;

each public opinion title is represented using tf-idf, and the similarity between two titles is calculated as the cosine similarity between the two title texts; if the similarity is greater than or equal to a threshold, 'True' is returned, otherwise 'False' is returned;

and finally, a judgment is made from the results of the three subsystems: if two or more of the three results are 'True', the two public opinion texts are considered similar and to belong to the same hot event.

According to the concept of the present invention, the present invention also provides a public opinion hotspot event discovery device in judicial field based on hotspot words, as shown in fig. 4, the device comprising:

the data acquisition module, used for acquiring judicial public opinion news from the network with web crawler technology, and for preprocessing and segmenting the data;

the element extraction module, used for extracting public opinion elements from the obtained judicial public opinion data with the open source tool HanLP (https://github.com/hankcs/HanLP);

the word frequency statistics module, used for performing word frequency statistics on the public opinion elements obtained by the element extraction module;

the similarity calculation module, used for calculating the similarity between public opinion news items from the elements of the news, tf-idf, and word vectors pre-trained by BERT;

and the merging module, used for deciding whether to merge according to the result of the similarity calculation module.

The beneficial effects of the invention are:

According to the method, hot events in the judicial field are discovered by defining hot words, so that a text can be represented by simple words and the interference of a small number of words with the representation of the text is prevented; the text similarity calculation system combines word frequency statistics, BERT pre-trained word vectors and tf-idf to represent texts, and calculates the similarity of texts using the Euclidean distance and cosine similarity, thereby improving the accuracy of hot event discovery.

Compared with the traditional topic model, the method and the device provided by the invention have a simpler structure, and achieve higher efficiency and accuracy when the data volume is small, the distribution is unbalanced and the noise is heavy.

Drawings

FIG. 1 is a schematic diagram of the process steps of the present invention;

FIG. 2 is a schematic flow diagram of a process of the present invention;

FIG. 3 is a schematic flow chart of a similarity calculation system according to the present invention;

FIG. 4 is a schematic structural diagram of the device of the present invention.

Detailed Description

Example 1: fig. 1 shows a hot event discovery method in judicial fields based on hot words, fig. 2 is a schematic flow chart of the method of the present invention, fig. 3 is a schematic diagram of a similarity calculation system of the present invention, and fig. 4 is a schematic structural diagram of the apparatus of the present invention.

The method comprises the following specific steps:

Step A, crawling judicial public opinion news with a crawler, preprocessing the data, segmenting the text with the open source tool HanLP (https://github.com/hankcs/HanLP), extracting the elements of the Chinese public opinion news to obtain an element set, and performing word frequency statistics on the elements;

Step B, if the database contains no hot event, defining the elements whose word frequency in Step A is greater than or equal to the threshold as hot words and adding them to a hot word set, calculating the similarity between the public opinion news items corresponding to the hot words through a similarity calculation system, and determining from the returned result whether they are merged; if the similarity is greater than or equal to the threshold, the public opinion news and the corresponding hot words are merged into a hot event, and public opinion news whose similarity is smaller than the threshold is discarded;

Step C, if the database contains a hot event, defining the elements whose word frequency in Step A is greater than or equal to the threshold as hot words and adding them to a hot word set; comparing the elements whose word frequency is smaller than the threshold with the hot words under the original hot events, and, if an element appears in the hot word set under an original hot event, defining it as a hot word and adding it to the hot word set, otherwise discarding it;

Step D, calculating, through the similarity calculation system, the similarity between the public opinion news corresponding to the hot words in the hot word set obtained in Step C and the public opinion news under the original hot events; if the result returned by the similarity calculation system is 'True', the hot words and the corresponding public opinion news are assigned to the original hot event, otherwise the hot words and the public opinion news are combined into a new hot event.

In Step A, the data of the invention mainly consist of judicial public opinion from Sina Weibo and various news websites. After data preprocessing, the open source tool HanLP (https://github.com/hankcs/HanLP) is used for word segmentation and element extraction. In public opinion news, person names and organization names are the most distinguishing, whereas place names occur too frequently to be used as public opinion elements. Suppose the crawled public opinion news data set is $D = \{d_1, d_2, \ldots, d_N\}$, where $N$ is the total number of texts. The element set obtained from $D$ after HanLP extracts the elements is then $X = \{x_1, x_2, \ldots, x_M\}$, where $M$ is the number of public opinion elements. After the element set is obtained, word frequency statistics are performed on its elements;

In Step B, the elements of the element set $X$ whose word frequency is greater than or equal to a threshold are defined as hot words, and a hot word set $R = \{r_1, r_2, \ldots, r_n\}$ is established, where $n$ is the total number of hot words. The correspondence between the hot words and the public opinion texts is then established. For example, if hot word $r_1$ comes from public opinion texts $\{d_1, d_2, d_3\}$, the relation is expressed as $r_1 \to \{d_1, d_2, d_3\}$. According to practical conditions, the invention sets the word frequency threshold to 10. The similarity between the public opinion texts is then calculated with the similarity calculation system, which comprises the following subsystems:

first, a text similarity calculation system based on public opinion news text element statistics;

second, a BERT-based public opinion title similarity calculation system;

and third, a public opinion title similarity calculation system based on tf-idf.

The following describes the specific embodiments of the three subsystems:

The text similarity calculation system based on public opinion news text element statistics mainly uses the elements of the public opinion news for the calculation, and the calculation formula is as follows:

$$F = \frac{|x_i \cap x_j|}{|x_{\max}|}$$

where $F$ represents the similarity of the two texts, $|x_i \cap x_j|$ represents the number of elements in the intersection of the two public opinion texts, and $|x_{\max}|$ represents the maximum number of elements in the two public opinion texts.

If the similarity of the two texts is larger than or equal to the threshold value, the two texts are considered to belong to the same hot spot event, the "True" is returned, otherwise, the "False" is returned, and the threshold value is set to be 0.4.
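A minimal sketch of this element-based subsystem is given below. It assumes that each text's elements are treated as a set of distinct words and that the denominator is the larger of the two element counts; the function names and the sample elements are illustrative only.

```python
def element_similarity(elements_i, elements_j):
    """F = (number of shared elements) / (larger element count of the two texts)."""
    set_i, set_j = set(elements_i), set(elements_j)
    largest = max(len(set_i), len(set_j))
    if largest == 0:
        return 0.0  # two texts without elements share nothing
    return len(set_i & set_j) / largest

def same_event_by_elements(elements_i, elements_j, threshold=0.4):
    """Return True when F reaches the threshold (0.4 in the embodiment)."""
    return element_similarity(elements_i, elements_j) >= threshold

# Hypothetical element lists of two news items.
a = ["person_A", "court_B", "person_C"]
b = ["court_B", "person_C", "person_D", "bureau_E"]
print(element_similarity(a, b), same_event_by_elements(a, b))
```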

The BERT-based public opinion title similarity calculation system mainly uses Chinese word vectors pre-trained by the BERT model proposed by Google, together with the Euclidean distance, to calculate the similarity of texts. The specific flow is as follows:

first, a vocabulary is constructed from the word segmentation results, each word having a corresponding ID;

then, the text length $L$ is set to 15; texts with more than $L$ words are truncated, and texts with fewer than $L$ words are padded with 0 at the end;

the text is represented using the open source tool BERT (https://github.com/ymcui/Chinese-BERT-wwm), each word being represented as a vector of dimension $D$, where $D$ is the dimension of the word vector, so that a text is represented as an $L \times D$ matrix. The word vectors of each text are then added word by word, so that each text is finally represented by a single $D$-dimensional vector.

The similarity of two texts is then measured using the Euclidean distance, the most common distance measure, which gives the absolute distance between points in a multi-dimensional space. For two $D$-dimensional text vectors $a$ and $b$ it is calculated as follows:

$$d(a, b) = \sqrt{\sum_{k=1}^{D} (a_k - b_k)^2}$$

If the similarity of the two texts is greater than or equal to the threshold, the two texts are considered to belong to the same hot event and "True" is returned, otherwise "False" is returned; the threshold is set to 0.5.
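The flow of this subsystem can be sketched as follows, under two loud assumptions. First, a tiny hand-made lookup table stands in for the BERT word vectors, so no model is loaded. Second, the source does not say how a Euclidean distance is turned into a similarity that can be compared with 0.5; the mapping 1 / (1 + distance) used here is purely an assumption of this sketch.

```python
import math

def pad_or_truncate(token_ids, length=15, pad_id=0):
    """Cut a token-ID list to `length`, or pad it with `pad_id` at the end."""
    return (list(token_ids) + [pad_id] * length)[:length]

def title_vector(token_ids, embeddings, length=15):
    """Sum the word vectors of a padded title into one D-dimensional vector."""
    padded = pad_or_truncate(token_ids, length)
    rows = [embeddings[token_id] for token_id in padded]
    return [sum(column) for column in zip(*rows)]

def title_similarity(u, v):
    """Euclidean distance mapped to (0, 1]; the mapping is an assumption."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + distance)

def same_event_by_title(u, v, threshold=0.5):
    """Return True when the similarity reaches the threshold (0.5 in the embodiment)."""
    return title_similarity(u, v) >= threshold

# Toy 2-dimensional "word vectors"; ID 0 is the padding token.
toy_embeddings = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
v1 = title_vector([1, 2], toy_embeddings)
v2 = title_vector([3], toy_embeddings)
v3 = title_vector([1, 1, 1], toy_embeddings)
print(same_event_by_title(v1, v2), same_event_by_title(v1, v3))
```

With real BERT vectors the distances would be on a different scale, so the distance-to-similarity mapping and the threshold would need to be chosen together.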

The tf-idf-based public opinion title similarity calculation system mainly uses tf-idf to represent texts and cosine similarity to calculate the similarity between them. The specific flow is as follows:

the text length $L$ is set to 15; texts with more than $L$ words are truncated, and texts with fewer than $L$ words are padded with 0 at the end;

the weight of each word is calculated with the tf-idf formula, where tf-idf consists of tf (term frequency) and idf (inverse document frequency), calculated as follows:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of times the word $t_i$ occurs in document $d_j$, and $\sum_k n_{k,j}$ is the sum of the occurrences of all words in document $d_j$;

$$idf_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$; one is added to the denominator to avoid a denominator of 0. Finally, the text may be represented by:

$$s = tf_{i,j} \times idf_i$$

The cosine similarity of two public opinion titles, represented by their tf-idf vectors $A$ and $B$, can therefore be calculated by the following formula:

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

If the similarity of the two texts is greater than or equal to the threshold, the two texts are considered to belong to the same hot event and "True" is returned, otherwise "False" is returned; the threshold is set to 0.75.
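The tf-idf subsystem might be sketched in plain Python as below. The sketch follows the tf and idf definitions given above, including the one added to the idf denominator, and assumes a natural logarithm. It leaves out the padding of titles to length 15, and the tokenized titles are invented sample data.

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_titles):
    """Represent each title as a tf-idf vector over the shared vocabulary."""
    n_docs = len(tokenized_titles)
    doc_freq = Counter()
    for tokens in tokenized_titles:
        doc_freq.update(set(tokens))
    vocab = sorted(doc_freq)
    # idf with one added to the denominator, as in the description.
    idf = {t: math.log(n_docs / (1 + doc_freq[t])) for t in vocab}
    vectors = []
    for tokens in tokenized_titles:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty titles
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def same_event_by_tfidf(u, v, threshold=0.75):
    """Return True when the cosine similarity reaches the threshold (0.75 in the embodiment)."""
    return cosine_similarity(u, v) >= threshold

# Hypothetical, already segmented titles.
titles = [
    ["court", "sentences", "suspect"],
    ["court", "sentences", "suspect"],
    ["police", "arrest", "driver"],
    ["weather", "report", "today"],
]
vecs = tf_idf_vectors(titles)
print(same_event_by_tfidf(vecs[0], vecs[1]), same_event_by_tfidf(vecs[0], vecs[2]))
```

Note that with this idf definition a word occurring in nearly every title gets a weight of zero or below, which matters only for very small corpora.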

Finally, the final result is determined by the return values of the three subsystems, and the judgment rule is as follows: and if two or more returned results in the three subsystems are 'True', the two texts are considered to belong to the same hot event, and the 'True' is returned, otherwise, the 'False' is returned.
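The judgment rule is a simple two-out-of-three majority vote. A minimal sketch, with a hypothetical function name, is:

```python
def same_hot_event(element_result, bert_result, tfidf_result):
    """Return True when at least two of the three subsystem results are True."""
    votes = [element_result, bert_result, tfidf_result]
    return sum(bool(v) for v in votes) >= 2

print(same_hot_event(True, False, True))   # two subsystems agree
print(same_hot_event(False, False, True))  # only one subsystem agrees
```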

Step C comprises the following steps: Step C01: defining the elements of the element set $X$ whose word frequency is greater than or equal to a threshold as hot words and establishing a hot word set; Step C02: for the elements of the element set $X$ whose word frequency is smaller than the threshold, judging whether they are hot words under an original hot event; Step C03: establishing the correspondence between the obtained hot words and the public opinion news they come from.

In Step C01, the elements of the element set $X = \{x_1, x_2, \ldots, x_M\}$ obtained in Step A whose word frequency is greater than or equal to a threshold are defined as hot words, and a hot word set $R = \{r_1, r_2, \ldots, r_n\}$ is established, where $n$ is the total number of hot words; the threshold is again set to 10;

In Step C02, the hot word set $R'$ under the original hot events is obtained, and the elements of the element set $X$ whose word frequency is smaller than the threshold are compared with the hot words in $R'$. If an element appears among the hot words under an original hot event, i.e. $x_i \in R'$, the element is still defined as a hot word and added to the hot word set $R$. Because a judicial case involves several events, such as the occurrence of the case, the police arrest, the court trial and the second trial, this rule prevents one case from being defined as a new hot event merely because of the time interval between its events;

In Step C03, the correspondence $r_i \to \{d_1, d_2, \ldots, d_n\}$ is established between the hot words in the obtained hot word set $R = \{r_1, r_2, \ldots, r_n\}$ and the public opinion news they come from. For example, if hot word $r_1 \in R$ is an element appearing in public opinion texts $\{d_1, d_2, d_3\}$ whose word frequency is greater than the threshold, the relation is expressed as $r_1 \to \{d_1, d_2, d_3\}$.
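Steps C01 and C02 can be sketched as a single pass over the element frequencies. The function name and the sample data are assumptions for illustration; the default threshold of 10 is the value given in the embodiment.

```python
def update_hot_words(frequencies, previous_hot_words, threshold=10):
    """Build the hot word set R when hot events already exist.

    An element is kept if its frequency reaches the threshold (Step C01) or
    if it is already a hot word under an original hot event (Step C02);
    all other elements are discarded.
    """
    hot_words = set()
    for element, count in frequencies.items():
        if count >= threshold or element in previous_hot_words:
            hot_words.add(element)
    return hot_words

# Hypothetical frequencies from a new crawl and hot words R' of earlier events.
new_frequencies = {"person_A": 12, "court_B": 3, "person_C": 2}
earlier_hot_words = {"court_B"}
print(update_hot_words(new_frequencies, earlier_hot_words))
```

Here "court_B" stays a hot word although it appears only three times in the new crawl, which is the carry-over behaviour that keeps later stages of one case attached to the same event.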

In Step D, using the correspondence $r_i \to \{d_1, d_2, \ldots, d_n\}$ between all the hot words and the public opinion texts obtained in Step C, the similarity calculation system of Step B judges whether two texts belong to the same hot event. If the returned result is "True", the hot word $r_i$ and its corresponding public opinion texts $\{d_1, d_2, \ldots, d_n\}$ are merged into the original hot event they were compared with; otherwise, the similarity is calculated again pairwise among these public opinion texts, the texts for which "True" is returned are merged together with their hot words into a new hot event, and any remaining text is added as a new hot event.
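One possible reading of Step D is the greedy assignment below. It is a sketch under stated assumptions: a news item joins an event as soon as it is judged similar to at least one member of that event, events created during the same pass are also candidates (which covers the pairwise merging of the remaining texts), and the predicate `share_element` is a deliberately crude stand-in for the three-subsystem similarity calculation system.

```python
def assign_to_events(new_docs, events, is_same_event):
    """Attach each new news item to a matching event, or open a new event.

    events is a list of events, each a list of news items; it is updated
    in place and also returned.
    """
    for doc in new_docs:
        for event in events:
            if any(is_same_event(doc, member) for member in event):
                event.append(doc)
                break
        else:
            # No existing event matched: the item starts a new hot event.
            events.append([doc])
    return events

def share_element(doc_a, doc_b):
    """Stand-in predicate: two items match if they share any element."""
    return bool(doc_a & doc_b)

# Hypothetical news items represented by their element sets.
existing = [[{"person_A", "court_B"}]]
incoming = [{"person_A", "person_C"}, {"person_X"}, {"person_X", "bureau_Y"}]
result = assign_to_events(incoming, existing, share_element)
print(len(result), [len(event) for event in result])
```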

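According to the concept of the present invention, the present invention also provides a device for discovering public opinion hot events in the judicial field based on hot words. As shown in FIG. 4, the device comprises the data acquisition module, the element extraction module, the word frequency statistics module, the similarity calculation module and the merging module described above.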

While the present invention has been described in detail with reference to the embodiments, the present invention is not limited to the embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims

1. A judicial domain hot event discovery method based on hot words is characterized by comprising the following steps:

the method comprises the following specific steps:

Step2, if the database contains no hot event, defining the elements whose word frequency in Step1 is greater than or equal to the threshold as hot words, calculating the similarity between the public opinion news items corresponding to each hot word through a similarity calculation system, and determining from the returned result whether they are merged;

Step3, if the database contains a hot event, defining the elements whose word frequency in Step1 is greater than or equal to the threshold as hot words and adding them to a hot word set; comparing the elements whose word frequency is smaller than the threshold with the hot words under the original hot events, and, if an element appears in the hot word set under an original hot event, defining it as a hot word and adding it to the hot word set, otherwise discarding it;

and Step4, calculating, through the similarity calculation system, the similarity between the public opinion news corresponding to the hot words in the hot word set obtained in Step3 and the public opinion news under the original hot events, and determining from the result whether the public opinion news belongs to an original hot event or to a new hot event.

2. The judicial domain hotspot event discovery method based on hotspot words of claim 1, wherein: the specific steps of Step1 are as follows:

and carrying out word frequency statistics on the public opinion element set.

3. The judicial domain hotspot event discovery method based on hotspot words of claim 1, wherein: the specific steps of Step2 are as follows:

4. The judicial domain hotspot event discovery method based on hotspot words of claim 1, wherein: the specific steps of Step4 are as follows:

establishing a corresponding relation between the hot words in the hot word set obtained in Step3 and the public sentiment news of the source thereof to obtain a hot word-public sentiment news set;

5. The judicial domain hotspot event discovery method based on hotspot words of claim 1, wherein: the similarity calculation system comprises the following subsystems:

a BERT-based public opinion title similarity calculation system;

a public opinion title similarity calculation system based on tf-idf;