CN114385912A

CN114385912A - Method for judging place where internet public opinion information occurs

Info

Publication number: CN114385912A
Application number: CN202111605994.6A
Authority: CN
Inventors: 张皓帆
Original assignee: Xi'an Kangnai Network Technology Co ltd
Current assignee: Xi'an Kangnai Network Technology Co ltd
Priority date: 2021-12-25
Filing date: 2021-12-25
Publication date: 2022-04-22

Abstract

The invention discloses a method for judging the place where Internet public opinion information occurs, and relates to the technical field of data processing. A region attribute library is constructed from a multi-dimensional region attribute system, the public opinion information text is matched against several kinds of regional information, and threshold settings make the screening more flexible. The method offers high accuracy and a high matching degree, judges the place of occurrence of Internet public opinion data accurately and automatically, reduces workload and improves timeliness.

Description

Method for judging place where internet public opinion information occurs

Technical Field

The invention relates to the technical field of data processing, in particular to a method for judging a place where internet public opinion information occurs.

Background

With the rapid development of the Internet worldwide, online media has gradually become the "first media", surpassing newspapers, radio and television, and one of the important carriers of public opinion. The openness and virtual nature of the network give online public opinion the characteristics of directness, suddenness and diversity of information. According to the 47th Statistical Report on Internet Development in China issued by the China Internet Network Information Center (CNNIC), the number of Chinese Internet users had reached 989 million as of December 2020. As the number of netizens grows, public opinion events originating on the Internet are increasing, and some of this information has a negative effect, so accurate information needs to be obtained promptly so that it can be handled in time.

For the management of each region, accurately and effectively discovering public opinion events related to that region speeds up their handling by the relevant departments and personnel and allows harmful Internet information to be managed and defused in time.

Existing methods for automatically judging the place where Internet public opinion information occurs are simple and traditional: they identify the main area in which the public opinion arose chiefly from regional administrative nouns. In the current complex Internet environment, however, public opinion data comes from multiple channels, titles are ambiguous, and content is often relevant to several regions. Judging the place of occurrence from regional administrative nouns alone therefore produces large errors, and staff must again manually screen the relevant place-of-occurrence information out of a large volume of material, which seriously reduces work efficiency.

A method for judging the place where Internet public opinion information occurs is therefore needed that performs multi-aspect matching using a multi-dimensional region attribute system, uses threshold settings to make screening more flexible, and judges the place of occurrence accurately and automatically.

Disclosure of Invention

The invention aims to provide a method for judging the place where Internet public opinion information occurs, which performs multi-aspect matching using a multi-dimensional region attribute system, uses threshold settings to make screening more flexible, and judges the place of occurrence of Internet public opinion data accurately and automatically.

The invention provides a method for judging a place where internet public opinion information occurs, which comprises the following steps:

establishing a region attribute library, wherein the region attribute library comprises: regional administrative information, and the regional building information, regional scenic spot information, regional culture information, regional website information and regional enterprise information that correspond to the regional administrative information;

acquiring public opinion information texts;

carrying out word segmentation on sentences in the public opinion information text to obtain word segmentation combinations;

filtering the word segmentation combination to obtain an optimal word segmentation result;

matching the words in the optimal word segmentation result with the information in the region attribute library respectively, and outputting the successfully matched words;

determining the type of public sentiment information as one or more of administrative information, building information, scenic spot information, cultural information, website information or enterprise information according to the successfully matched words;

setting priority weights of administrative information, building information, scenic spot information, cultural information, website information and enterprise information in public opinion information;

counting the occurrence times of information with the highest priority in the public opinion information, multiplying the occurrence times by corresponding weight values, and adding multiplication results to obtain the sum of the products of the weights;

and setting a judgment threshold value, comparing the sum of products of the weights with the judgment threshold value, if the sum of products of the weights is greater than or equal to the judgment threshold value, obtaining a public opinion information generation place according to the information with the highest priority, and if the sum of products of the weights is less than the judgment threshold value, obtaining the public opinion information generation place according to the information with the second priority.
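The last two steps can be stated compactly as follows. This is only a restatement under assumed notation: $a_i$ and $b_i$ denote the occurrence count and weight of the $i$-th highest-priority attribute word, $W$ the sum of the products of the weights, and $S$ the judgment threshold; $n$ and $W$ are symbols introduced here for illustration.

```latex
W = \sum_{i=1}^{n} a_i \, b_i ,
\qquad
\text{place of occurrence is taken from }
\begin{cases}
\text{the highest-priority information}, & W \ge S, \\
\text{the second-priority information}, & W < S.
\end{cases}
```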

Further, the step of segmenting the sentences in the public opinion information text to obtain word segmentation combinations comprises:

adopting the Mmseg algorithm to segment each sentence of the public opinion information text that needs segmentation, in left-to-right order;

and identifying all candidate segmentation combinations of 3 words and outputting them.

Further, the step of filtering the word segmentation combinations to obtain an optimal word segmentation result includes:

sequentially filtering all the identified 3-word segmentation combinations using the 4 disambiguation rules of the Mmseg algorithm;

stopping the filtering when only one segmentation combination remains or when all 4 disambiguation rules have been applied;

and outputting the optimal word segmentation result after the filtering is finished.
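The stopping condition of this filtering step can be sketched as follows. It is a minimal illustration, not the patented implementation: `filter_chunks`, the `Chunk` type and the two toy rules are hypothetical names, and the real rules would be the 4 Mmseg disambiguation rules, each treated here as a score to be maximized.

```python
from typing import Callable, List, Sequence, Tuple

Chunk = Tuple[str, ...]  # one candidate segmentation combination (up to 3 words)


def filter_chunks(chunks: Sequence[Chunk],
                  rules: Sequence[Callable[[Chunk], float]]) -> List[Chunk]:
    """Apply scoring rules in order, keeping only the top-scoring chunks each time."""
    remaining = list(chunks)
    for score in rules:
        if len(remaining) <= 1:
            break  # a single combination is left, so filtering stops early
        best = max(score(c) for c in remaining)
        remaining = [c for c in remaining if score(c) == best]
    return remaining  # may still hold ties once every rule has been applied


# Hypothetical toy rules: total length first, then average word length.
toy_rules = [
    lambda c: sum(len(w) for w in c),
    lambda c: sum(len(w) for w in c) / len(c),
]
candidates = [("ab", "c", "d"), ("a", "bc", "d"), ("ab", "cd")]
print(filter_chunks(candidates, toy_rules))
```

All three toy candidates tie on total length, so the second rule decides; with real data the loop may end after the first rule or run through all four.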

Further, the step of matching the words in the optimal word segmentation result with the information in the region attribute library respectively and outputting the successfully matched words comprises the following steps:

matching each word in the optimal word segmentation result with regional administrative information, regional building information, regional scenic spot information, regional culture information, regional website information and regional enterprise information in a regional attribute library respectively;

and if one or more words in the optimal word segmentation result are successfully matched with the information in the region attribute library, dividing the words in the optimal word segmentation result which is successfully matched into corresponding information types of public opinion information according to the information types in the region attribute library which is successfully matched, and obtaining the words which are successfully matched.

Further, the method also includes:

and if one or more words in the word segmentation result fail to match the information in the region attribute library, matching the word segmentation result of the next sentence against the information in the region attribute library, following the sentence order of the public opinion information text.
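The matching step, including the move to the next sentence when nothing matches, might look like the sketch below. The lookup table `INDEX`, its region labels and the function name `match_sentences` are assumptions made for illustration; the source does not specify how the library is stored or queried.

```python
from typing import Dict, List, Tuple

# Hypothetical inverted index: attribute word -> (information type, region).
INDEX: Dict[str, Tuple[str, str]] = {
    "Xi'an": ("administrative", "Shaanxi"),
    "Huashan": ("scenic_spot", "Shaanxi"),
}


def match_sentences(sentences: List[List[str]],
                    index: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect matched words grouped by information type, sentence by sentence."""
    matched: Dict[str, List[str]] = {}
    for words in sentences:          # sentences are visited in their original order
        for word in words:
            hit = index.get(word)
            if hit is None:
                continue             # unmatched word: keep going, then next sentence
            info_type, _region = hit
            matched.setdefault(info_type, []).append(word)
    return matched


segmented = [["weather", "today"], ["Xi'an", "Huashan", "Xi'an"]]
print(match_sentences(segmented, INDEX))
```

Repeated hits are kept on purpose, because the later steps count how often each attribute word occurs.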

Further, the step of counting the occurrence times of information with the highest priority in the public opinion information, multiplying the occurrence times by corresponding weight values, and adding the multiplication results to obtain the sum of the products of the weights includes:

setting words with high priority in matching results as main information attribute words of public opinion information according to priority weights of regional administrative information, regional building information, regional scenic spot information, regional culture information, regional website information and regional enterprise information;

setting a judgment threshold value of the main information attribute words;

counting the occurrence times of main information attribute words in word segmentation results of public opinion information, and determining the weight of the corresponding main information attribute words;

and multiplying the occurrence times of the main information attribute words by the corresponding weight values, and adding the multiplication results to obtain the sum of products of the weights of the main information attribute words.
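The sum of the products of the weights can be computed with a simple counter. The weights and tokens below are invented for illustration, and `weighted_sum` is a hypothetical helper name rather than anything defined in the source.

```python
from collections import Counter
from typing import Dict, Iterable


def weighted_sum(tokens: Iterable[str], weights: Dict[str, float]) -> float:
    """Sum of (occurrence count x weight) over the attribute words in `weights`."""
    counts = Counter(t for t in tokens if t in weights)
    return sum(counts[word] * weights[word] for word in counts)


# Hypothetical weights for main information attribute words.
main_weights = {"Shaanxi": 3.0, "Xi'an": 2.0}
tokens = ["Xi'an", "news", "Shaanxi", "Xi'an"]
print(weighted_sum(tokens, main_weights))  # 2 * 2.0 + 1 * 3.0
```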

Further, the method also includes:

comparing the sum of the products of the weights of the main information attribute words with a judgment threshold value of the main information, and if the sum of the products of the weights of the main information attribute words is greater than or equal to the judgment threshold value of the main information, determining a public opinion region according to the main information attribute words;

and if the sum of the products of the weights of the main information attribute words is smaller than the judgment threshold of the main information, performing comparison judgment of the auxiliary information threshold according to priority iteration to obtain a judgment result of the public opinion region.

Further, the comparing and determining of the auxiliary information threshold value according to the priority iteration to obtain the judgment result of the public opinion region includes:

setting a plurality of information words with lower priorities than the main information attribute words in the matching result as auxiliary information attribute words of the public opinion information;

setting a judgment threshold value of the auxiliary information attribute words;

counting the occurrence times of auxiliary information attribute words in word segmentation results of public opinion information, and determining the weight of the corresponding auxiliary information attribute words;

multiplying the occurrence times of the auxiliary information attribute words by corresponding weight values, adding the multiplication results and the sum of the products of the weights of the main information attribute words to obtain the sum of the products of the weights of the auxiliary information attribute words, comparing the sum of the products of the weights of the auxiliary information attribute words with a judgment threshold value of auxiliary information, and if the sum of the products of the weights of the auxiliary information attribute words is greater than or equal to the judgment threshold value of the auxiliary information, determining the public opinion region according to the auxiliary information attribute words;

if the sum of the products of the weights of the auxiliary information attribute words is smaller than the judgment threshold of the auxiliary information, the public opinion region cannot be determined according to the auxiliary information attribute words, and the comparison judgment of the next auxiliary information threshold is carried out according to the priority iteration to obtain the judgment result of the public opinion region.
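The priority iteration described above can be sketched as a loop over tiers in which each tier adds its products to a running total and compares that total with its own threshold. The tier layout, names and numbers are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Each tier: (name, {attribute word: weight}, threshold), ordered by priority.
Tier = Tuple[str, Dict[str, float], float]


def decide_tier(tokens: List[str], tiers: List[Tier]) -> Optional[str]:
    """Return the name of the first tier whose cumulative score meets its threshold."""
    counts = Counter(tokens)
    running = 0.0
    for name, weights, threshold in tiers:
        # Add this tier's count-times-weight products to the running total.
        running += sum(counts[w] * weight for w, weight in weights.items())
        if running >= threshold:
            return name  # the region is determined from this tier's words
    return None          # no tier reached its threshold


# Hypothetical tiers: main information first, then one auxiliary tier.
tiers = [
    ("main", {"Shaanxi": 3.0}, 6.0),
    ("auxiliary", {"Huashan": 1.0}, 5.0),
]
print(decide_tier(["Shaanxi", "Huashan", "Huashan"], tiers))
```

In the usage line the main tier scores 3.0, below its threshold of 6.0, and the auxiliary tier raises the running total to 5.0, which meets the auxiliary threshold.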

Compared with the prior art, the invention has the following remarkable advantages:

the method for judging the place where Internet public opinion information occurs establishes a standardized data system, performs multi-aspect matching using a multi-dimensional region attribute system, and uses threshold settings to make screening more flexible. Compared with the prior art, it offers higher accuracy and a higher matching degree, judges the place of occurrence of Internet public opinion information accurately and automatically, reduces workload and improves timeliness.

In the method provided by the invention, sentences are segmented into words using a matching algorithm; the word combinations are filtered with the 4 disambiguation rules of the Mmseg algorithm to obtain an optimal word segmentation result; the optimal word segmentation result is matched against the information in the region attribute library; the matching result characterizes a given kind of information in the public opinion; and a judgment threshold determines whether the place where the public opinion information occurs can be judged from that kind of information.

Drawings

FIG. 1 is a block diagram of the judgment of the place where public opinion information occurs according to an embodiment of the present invention;

FIG. 2 is a block diagram of the automatic judgment of public opinion data for the Shaanxi region according to an embodiment of the present invention;

FIG. 3 is a block diagram of the automatic judgment of public opinion data for the Xinjiang region according to an embodiment of the present invention.

Detailed Description

The technical solutions of the embodiments of the present invention are clearly and completely described below with reference to the drawings in the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

The Mmseg algorithm is a Chinese word segmentation algorithm based on character string matching (also called dictionary-based segmentation). In each iteration it identifies, in left-to-right order, several different 3-word combinations from the sentence to be segmented. The optimal combination is then selected according to 4 disambiguation rules: maximum matching, largest average word length, smallest variance of word lengths, and largest sum of the degree of morphemic freedom of one-character words. The first word of the selected combination is taken as the segmentation result of that iteration, and the remainder of the sentence (everything after the word just segmented) goes through the next round of segmentation. Compared with the traditional forward maximum matching algorithm, the algorithm takes context into account, which solves the problem of each word being chosen in isolation while its neighbouring words are ignored.
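The enumeration of 3-word combinations can be sketched as follows. The function names, the `max_len` parameter and the tiny dictionary are hypothetical; the example sentence is the one the source glosses as "study life origin", assumed here to be 研究生命起源.

```python
from typing import List, Set, Tuple


def prefix_words(text: str, start: int, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Dictionary words that begin at `start`; a single character is always allowed."""
    words = [text[start]]
    for end in range(start + 2, min(len(text), start + max_len) + 1):
        if text[start:end] in dictionary:
            words.append(text[start:end])
    return words


def three_word_chunks(text: str, dictionary: Set[str]) -> List[Tuple[str, ...]]:
    """Enumerate every chunk of up to three consecutive words starting at position 0."""
    chunks: List[Tuple[str, ...]] = []

    def extend(pos: int, chunk: Tuple[str, ...]) -> None:
        if len(chunk) == 3 or pos == len(text):
            chunks.append(chunk)  # chunk is complete or the sentence is used up
            return
        for word in prefix_words(text, pos, dictionary):
            extend(pos + len(word), chunk + (word,))

    if text:
        extend(0, ())
    return chunks


# Hypothetical toy dictionary for illustration.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
for chunk in three_word_chunks("研究生命起源", toy_dictionary):
    print("_".join(chunk))
```

After the disambiguation rules pick one chunk, only its first word is emitted and the enumeration restarts from the character that follows it.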

The region attribute library is set up mainly according to the characteristics of the region concerned: its content is structured by regional administration, regional buildings and scenic spots, regional culture, regional websites, regional ethnic groups and the like; priorities and judgment thresholds are set for the content of that structure; and weight values are assigned to the attribute words found in the public opinion data.
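One possible in-memory layout for such a library is a nested mapping plus an inverted index for fast lookup. Everything below is an assumption for illustration: the key names (including `ethnic` for ethnic-group or nationality information), the weights, and the `build_index` helper are not taken from the source, which does not prescribe a storage format.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: region -> information type -> attribute words.
REGION_LIBRARY: Dict[str, Dict[str, List[str]]] = {
    "Shaanxi": {
        "administrative": ["Shaanxi", "Xi'an"],
        "scenic_spot": ["Terracotta Warriors", "Huashan"],
    },
    "Xinjiang": {
        "administrative": ["Xinjiang"],
        "ethnic": ["Uygur"],
    },
}

# Hypothetical priority weights per information type (higher is more decisive).
TYPE_WEIGHTS: Dict[str, float] = {"administrative": 3.0, "scenic_spot": 2.0, "ethnic": 2.0}


def build_index(library: Dict[str, Dict[str, List[str]]]) -> Dict[str, Tuple[str, str]]:
    """Invert the library into attribute word -> (information type, region)."""
    index: Dict[str, Tuple[str, str]] = {}
    for region, by_type in library.items():
        for info_type, words in by_type.items():
            for word in words:
                index[word] = (info_type, region)
    return index


index = build_index(REGION_LIBRARY)
print(index["Huashan"])
```

A flat index of this kind assumes each attribute word belongs to a single region; words shared by several regions would need a list of entries instead.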

Referring to fig. 1-3, the invention provides a method for judging a place where internet public opinion information occurs, comprising the following steps:

constructing a region attribute library whose attribute words are set according to regional characteristics, wherein the region attribute library comprises: regional administrative information, and the regional building information, regional scenic spot information, regional culture information, regional website information and regional enterprise information corresponding to the regional administrative information; the regional administrative information refers to the administrative nouns of each region; the regional scenic spot information refers to the names of famous scenic spots in each region; the regional website information refers to the websites of each region;

acquiring public opinion information texts;

segmenting the sentences in the public opinion information text in left-to-right order using a matching algorithm, identifying all candidate segmentation combinations of 3 words, and outputting them;

sequentially filtering all the identified 3-word segmentation combinations using the 4 disambiguation rules of the Mmseg algorithm, stopping the filtering when only one segmentation combination remains or when all 4 disambiguation rules have been applied, and outputting the optimal word segmentation result after the filtering is finished;

matching the words in the optimal word segmentation result with region administrative information, region building information, region scenic spot information, region culture information, region website information and region enterprise information in a region attribute library respectively, if one or more words in the optimal word segmentation result are successfully matched with the information in the region attribute library, dividing the words in the successfully matched optimal word segmentation result into corresponding information types of public sentiment information according to the information types in the successfully matched region attribute library to obtain the successfully matched words, and if one or more words in the word segmentation result are not successfully matched with the information in the region attribute library, matching the next sentence word segmentation result with the information in the region attribute library according to the sentence sequence of the public sentiment information text to output the successfully matched words;

setting a word with high priority in the matching result as a main information attribute word of the public opinion information, setting a judgment threshold of the main information attribute word, counting the occurrence times of the main information attribute word in the word segmentation result of the public opinion information, determining the weight of the corresponding main information attribute word, multiplying the occurrence times of the main information attribute word with the corresponding weight value, and adding the multiplication results to obtain the sum of products of the weights of the main information attribute word;

comparing the sum of the products of the weights of the main information attribute words with a judgment threshold value of the main information, if the sum of the products of the weights of the main information attribute words is greater than or equal to the judgment threshold value of the main information, determining a public opinion region according to the main information attribute words, and if the sum of the products of the weights of the main information attribute words is less than the judgment threshold value of the main information, performing comparison judgment on an auxiliary information threshold value according to priority iteration to obtain a judgment result of the public opinion region;

setting a plurality of information words with a priority lower than that of the main information attribute words in the matching result as auxiliary information attribute words of the public opinion information, setting a judgment threshold of the auxiliary information attribute words, counting the occurrence times of the auxiliary information attribute words in the word segmentation result of the public opinion information, determining the weight of the corresponding auxiliary information attribute words, multiplying the occurrence times of the auxiliary information attribute words by the corresponding weight values, and adding the multiplication results to the sum of the products of the weights of the main information attribute words to obtain the sum of the products of the weights of the auxiliary information attribute words; comparing the sum of the products of the weights of the auxiliary information attribute words with the judgment threshold of the auxiliary information; if the sum is greater than or equal to the judgment threshold of the auxiliary information, determining the public opinion region according to the auxiliary information attribute words; if the sum is less than the judgment threshold of the auxiliary information, the public opinion region cannot be determined according to the auxiliary information attribute words, and the comparison judgment of the next auxiliary information threshold is carried out according to the priority iteration to obtain the judgment result of the public opinion region.

Example 1

The optimal word segmentation result is obtained using the matching algorithm and the 4 disambiguation rules of the Mmseg algorithm, as follows:

the matching algorithm adopts either the Simple method or the Complex method:

the Simple method: a simple forward match that lists all possible words starting from the first character. For 国际化大都会 ("international metropolis"), for example, the following can be obtained: 国 ("nation"), 国际 ("international") and 国际化 ("internationalized");

the Complex method: all three-word phrases are matched, that is, all possible combinations of three words are obtained starting from a given position. For 研究生命起源 ("research on the origin of life"), for example, phrases such as 研_究_生, 研_究_生命 and 研究_生命_起源 may be found.

The 4 disambiguation rules of the Mmseg algorithm are:

rule 1, the total length of the candidate word combination is the largest: for the Simple matching method, the longest word is selected, for example the longest match 国际化 ("internationalized"); for the Complex matching method, the phrase with the largest total length is selected, and the first word of that phrase becomes the first segmented word, for example 研究生 ("graduate student") in 研究生_命_起源 or 研究 ("research") in 研究_生命_起源;

rule 2, the average word length of the candidate word combination is the largest: after filtering by rule 1, if more than one phrase remains, the one with the largest average word length is selected (average word length = total number of characters in the phrase / number of words). For 生活水平 ("living standard"), for example, the following phrases may be obtained: 生_活水_平 (4/3 = 1.33), 生活_水_平 (4/3 = 1.33) and 生活_水平 (4/2 = 2); under this rule the phrase 生活_水平 is selected;

rule 3, the variation in word length of the candidate word combination is the smallest: since the variation in word length can be measured by the standard deviation, the standard deviation formula is applied directly. For 研究生命起源, for example: 研究_生命_起源 has standard deviation sqrt(((2-2)^2 + (2-2)^2 + (2-2)^2)/3) = 0, and 研究生_命_起源 has standard deviation sqrt(((2-3)^2 + (2-1)^2 + (2-2)^2)/3) = 0.8165, so this rule selects the phrase 研究_生命_起源;

rule 4, the statistic derived from the frequency of single-character words in the candidate word combination is the highest: the natural logarithm of the word frequency of each single-character word in the phrase is calculated, the values are added, and the phrase with the largest sum is taken. For 设施和服务 ("facilities and services"), for example, the candidate combinations include 设施_和服_务, 设施_和_服务 and 设_施_和; rule 1 leaves 设施_和服_务 and 设施_和_服务, and rules 2 and 3 cannot produce a unique result, so only the last rule can decide. The single-character word in the first phrase is 务 and in the second is 和; intuitively 和 ("and") is the more frequent in everyday use, but the outcome depends on the word-frequency dictionary. Assuming the frequency of 务 as a single-character word is 30 and that of 和 is 100, the natural logarithms of 30 and 100 are taken and the larger is chosen, so the phrase containing 和, namely 设施_和_服务, is selected.

The candidate combinations are filtered by the 4 disambiguation rules in sequence to obtain the optimal word segmentation combination.
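The four quantities used by the rules can be written as small scoring functions and checked against the figures quoted above. This is a sketch under assumptions: the function names are hypothetical, the Chinese example phrases are assumed to be the originals behind the translated examples, and the frequencies 30 and 100 are the illustrative values given in the text.

```python
import math
from typing import Dict, Tuple

Chunk = Tuple[str, ...]


def total_length(chunk: Chunk) -> int:
    """Rule 1: total number of characters in the chunk (larger is better)."""
    return sum(len(w) for w in chunk)


def average_length(chunk: Chunk) -> float:
    """Rule 2: characters per word (larger is better)."""
    return total_length(chunk) / len(chunk)


def length_std(chunk: Chunk) -> float:
    """Rule 3: population standard deviation of word lengths (smaller is better)."""
    mean = average_length(chunk)
    return math.sqrt(sum((len(w) - mean) ** 2 for w in chunk) / len(chunk))


def single_char_log_freq(chunk: Chunk, freq: Dict[str, int]) -> float:
    """Rule 4: sum of ln(frequency) over one-character words (larger is better)."""
    return sum(math.log(freq[w]) for w in chunk if len(w) == 1 and w in freq)


# Figures from the examples above.
print(round(average_length(("生", "活水", "平")), 2), average_length(("生活", "水平")))
print(length_std(("研究", "生命", "起源")), round(length_std(("研究生", "命", "起源")), 4))
assumed_freq = {"务": 30, "和": 100}  # illustrative frequencies from the text
print(single_char_log_freq(("设施", "和服", "务"), assumed_freq),
      single_char_log_freq(("设施", "和", "服务"), assumed_freq))
```

Rule 3 is minimized while the other three are maximized, so a cascade that always keeps the highest score would use the negated standard deviation.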

Example 2

Referring to FIG. 2, public opinion data for the Shaanxi region is judged automatically.

Attribute words of the region attribute library are set according to the characteristics of the Shaanxi region: regional administrative information (Shaanxi, Xi'an, Xiyang), regional scenic spot information (Terracotta Warriors, Wild Goose Pagoda, Huashan), and regional website information (West-North-West Shang network);

performing Chinese word segmentation on the content of the public opinion information text using the matching algorithm of the Mmseg algorithm, and generating the optimal word segmentation result of the public opinion information text according to the 4 disambiguation rules of Mmseg;

matching the word segmentation result of the public opinion information text with a region attribute library, and determining region administrative information, region scenic spot information and region website information of the public opinion information according to successfully matched words;

setting priorities and primary and secondary properties of regional administrative information, regional scenic spot information and regional website information in public opinion information, wherein the regional administrative information and the regional scenic spot information are primary information attribute words, the regional website information is secondary information attribute words, and a double determination threshold S of the primary information is set for the regional administrative information and the regional scenic spot information;

setting weighted values of regional administrative information, regional scenery spot information and regional website information;

counting the occurrence times a1, a2 and a3 of main information attribute words in word segmentation results of public opinion information, determining weights b1, b2 and b3 corresponding to the main information attribute words, multiplying the occurrence times of the main information attribute words by corresponding weight values, and adding multiplication results to obtain the sum of products of the weights of the main information attribute words, namely a1·b1 + a2·b2 + a3·b3;

and comparing the sum of the products of the weights of the main information attribute words with a judgment threshold of the main information, if the sum of the products of the weights of the main information attribute words is greater than or equal to the judgment threshold of the main information, determining a public opinion region according to the main information attribute words, and if the sum of the products of the weights of the main information attribute words is less than the judgment threshold of the main information, performing comparison judgment on an auxiliary information threshold according to priority iteration to obtain a judgment result of the public opinion region.

Example 3

Referring to FIG. 3, public opinion data for the Xinjiang region is judged automatically.

Regional module attribute words are set according to the characteristics of the Xinjiang region: administrative (Xinjiang, Urumqi, autonomous region), ethnic (Hui, Uygur), cultural (Chebular, Xinjiang);

matching the word segmentation result of the public opinion information text with the region attribute library, and determining the regional administrative information, regional ethnic information and regional culture information of the public opinion information according to the successfully matched words;

setting the priority and the main or auxiliary role of the regional administrative information, regional ethnic information and regional culture information of the public opinion information, wherein the regional ethnic information is main information, the regional culture information is auxiliary information, and double thresholds are set for the administrative information;

setting the weight values of the attribute words;

taking the word segmentation result generated from the public opinion information text as the basis and following the priority set for the regional attribute modules, using the administrative library in the first round to match the text against the attribute words Xinjiang, Urumqi and autonomous region, using the auxiliary library to match the text against its attribute words (for example, cutting cake), and counting the number of hits of the relevant attribute words;

and performing statistical calculation from the attribute word counts and the weight values, and generating different attribute matching thresholds with reference to the hits on the attribute words of the auxiliary library.

A threshold is set in the current attribute library for the judgment; if it is met, the public opinion region is located; if it is not met, a second round of positioning is performed with the ethnic library as the reference to obtain the judgment result of the place where the public opinion information occurs.

The above discloses only a few specific embodiments of the present invention; the invention is not limited to these embodiments, and any variation conceivable by those skilled in the art is intended to fall within its scope of protection.

Claims

1. A method for judging a place where Internet public opinion information occurs is characterized by comprising the following steps:

acquiring public opinion information texts;

2. The method as claimed in claim 1, wherein the step of segmenting words of the sentences in the text of the public opinion information to obtain the segmentation result comprises:

3. The method as claimed in claim 2, wherein the step of filtering the segmentation result to obtain the optimal segmentation result comprises:

4. The method as claimed in claim 1, wherein the step of matching words in the optimal word segmentation result with information in a regional attribute library and outputting successfully matched words comprises:

5. The method for determining the occurrence location of internet public opinion information according to claim 4, further comprising:

6. The method as claimed in claim 1, wherein the step of counting the number of occurrences of information with highest priority among the public opinion information, multiplying the number of occurrences by corresponding weight values, and adding the multiplication results to obtain the sum of products of weights comprises:

setting a judgment threshold value of the main information attribute words;

7. The method for determining the occurrence location of internet public opinion information according to claim 6, further comprising:

8. The method as claimed in claim 7, wherein the iteratively performing comparison and determination of the auxiliary information threshold according to the priority to obtain a determination result of the public opinion region comprises: