CN112860906B - Mayor's hotline public opinion decision support method and system based on natural language processing - Google Patents

Info

Publication number
CN112860906B
Authority
CN
China
Prior art keywords
hot
public opinion
natural language
language processing
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110440120.3A
Other languages
Chinese (zh)
Other versions
CN112860906A (en
Inventor
张子成
曹伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Huiningjie Information Technology Co ltd
Original Assignee
Nanjing Huiningjie Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Huiningjie Information Technology Co ltd filed Critical Nanjing Huiningjie Information Technology Co ltd
Priority to CN202110440120.3A priority Critical patent/CN112860906B/en
Publication of CN112860906A publication Critical patent/CN112860906A/en
Application granted granted Critical
Publication of CN112860906B publication Critical patent/CN112860906B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a mayor's hotline public opinion decision support method and system based on natural language processing, which help the mayor's hotline track public opinion trends and make scientific decisions. Daily hot events are mined from hotline data using natural language processing and are then simplified and classified. Work orders containing the hot event keywords are retrieved with a hash index method that can match multiple keywords, and an improved TextRank algorithm extracts a summary of the retrieved work orders so that the most important information in them is fed back. Hotline staff can thus learn each day's complaint hot spots and public opinion trends in time and report them to the relevant departments. With the system, mayor's hotline staff no longer comb through public opinion hot spot information manually; the information is mined and displayed automatically, which greatly improves staff efficiency and allows livelihood problems to be found in time and addressed proactively.

Description

Mayor's hotline public opinion decision support method and system based on natural language processing
Technical Field
The invention belongs to the technical field of artificial intelligence and machine learning, and particularly relates to a mayor's hotline public opinion decision support method and system based on natural language processing.
Background
As citizens become more aware of protecting their rights, the mayor's hotline has gradually become an important channel for the public to express interests, vent emotions, and exchange views. In a complex social environment with conflicting interests, online public opinion about emergencies flares up from time to time, and its negative effects are easily amplified after a public emergency occurs. Individual problems then become generalized and complicated and social conflicts intensify, which can trigger a chain reaction of public crises. Traditional public opinion management approaches such as suppressing, blocking, and ignoring not only fail to eliminate a public opinion crisis but may further damage the government's image. Early prediction and handling of public opinion outbreaks is therefore very important.
With the continuous development of information technology, artificial intelligence and machine learning are increasingly used to help departments make scientific decisions. Natural language processing is an important branch of artificial intelligence that uses computer technology to process and analyze natural language.
In December 2017, the Ministry of Industry and Information Technology issued the "Three-Year Action Plan for Promoting the Development of the New Generation Artificial Intelligence Industry (2018-2020)", which specifically mentions "encouraging government departments to take the lead in using artificial intelligence to improve operational efficiency and the level of management and service". Today the mayor's hotline needs to serve as the "general customer service" that responds to the demands of all parties. Taking the convenience and satisfaction of the public and of market entities as the measure, it must improve the quality and efficiency of case handling and the intelligence of the system, and form a closed loop of rapid response, efficient handling, tracking and supervision, timely feedback, and analysis-driven improvement. Hotline information resources are a window for understanding public sentiment in time and an important input to government work and decision-making.
At present, the mayor's hotline makes little use of the value of its data, and the exploitation of hotline information resources is still at a preliminary stage. The invention uses technologies such as big data mining and machine learning to design a public opinion decision support system for the mayor's hotline, supporting early discovery and timely handling of public opinion crises.
Disclosure of Invention
The purpose of the invention is as follows: to address the above problems, the invention provides a mayor's hotline public opinion decision support method and system based on natural language processing. Daily hot events are mined using natural language processing and are simplified and classified; work orders containing the hot event keywords are then retrieved with a hash index method that matches multiple keywords; an improved TextRank algorithm extracts a summary of the retrieved work orders and feeds back the most important information in them. Hotline staff can thus learn each day's complaint hot spots and public opinion trends in time and report them to the relevant departments.
The technical scheme is as follows: to achieve the purpose of the invention, the following technical scheme is adopted:
A mayor's hotline public opinion decision support method based on natural language processing comprises the following steps:
(1) mining the daily hot events of the mayor's hotline based on natural language processing;
(2) simplifying and classifying the hot events based on cosine similarity;
(3) retrieving the complaint work orders that contain the hot event keywords with a hash index method that matches multiple keywords;
(4) extracting summaries of the retrieved complaint work orders with an improved TextRank algorithm;
(5) hotline staff learn the daily public opinion hot events from the summary report and report them to the relevant departments.
Further, the step (1) specifically includes the steps of:
(1.1) performing word segmentation and keyword extraction on the complaint work orders, wherein the keywords are extracted with the TF-IDF algorithm;
(1.2) constructing a keyword FpTree;
(1.3) mining keyword frequent item sets based on the keyword FpTree.
Further, the step (1.2) of constructing the keyword FpTree includes the steps of:
1) setting a minimum absolute support, scanning the data records, generating the frequent 1-item sets of keywords, and sorting them by number of occurrences in descending order;
2) scanning the data records again, and sorting the frequent keywords from step 1) that appear in each record according to the order established in step 1).
Further, the step (1.3) of mining the keyword frequent item sets based on the keyword FpTree includes the steps of:
1) constructing a conditional pattern base, which consists of the prefix paths of the item to be mined;
2) constructing a conditional FpTree;
3) recursively mining the conditional FpTree.
Further, the step (2) specifically includes the steps of:
(2.1) setting a plurality of hot spot classification marks, wherein each hot spot classification mark comprises a plurality of keywords;
(2.2) calculating cosine similarity between all keywords of the hot spot classification marks and the frequent item set of the keywords;
and (2.3) finding the hot spot classification mark with the largest cosine similarity with the keyword frequent item set in the hot spot classification marks, and marking the hot spot classification mark on the keyword frequent item set.
Further, the step (2.2) of calculating the cosine similarity comprises the steps of:
1) encoding the keyword frequent item sets as One-Hot codes;
One-Hot encoding represents categorical variables as binary vectors: each categorical value is mapped to an integer value, and each integer value is then represented as a binary vector;
2) computing the cosine similarity of the One-Hot encoded keyword frequent item sets.
Further, in the step (3),
the text database of mayor's hotline complaint work orders is preprocessed into a hash table that maps each keyword to the corresponding work order numbers, and multi-keyword search then retrieves the work orders by their unique primary key.
Further, the step (4) specifically includes the steps of:
(4.1) processing the work order content into a text containing a plurality of sentences, and converting the sentences into sentence vectors which can be understood by a machine;
(4.2) calculating cosine similarity between sentence vectors to obtain a similarity matrix as edge weight; adopting TF-IDF score as an initial weight value;
(4.3) carrying out TextRank iteration, and calculating the TextRank value of each sentence to obtain a sentence rank; and extracting the automatic abstracts according to the sentence ranking.
A mayor's hotline public opinion decision support system based on natural language processing comprises a base layer, a data layer, a support layer, an application layer, a service layer and a user layer;
the base layer is the hardware infrastructure for project implementation and comprises the computer room and the network environment;
the data layer comprises a basic library and an intelligent library; the basic library holds primary data, namely the original work order text information; the intelligent library holds secondary data, namely processed data;
the support layer provides algorithms and application services, comprising FP-Growth, hot spot problem classification, information retrieval and automatic summary extraction;
the application layer provides the specific application services, comprising intelligent public opinion supervision, intelligent perception of public sentiment and intelligent decision support;
the service layer comprises a web client and a mobile client;
the user layer comprises leaders, service personnel and operation and maintenance personnel.
Advantageous effects: based on natural language processing, the invention mines the daily hot spot data of the hotline, simplifies and classifies the hot spots, retrieves the work orders containing the hot spot keywords with a hash index method that can match multiple keywords, and extracts a summary of the retrieved work orders with an improved TextRank algorithm, feeding back the most important information in them. Hotline staff can thus learn each day's complaint hot spots and public opinion trends in time and report them to the relevant departments.
The invention uses natural language processing to build a public opinion decision support system for the mayor's hotline. The system automatically mines public opinion hot spots every day with machine learning algorithms and raises public opinion alarms against a configured threshold, supporting staff in their decisions. Automatic summary extraction helps staff identify the most important problems among a large number of related work orders, and makes it convenient for them to compile daily, weekly and monthly public opinion reports themselves and push them to the relevant departments.
The decision support system reduces the time staff need to find mass incidents among massive numbers of complaint work orders from the original 2 days to real time, which greatly improves staff efficiency and speeds up the handling of public problems. After the system is deployed, mayor's hotline staff can monitor public opinion information in real time, discover livelihood problems in time, such as disorderly residential community management, fraud on WeChat platforms and commercial activities disturbing residents, and take proactive countermeasures.
Drawings
FIG. 1 is a flowchart of the mayor's hotline public opinion decision support method based on natural language processing according to the present invention;
FIG. 2 is a diagram of the keyword FpTree construction process;
FIG. 3 is a diagram of the hash index storing work order keyword information;
FIG. 4 is a schematic diagram of the embedding layers of the BERT model;
FIG. 5 is a flowchart of the improved TextRank algorithm;
FIG. 6 is a block diagram of the mayor's hotline public opinion decision support system based on natural language processing according to the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
As shown in FIG. 1, the mayor's hotline public opinion decision support method based on natural language processing according to the present invention includes the steps of:
(1) mining the daily hot events of the mayor's hotline based on natural language processing;
The objects studied by the invention are the complaint work orders of the mayor's hotline. The keywords of each complaint work order are treated as an item set, so mining the daily hot events of the mayor's hotline becomes a frequent item set mining problem.
The frequent item set mining comprises the following steps:
(1.1) complaint work orders are stored as Chinese sentences, so their content must first be preprocessed: word segmentation and keyword extraction are performed on each complaint work order, and the keywords are extracted with the TF-IDF algorithm;
TF-IDF is a statistical method used to evaluate how important a word is to one document within a corpus.
The TF-IDF calculation formulas are as follows:
$$TF_{i,j} = \frac{N_{i,j}}{\sum_{k} N_{k,j}}$$
where N_{i,j} is the number of occurrences of keyword i in work order j, and the denominator is the total number of occurrences of all keywords in work order j.
$$IDF_{i} = \log \frac{D}{card(\{j \mid i \in d_j\})}$$
where D is the total number of work orders and card({j | i ∈ d_j}) is the number of work orders that contain keyword i.
The TF-IDF value of each keyword is:
$$TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_{i}$$
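The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes the work orders have already been segmented into words, and the names `tf_idf`, `top_keywords` and the sample `orders` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(orders):
    """orders: dict mapping a work-order id to its list of segmented words (assumed input)."""
    total_orders = len(orders)
    # Document frequency: number of work orders that contain each word.
    doc_freq = Counter()
    for words in orders.values():
        doc_freq.update(set(words))
    scores = {}
    for oid, words in orders.items():
        counts = Counter(words)
        n_words = len(words)
        # TF = occurrences / total words in the order; IDF = log(D / document frequency).
        scores[oid] = {
            w: (c / n_words) * math.log(total_orders / doc_freq[w])
            for w, c in counts.items()
        }
    return scores

def top_keywords(word_scores, k):
    """Return the k highest-scoring words of one work order (ties broken alphabetically)."""
    ranked = sorted(word_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

# Hypothetical, already-segmented work orders.
orders = {
    "O1": ["noise", "night", "noise", "market"],
    "O2": ["fraud", "market"],
    "O3": ["noise", "parking"],
}
scores = tf_idf(orders)
keywords_o1 = top_keywords(scores["O1"], 2)
```

A word that appears in every work order gets an IDF of log(1) = 0, so it is never selected as a keyword, which is the intended effect of the IDF term.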
Take the data set shown in Table 1 as an example.
TABLE 1
OID    Keyword Set
O1     {k1, k2}
O2     {k2, k3, k4, k5}
O3     {k1, k3, k4, k6}
O4     {k2, k1, k3, k4}
O5     {k2, k1, k3, k6}
In Table 1, OID is the complaint work order ID, and Keyword Set is the set of keywords extracted from the work order by the TF-IDF algorithm.
(1.2) constructing a keyword FpTree;
The invention applies the FP-Growth algorithm to natural language processing in order to mine the hot events of each day. FP-Growth stores the data in dedicated data structures, mainly an item header table, the FpTree and node linked lists, which reduces I/O operations and improves efficiency.
FpTree is a tree structure whose nodes are defined as follows:
FpTree node data structure
FpNode {
    idName;                  // id of the item (keyword)
    List<FpNode> children;   // child nodes
    FpNode parent;           // parent node
    FpNode next;             // next node with the same idName
    count;                   // number of occurrences
}
As shown in FIG. 2, constructing the keyword FpTree includes the steps of:
Step 1: assuming a minimum absolute support of 3, scan the data records to generate the frequent 1-item sets of keywords, and sort the items by number of occurrences in descending order, as shown in Table 2:
TABLE 2
Keyword    Count
k1         4
k2         4
k3         4
k4         3
As can be seen, k5 and k6 do not appear in Table 2: k5 occurs only once and k6 only twice in Table 1, both below the minimum support, so they are not frequent items. By the Apriori property, no superset of an infrequent item set can be frequent, so they need not be considered further.
Step 2: scan the data records again. In each record, keep only the items that appear in the table generated in Step 1 and sort them in the order of that table. Initially, create a root node and mark it as null;
1) The first record {k1, k2} is still {k1, k2} after filtering and sorting according to the Step 1 table. Create a node whose idName is k1, insert it under the root node and set its count to 1; then create a k2 node and insert it under the k1 node. The result is shown in (a) of FIG. 2.
2) The second record {k2, k3, k4, k5} becomes {k2, k3, k4} after filtering and sorting. The root node has no child k2 (it has a grandchild k2, but not a child), so create a k2 node and insert it under the root node, which now has two children. Then create a k3 node and insert it under this k2 node, and create a k4 node and insert it under the k3 node. The result is shown in (b) of FIG. 2.
3) The third record {k1, k3, k4, k6} becomes {k1, k3, k4} after filtering and sorting. The root node already has a child k1, so no new node is needed; only the count of the existing k1 node is incremented by 1. Moving down, the k1 node has no child k3, so create a k3 node and insert it under the k1 node, then create a k4 node and insert it under that k3 node. The result is shown in (c) of FIG. 2.
4) The fourth record {k2, k1, k3, k4} becomes {k1, k2, k3, k4} after filtering and sorting. The root node has a child k1, so only the count of the k1 node is incremented by 1. Moving down, the k1 node has a child k2, so no new k2 node is needed and the count of that k2 node is incremented by 1. This k2 node has no child k3, so create a k3 node and insert it under the k2 node, then create a k4 node and insert it under that k3 node. The result is shown in (d) of FIG. 2.
5) The fifth record {k2, k1, k3, k6} becomes {k1, k2, k3} after filtering and sorting. The root node has a child k1, the k1 node has a child k2, and that k2 node has a child k3, so no new node is created and only the counts are updated. The result is shown in (e) of FIG. 2.
6) Following the steps above, the FpTree (Frequent Pattern tree) is essentially complete. Each path in the tree represents an item set. Many item sets share common items, and the more often an item occurs the more likely it is to be shared, so ordering items by descending number of occurrences saves space and achieves compressed storage. In addition, a header table is needed, together with a link that chains all nodes with the same idName, as shown in (f) of FIG. 2.
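The construction walked through above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the patent's code: the class mirrors the FpNode fields, ties between equally frequent keywords are broken alphabetically (which reproduces the order of Table 2), and the names `build_fp_tree`, `header` and `tails` are hypothetical.

```python
from collections import Counter

class FpNode:
    def __init__(self, id_name, parent=None):
        self.id_name = id_name   # keyword id (None for the root)
        self.parent = parent     # parent node
        self.children = {}       # id_name -> child FpNode
        self.next = None         # next node with the same id_name
        self.count = 0           # number of records passing through this node

def build_fp_tree(records, min_support):
    # Pass 1: count each keyword once per record and keep the frequent ones.
    counts = Counter(item for rec in records for item in set(rec))
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # Global order: descending count, ties broken alphabetically (an assumption).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: pos for pos, item in enumerate(ordered)}

    root = FpNode(None)
    header = {}  # id_name -> first node in the same-id chain
    tails = {}   # id_name -> last node in the same-id chain
    # Pass 2: filter and sort each record, then insert it as a path from the root.
    for rec in records:
        items = sorted((i for i in set(rec) if i in rank), key=rank.get)
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FpNode(item, node)
                node.children[item] = child
                if item in tails:
                    tails[item].next = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1
            node = child
    return root, header, frequent

# The records of Table 1, with a minimum absolute support of 3.
records = [
    ["k1", "k2"],
    ["k2", "k3", "k4", "k5"],
    ["k1", "k3", "k4", "k6"],
    ["k2", "k1", "k3", "k4"],
    ["k2", "k1", "k3", "k6"],
]
root, header, frequent = build_fp_tree(records, min_support=3)

# Follow the same-id chain of k4 to collect all of its nodes.
k4_nodes = []
node = header["k4"]
while node is not None:
    k4_nodes.append(node)
    node = node.next
```

With the Table 1 data, the path root, k1, k2, k3 ends up with counts 4, 3 and 2, and k4 appears in three separate branches that are linked through `next`.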
(1.3) mining a keyword frequent item set based on the keyword FPTree;
The FpTree is mined as follows. Mining starts from the frequent patterns of length 1 and can be divided into 3 steps:
1) constructing a Conditional Pattern Base (CPB), which consists of the prefix paths of the item to be mined;
2) constructing the Conditional FP-tree from the conditional pattern base;
3) recursively mining the conditional FP-tree.
Keyword frequent item set mining algorithm:
procedure FP_growth(Tree, α) {
    if Tree contains a single path P {
        for each combination β of the nodes in path P {
            generate pattern β ∪ α with support = the minimum support of the nodes in β
        }
    }
    else {
        for each a_i in the header table of Tree {
            generate pattern β = a_i ∪ α with support = a_i.support
            construct the conditional pattern base of β, then the conditional FP-tree Tree_β of β
            if Tree_β ≠ ∅ then
                call FP_growth(Tree_β, β)
        }
    }
}
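The recursion can be illustrated with a compact Python sketch. It is a simplification under stated assumptions rather than the procedure above verbatim: instead of materializing each conditional FP-tree, it recurses directly on the conditional pattern base, held as a list of (prefix path, count) pairs, and it omits the single-path shortcut. The function name and the sample data are hypothetical; the input records are the filtered and sorted records from the construction walkthrough.

```python
from collections import defaultdict

def fp_growth(pattern_base, min_support, suffix=()):
    """Yield (itemset, support) pairs.

    pattern_base: list of (items, count) pairs; every `items` tuple must follow
    the same global item order (here: descending frequency).
    """
    # Support of each item inside this (conditional) pattern base.
    support = defaultdict(int)
    for items, count in pattern_base:
        for item in items:
            support[item] += count

    for item, sup in support.items():
        if sup < min_support:
            continue
        pattern = (item,) + suffix
        yield frozenset(pattern), sup
        # Conditional pattern base of `item`: the prefix of every path containing it.
        conditional = []
        for items, count in pattern_base:
            if item in items:
                prefix = items[:items.index(item)]
                if prefix:
                    conditional.append((prefix, count))
        # Recurse on the conditional base to extend the pattern.
        yield from fp_growth(conditional, min_support, pattern)

# Filtered, sorted records of Table 1 (k5 and k6 removed), each seen once.
sorted_records = [
    (("k1", "k2"), 1),
    (("k2", "k3", "k4"), 1),
    (("k1", "k3", "k4"), 1),
    (("k1", "k2", "k3", "k4"), 1),
    (("k1", "k2", "k3"), 1),
]
frequent_itemsets = dict(fp_growth(sorted_records, min_support=3))
```

For this data the sketch finds the four single keywords plus the pairs {k1, k2}, {k1, k3}, {k2, k3} and {k3, k4}, each with support 3; no set of three keywords reaches the minimum support.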
(2) Simplifying and classifying the hot events;
Among the mined keyword frequent item sets, several may represent the same kind of hot problem, for example:
{keywordA, keywordB, keywordC}
{keywordA, keywordB, keywordC, keywordD}
Both are keyword frequent item sets, but they are highly similar and represent the same class of hot problem, so the hot events need to be classified.
The most common measure of word vector similarity is cosine similarity, and it is also suitable for computing the similarity between keyword frequent item sets. The cosine similarity of two vectors A and B is:
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
The similarity between a pair of keyword frequent item sets is computed as follows. First, the pair is encoded as One-Hot codes. One-Hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states; each state has its own register bit, and only one bit is active at any time. In other words, One-Hot encoding represents categorical variables as binary vectors: each categorical value is first mapped to an integer value, and each integer value is then represented as a binary vector that is all zeros except for a 1 at the index of that integer. Cosine similarity is then computed on the One-Hot encoded keyword frequent item sets.
Take the keyword frequent item sets A = {keywordA, keywordB, keywordC} and B = {keywordA, keywordB, keywordC, keywordD} as an example. A is encoded as [1,1,1,0] and B is encoded as [1,1,1,1], so the cosine similarity of A and B is:
$$\cos(\theta) = \frac{1 \times 1 + 1 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{3} \times \sqrt{4}} \approx 0.866$$
The closer the cosine value is to 1, the more similar the two vectors are; the closer it is to 0, the less similar they are. The keyword frequent item sets A and B are therefore quite similar.
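The worked example can be reproduced with a short Python sketch. The helper names `one_hot` and `cosine_similarity` are illustrative, and the vocabulary order is an assumption; any fixed order gives the same similarity.

```python
import math

def one_hot(itemset, vocabulary):
    """Binary vector with a 1 for every vocabulary word present in the item set."""
    return [1 if word in itemset else 0 for word in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty item sets
    return dot / (norm_a * norm_b)

vocabulary = ["keywordA", "keywordB", "keywordC", "keywordD"]
set_a = {"keywordA", "keywordB", "keywordC"}
set_b = {"keywordA", "keywordB", "keywordC", "keywordD"}

vec_a = one_hot(set_a, vocabulary)  # [1, 1, 1, 0]
vec_b = one_hot(set_b, vocabulary)  # [1, 1, 1, 1]
similarity = cosine_similarity(vec_a, vec_b)  # 3 / (sqrt(3) * sqrt(4)), about 0.866
```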
Hot event classification is based on cosine similarity. Because it is not known in advance how many hot events will occur on a given day when that day's complaint hot events are mined, the invention uses a simple and effective method for processing the keyword frequent item sets, which specifically comprises the following steps:
(2.1) setting a plurality of hot spot classification marks, wherein each hot spot classification mark comprises a plurality of keywords;
(2.2) calculating the cosine similarity between the keywords of each hot spot classification mark and the keyword frequent item set;
(2.3) finding the hot spot classification mark with the largest cosine similarity to the keyword frequent item set, and labeling the keyword frequent item set with that mark.
The pseudo-code of the algorithm is as follows:
Keyword frequent item set classification algorithm:
Input: keyword frequent item sets (keywordset)
Output: keyword frequent item sets with classification labels
hotlist is the list of hot spot classification marks
hotlist.add(keywordset(1))
keywordset.remove(keywordset(1))
foreach item in keywordset
    if the cosine similarity between item and every keyword frequent item set in hotlist is less than the threshold ∂
        hotlist.add(item)
    else
        find the index of the hot spot in hotlist with the largest cosine similarity to item, and label item with that index
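The classification loop can be sketched in Python as below. Several details are assumptions made for illustration: the threshold value 0.5, the sample keyword sets, and the names `set_cosine` and `classify` are not taken from the patent. For binary vectors the cosine similarity reduces to |A ∩ B| / sqrt(|A| × |B|), which the sketch uses directly on sets.

```python
import math

def set_cosine(a, b):
    """Cosine similarity of the binary vectors of two keyword sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def classify(keyword_sets, threshold):
    hotlist = []  # representative keyword set of each hot spot class
    labels = []   # labels[i] is the hotlist index assigned to keyword_sets[i]
    for item in keyword_sets:
        sims = [set_cosine(item, hot) for hot in hotlist]
        if not sims or max(sims) < threshold:
            # Not similar enough to any known hot spot: start a new class.
            hotlist.append(item)
            labels.append(len(hotlist) - 1)
        else:
            # Label with the most similar existing hot spot.
            labels.append(sims.index(max(sims)))
    return hotlist, labels

# Hypothetical frequent keyword sets mined for one day.
keyword_sets = [
    {"keywordA", "keywordB", "keywordC"},
    {"keywordA", "keywordB", "keywordC", "keywordD"},
    {"keywordX", "keywordY"},
    {"keywordX", "keywordY", "keywordZ"},
]
hotlist, labels = classify(keyword_sets, threshold=0.5)
```

Because the first item set always opens the first class, the result depends on the order in which item sets are processed; sorting them by support before classification is one possible way to make the outcome more stable.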
(3) Retrieving the work orders that contain the hot event keywords with a hash index method that can match multiple keywords;
Because public opinion hot event mining is based on keyword frequent item sets, the specific work orders must be found by matching multiple keywords.
In a conventional relational database, searching for keywords in massive data takes a great deal of time. With indexing technology, adding a full-text index to the text data can greatly improve search efficiency.
The invention adopts multi-keyword retrieval and borrows the idea of hashing: the text database of mayor's hotline complaint work orders is preprocessed into a hash table that maps each keyword to the corresponding work order numbers, and the work orders are then retrieved by their unique primary key, which greatly improves retrieval efficiency.
A hash table, also called a hash map, is an improvement on direct addressing. With hashing, an element with keyword k is stored at position h(k); that is, a hash function h computes the slot position from the keyword k. The function h maps the key space onto the slots of the hash table T[0...m-1]. The hash function h may map two different keys to the same slot, which is called a collision; in databases, collisions are typically resolved by chaining. With chaining, the elements that hash to the same slot are placed in a linked list, as shown in FIG. 3.
The storage structure is a triple consisting of the hash function value, the set of work order id numbers, and the pointer used to resolve collisions.
For example, for the keyword frequent item set {k1, k2}, the work order id set found under the hash value h(k1) is {id1, id2} and the work order id set found under h(k2) is {id2, id3}, so the work orders that contain both k1 and k2 are
{id1, id2} ∩ {id2, id3} = {id2}.
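A minimal Python sketch of this lookup follows. It uses Python's built-in `dict` as the hash table, so collision handling happens inside the dict rather than through the explicit chains of FIG. 3; the function names and sample data are hypothetical.

```python
from collections import defaultdict

def build_keyword_index(order_keywords):
    """Map each keyword to the set of work-order ids that contain it."""
    index = defaultdict(set)
    for order_id, keywords in order_keywords.items():
        for keyword in keywords:
            index[keyword].add(order_id)
    return index

def search(index, keywords):
    """Return the ids of the work orders that contain every given keyword."""
    id_sets = [index.get(keyword, set()) for keyword in keywords]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)

# Work orders matching the example: k1 -> {id1, id2}, k2 -> {id2, id3}.
order_keywords = {
    "id1": ["k1"],
    "id2": ["k1", "k2"],
    "id3": ["k2"],
}
index = build_keyword_index(order_keywords)
matches = search(index, ["k1", "k2"])
```

The index is built once in a preprocessing pass; each multi-keyword query then costs one hash lookup per keyword plus a set intersection, instead of a scan over all work order texts.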
(4) Extracting summaries of the retrieved work orders with an improved TextRank algorithm, and feeding back the most important information in the work orders;
After the work orders of a hot event have been retrieved, automatic summary extraction is applied to their content, so that the focus of the hot event can be grasped effectively and intelligent decisions can be made.
Automatic summary extraction is an important branch of natural language processing. The current mainstream approach is graph-based automatic text summarization, and its most representative algorithm is TextRank.
The idea of the TextRank algorithm is to assign each sentence a positive real number that represents its importance. The higher the TextRank value, the more important the sentence, and the higher it is likely to rank when sentences are ordered for the automatic summary.
A text containing several sentences is modeled as a directed graph in which the nodes are sentences and each edge carries a transition probability given by the similarity between the 2 sentences. A walker jumps from one sentence to the next with that transition probability, and such random jumps continue between sentences; the process forms a first-order Markov chain. After enough jumps the Markov chain reaches a stationary distribution. TextRank is this stationary distribution, and the TextRank value of each sentence is its stationary probability.
The formula for TextRank is as follows:
$$TR(V_i) = \frac{1-d}{n} + d \sum_{V_j \in In(V_i)} \frac{TR(V_j)}{|Out(V_j)|}$$
where TR(V_i) is the rank value of node V_i, In(V_i) is the set of predecessor nodes of V_i, Out(V_j) is the set of successor nodes of V_j, n is the number of sentences, and d is the damping coefficient.
The core of the TextRank algorithm is the similarity calculation that yields the edge weights of the graph. Sentence similarity is handled by converting the sentences into sentence vectors that a machine can process and then computing the similarity between those vectors. The original TextRank algorithm is not ideal in how it computes edge weight similarity or in how it assigns initial node weights when the graph model is built. The invention therefore uses the TF-IDF score as the initial weight of each node, uses the embedding layer of a BERT model, shown in FIG. 4, to turn each sentence into a numeric 768-dimensional sentence vector, uses the cosine similarity between sentences as the edge weights, and finally performs the TextRank iteration. The algorithm steps are shown in FIG. 5 and specifically comprise the following:
(4.1) processing the work order content into a text containing a plurality of sentences; converting the sentence into a sentence vector which can be understood by a machine;
(4.2) calculating cosine similarity between sentence vectors to obtain a similarity matrix as edge weight; adopting TF-IDF score as an initial weight value;
(4.3) carrying out iteration of the TextRank, and calculating the TextRank value of each sentence to obtain a sentence rank; and extracting the automatic abstracts according to the sentence ranking.
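Steps (4.1) to (4.3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the sentence vectors (for example 768-dimensional BERT embeddings) and the per-sentence TF-IDF scores are assumed to be computed elsewhere and passed in, negative cosine similarities are clipped to zero, and the function names, the damping value 0.85 and the convergence tolerance are all hypothetical choices.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between sentence vectors (step 4.2)."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0           # avoid division by zero for empty vectors
    unit = v / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)          # no self-loops in the sentence graph
    return np.clip(sim, 0.0, None)      # assumption: drop negative similarities


def textrank(sim, init_weights, d=0.85, tol=1e-8, max_iter=200):
    """Iterate TextRank scores starting from the given initial weights (step 4.3)."""
    n = sim.shape[0]
    out_sum = sim.sum(axis=1, keepdims=True)
    out_sum[out_sum == 0.0] = 1.0       # isolated sentences pass on no score
    transition = sim / out_sum          # row j: probabilities of jumping from j
    scores = np.asarray(init_weights, dtype=float)
    total = scores.sum()
    # Normalise the initial weights; fall back to uniform if they are all zero.
    scores = scores / total if total > 0 else np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1.0 - d) / n + d * (transition.T @ scores)
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:
            break
    return scores


def extract_summary(sentences, vectors, init_weights, top_k=2):
    """Return the top_k highest-ranked sentences in their original order."""
    scores = textrank(cosine_similarity_matrix(vectors), init_weights)
    top = sorted(np.argsort(-scores)[:top_k])
    return [sentences[i] for i in top]


if __name__ == "__main__":
    sentences = ["s0", "s1", "s2"]
    vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]   # toy stand-ins for BERT vectors
    tfidf = [0.5, 0.3, 0.2]                           # toy stand-ins for TF-IDF scores
    print(extract_summary(sentences, vectors, tfidf))
```

With a damping coefficient below 1 the iteration converges to the same ranking regardless of the starting point, so in this sketch the TF-IDF initial weights mainly affect how quickly the scores settle.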
(5) Hotline staff can learn of each day's public opinion hot events in a timely manner and report them to the relevant departments;
The work order data are stored in MySQL. The decision support system mainly uses the title and complaint content of each work order: the data are fed into the model center, hot events are obtained through analysis, and the hot events are stored in the database. In the report center, hot events can be dragged in a self-service manner to form daily, weekly and monthly reports. Staff can define their own report templates from the summary information retrieved by hot-event keywords and push the reports to the relevant functional departments, which then carry out rectification or follow-up investigation of the problems reflected by public opinion.
As shown in FIG. 6, the mayor's hotline public opinion decision support system based on natural language processing according to the invention comprises six layers: a base layer, a data layer, a support layer, an application layer, a service layer and a user layer.
The base layer provides the hardware facilities for implementing the project, such as the machine room and the network environment.
The data layer is divided into a basic library and an intelligent library. The basic library holds primary data, including original work order text information, geographical position information and the like; the intelligent library holds secondary data, i.e. processed databases such as the hotline dictionary library and the statistical word library.
The support layer provides algorithm and application services; the algorithms used by the invention include FP-Growth, hotspot problem classification, information retrieval and automatic abstract extraction.
The application layer provides the application services, including intelligent public opinion supervision, intelligent public sentiment perception and intelligent decision support.
The service layer is divided into two terminals: a web end and a mobile end.
The user layer comprises leaders, service personnel and operation and maintenance personnel.
The invention uses natural language processing technology to develop a public opinion decision support system for the mayor's hotline. Through machine learning algorithms, the system automatically mines public opinion hotspots every day and raises public opinion alarms when a set threshold is exceeded, thereby supporting staff in their decisions. The automatic abstract extraction method helps staff grasp the most central and important problems among a large number of related work orders, and makes it convenient for them to write daily, weekly and monthly public opinion reports and push these to the relevant departments themselves.
The decision support system reduces the time staff need to find group events among massive numbers of complaint work orders from the original 2 days to real time, which greatly improves staff efficiency and speeds up the handling of public problems. After the system is deployed, mayor's hotline staff can monitor public opinion information in real time; a series of livelihood problems, such as disorderly management of residential communities, fraud on WeChat platforms and market activities disturbing residents, are discovered in time through the system, and active countermeasures are taken.

Claims (7)

1. A mayor's hotline public opinion decision support method based on natural language processing, characterized by comprising the following steps:
(1) mining daily hot events of the mayor's hotline based on natural language processing;
performing word segmentation and keyword extraction on the complaint work orders to construct a keyword FP-tree; constructing a conditional pattern base, wherein the conditional pattern base is the prefix path of the item set to be mined; constructing a conditional FP-tree, and recursively mining frequent keyword item sets, namely hot events, on the conditional FP-tree;
(2) simplifying and classifying the hot events based on cosine similarity;
(3) retrieving the complaint work orders containing the keywords of a hot event by a hash index method with multi-keyword matching;
(4) extracting abstracts of the retrieved complaint work orders by an improved TextRank algorithm;
processing the work order content into a text containing a plurality of sentences, and converting the sentences into machine-readable sentence vectors; calculating the cosine similarity between sentence vectors to obtain a similarity matrix used as the edge weights; adopting the TF-IDF score as the initial weight value; performing TextRank iteration, and calculating the TextRank value of each sentence to obtain a sentence ranking; automatically extracting the abstract according to the sentence ranking;
(5) the staff learn of the daily hot events from the summary report and report the hot events to the relevant departments.
2. The mayor's hotline public opinion decision support method based on natural language processing as claimed in claim 1, wherein in step (1),
the keywords are extracted by the TF-IDF algorithm.
3. The mayor's hotline public opinion decision support method based on natural language processing as claimed in claim 1, wherein in step (1), constructing the keyword FP-tree comprises the steps of:
(1.1) setting a minimum absolute support, scanning the data records, generating the first-level frequent item set of keywords, and sorting the first-level frequent item set by number of occurrences from high to low;
(1.2) scanning the data records again, and sorting the keywords of each record that belong to the first-level frequent item set generated in step (1.1) according to the order of step (1.1).
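The two scans of steps (1.1) and (1.2) can be sketched in Python as follows. This is a hedged illustration only: the function names, the use of keyword lists as records and the alphabetical tie-break are assumptions made for the example, and the later stages of FP-Growth (tree insertion, conditional pattern bases, recursive mining) are not shown.

```python
from collections import Counter


def frequent_keywords(records, min_support):
    """First scan: count in how many records each keyword occurs, keep those
    meeting the minimum absolute support, and order them by count, high to low."""
    counts = Counter(kw for record in records for kw in set(record))
    frequent = {kw: c for kw, c in counts.items() if c >= min_support}
    # Ties are broken alphabetically so that the order is deterministic.
    return sorted(frequent, key=lambda kw: (-frequent[kw], kw))


def reorder_records(records, ordered_keywords):
    """Second scan: drop infrequent keywords from each record and sort the
    remaining ones in the global order produced by the first scan."""
    rank = {kw: i for i, kw in enumerate(ordered_keywords)}
    return [
        sorted((kw for kw in set(record) if kw in rank), key=rank.__getitem__)
        for record in records
    ]


if __name__ == "__main__":
    records = [
        ["noise", "market", "night"],
        ["noise", "market"],
        ["noise", "parking"],
    ]
    order = frequent_keywords(records, min_support=2)
    print(order)
    print(reorder_records(records, order))
```

Sorting every record in the same global order is what lets records with common frequent keywords share a prefix path when they are later inserted into the FP-tree.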
4. The mayor's hotline public opinion decision support method based on natural language processing as claimed in claim 1, wherein step (2) specifically comprises the steps of:
(2.1) setting a plurality of hotspot classification labels, wherein each hotspot classification label comprises a plurality of keywords;
(2.2) calculating the cosine similarity between all keywords of each hotspot classification label and the hot event;
(2.3) finding, among the hotspot classification labels, the label with the maximum cosine similarity to the hot event, and assigning that hotspot classification label to the hot event.
5. The mayor's hotline public opinion decision support method based on natural language processing according to claim 4, wherein the cosine similarity calculation comprises the steps of:
1) processing the keyword frequent item sets into One-Hot codes;
One-Hot encoding represents categorical variables as binary vectors: the categorical values are first mapped to integer values, and each integer value is then represented as a binary vector;
2) performing the cosine similarity calculation on the One-Hot encoded keyword frequent item sets.
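The classification described in claims 4 and 5 can be sketched in Python as follows. This is an illustration under assumptions, not the claimed implementation: each keyword set is encoded as a binary vector over a shared vocabulary (the sum of the one-hot vectors of its keywords), and the function names, labels and keywords are hypothetical.

```python
import math


def one_hot(keywords, vocabulary):
    """Binary vector with a 1 for every vocabulary word present in keywords."""
    present = set(keywords)
    return [1 if word in present else 0 for word in vocabulary]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_event(event_keywords, label_keywords):
    """Return the label whose keyword set is most similar to the hot event,
    together with the similarity score of every label."""
    vocabulary = sorted(set(event_keywords).union(*label_keywords.values()))
    event_vec = one_hot(event_keywords, vocabulary)
    scores = {
        label: cosine(event_vec, one_hot(words, vocabulary))
        for label, words in label_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    labels = {
        "noise": ["noise", "night", "construction"],
        "fraud": ["fraud", "WeChat", "payment"],
    }
    print(classify_event(["construction", "noise", "residents"], labels))
```

For binary vectors the cosine similarity reduces to the number of shared keywords divided by the geometric mean of the two set sizes, so labels with many overlapping keywords score highest.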
6. The mayor's hotline public opinion decision support method based on natural language processing as claimed in claim 1, wherein in step (3),
the text database of mayor's hotline complaint work orders is preprocessed to obtain a hash table mapping keywords to work order numbers, and the work orders are then retrieved through multi-keyword search using the unique primary key of each work order.
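The keyword-to-work-order hash table of claim 6 can be sketched in Python as follows. This is an illustrative sketch only: the work order identifiers, the function names and the choice to require all query keywords to match (an AND search) are assumptions, since the text does not state how multiple keywords are combined.

```python
from collections import defaultdict


def build_keyword_index(work_orders):
    """Preprocess work orders into a hash table: keyword -> set of order numbers.

    work_orders maps each unique work order number (the primary key) to the
    list of keywords extracted from that order."""
    index = defaultdict(set)
    for order_id, keywords in work_orders.items():
        for keyword in keywords:
            index[keyword].add(order_id)
    return index


def search(index, keywords):
    """Return the numbers of the work orders that contain every query keyword."""
    postings = [index.get(keyword, set()) for keyword in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)


if __name__ == "__main__":
    orders = {
        "WO-1": ["noise", "market"],
        "WO-2": ["noise", "parking"],
        "WO-3": ["market", "noise", "night"],
    }
    index = build_keyword_index(orders)
    print(sorted(search(index, ["noise", "market"])))
```

Each keyword lookup is a constant-time hash access on average, so the cost of a query depends on the sizes of the matching sets rather than on the total number of work orders.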
7. A mayor's hotline public opinion decision support system based on natural language processing, which adopts the mayor's hotline public opinion decision support method based on natural language processing according to any one of claims 1 to 6, characterized by comprising a base layer, a data layer, a support layer, an application layer, a service layer and a user layer;
the base layer comprises a machine room and a network environment;
the data layer comprises a basic library and an intelligent library; the basic library holds primary data, namely original work order text information; the intelligent library holds secondary data and is a processed database;
the support layer provides algorithm and application services, and comprises FP-Growth, hotspot problem classification, information retrieval and automatic abstract extraction;
the application layer comprises intelligent public opinion supervision, intelligent public sentiment perception and intelligent decision support;
the service layer comprises a web end and a mobile end;
the user layer comprises government leaders, service personnel and operation and maintenance personnel.
CN202110440120.3A 2021-04-23 2021-04-23 Market leader hot line and public opinion decision support method and system based on natural language processing Active CN112860906B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110440120.3A CN112860906B (en) 2021-04-23 2021-04-23 Market leader hot line and public opinion decision support method and system based on natural language processing

Publications (2)

Publication Number Publication Date
CN112860906A CN112860906A (en) 2021-05-28
CN112860906B true CN112860906B (en) 2021-07-16

Family

ID=75992807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110440120.3A Active CN112860906B (en) 2021-04-23 2021-04-23 Market leader hot line and public opinion decision support method and system based on natural language processing

Country Status (1)

Country Link
CN (1) CN112860906B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312532B (en) * 2021-06-01 2022-10-21 哈尔滨工业大学 Public opinion grade prediction method based on deep learning and oriented to public inspection field
CN113254755B (en) * 2021-07-19 2021-10-08 南京烽火星空通信发展有限公司 Public opinion parallel association mining method based on distributed framework
CN114510566B (en) * 2021-11-29 2023-07-07 上海市黄浦区城市运行管理中心(上海市黄浦区城市网格化综合管理中心、上海市黄浦区大数据中心) Method and system for mining, classifying and analyzing hotword based on worksheet
CN114492434B (en) * 2022-01-27 2022-10-11 圆通速递有限公司 Intelligent waybill number identification method based on waybill number automatic identification model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815709A (en) * 2016-12-06 2017-06-09 国网福建省电力有限公司 One kind service quick response center support system and method
CN110888970A (en) * 2019-11-29 2020-03-17 腾讯科技(深圳)有限公司 Text generation method, device, terminal and storage medium
CN110990676A (en) * 2019-11-28 2020-04-10 福建亿榕信息技术有限公司 Social media hotspot topic extraction method and system
CN112560445A (en) * 2020-12-05 2021-03-26 上饶市中科院云计算中心大数据研究院 Method and device for detecting hot line hot spot appeal topics of captain
CN112685555A (en) * 2019-10-17 2021-04-20 中国移动通信集团浙江有限公司 Complaint work order quality detection method and device

Also Published As

Publication number Publication date
CN112860906A (en) 2021-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant